Help with tuning for SSD performance?  
Hi All,

I'm hoping someone can help with some advice on tuning our driver and filesystem parameters to optimise write performance 
with SSD SATA drives.

We have acquired some OCZ Colossus SSD drives to help boost the disk performance of some of our deployed systems. These 
drives have a 3.5" form factor and a standard SATA connection to the system. Reviews (like 
http://www.pcper.com/article.php?aid=821) suggest that these drives are capable of read and write performance of over 
200 MBytes/s.

When connected to our deployed QNX systems (QNX 6.3.2A) these drives are recognised as regular SATA drives (devb-ahci) 
and we can format and mount them as we would expect (although of course stuck with qnx4fs as we are on 6.3.2).

Testing shows that we are getting good read speeds off these drives - around or above 100 MByte/s. This is good enough 
for our requirements (although still less than half their theoretical capability).

However, initial write testing showed we were only getting around 6-8 MByte/s for writes. We have started to play with 
various options to devb-ahci and have managed to get the write speed up to about 16 MByte/s, but it has been a very 
unscientific process as far as which parameters we should be changing and which would have the most impact. Often the 
performance moves in the opposite direction to what we expect :-(

Ideally we need to get the write speed up to at least 40 MByte/s.

Note also that we want to optimise performance for small numbers of very large files (1 GB+) which get written once, 
read two or three times, then erased. Caching of the file content is not practical (or necessary).

So my questions are:
- Does anyone have any suggestions for parameters to tune? Ideally someone who has played with similar drives and has 
some cook-book recipes they can share?
- If not, what are your opinions on the important parameters to play with? Cache size seems to have the most effect so 
far - a smaller cache means faster writes.
- Would the situation be substantially improved by going to 6.4.1? 
- Would the situation be substantially improved by going to 6.4.1? 

Any thoughts/ideas/suggestions would be gratefully accepted.

Thanks,

Rob Rutherford
Ruzz TV
Re: Help with tuning for SSD performance?  
Robert Rutherford wrote:
> I'm hoping someone can help with some advice on tuning our driver and filesystem parameters to optimise write 
performance with SSD SATA drives.

You don't provide many actual h/w or configuration details.  Those sorts
of write speeds would seem to point to a h/w issue (a sloginfo would
indicate the UDMA mode in use, etc).  And what is your benchmarking test
(rd/wr how, what record size, what total size, etc)?


 > So my questions are:
 > - Does anyone have any suggestions for parameters to tune? Ideally
 > someone else has played with similar drives and has some cook-book
 > recipes they can share?

Sure, I will wander off into reporting my own SSD benchmarking, as the
numbers are so much better than what you are seeing!  This is with 6.4.2,
on a 3 GHz x86, ICH7, with vanilla "devb-eide" (EIDE emulation of SATA
drives) and no options (which ends up with a 75 MB buffer cache).  On
another machine I run devb-ahci in native AHCI mode, and see similar SSD
numbers.  Anyway ...

I have benchmarked near-quoted performance of the OCZ Vertex SSD.

     Sequential File Write/Read Benchmark
     OS:      QNX 6.4.2 x86pc
     Filesys: OCZ-VERTEX, UDMA5, SSD, blk-partition, 100% full
     Config:  512MiB file, 64KiB record, fd, fsync, pregrown, malloc
     Write:     3433 msec,  419 usec/write(),  39% CPU, 152.71 MiB/sec
     Read:      2373 msec,  289 usec/read(),   25% CPU, 220.93 MiB/sec

But further investigation suggests it obtains that write performance
via runtime compression (I had thought that was only done by the SandForce
and not the Indilinx controller, but my numbers would suggest otherwise),
as using a random record data pattern reduces the throughput considerably.

     Sequential File Write/Read Benchmark
     OS:      QNX 6.4.2 x86pc
     Filesys: OCZ-VERTEX, UDMA5, SSD, blk-partition, 100% full
     Config:  512MiB file, 64KiB record, fd, fsync, pregrown, malloc
     Write:     5247 msec,  640 usec/write(),  25% CPU, 99.92 MiB/sec
     Read:      2739 msec,  334 usec/read(),   21% CPU, 191.41 MiB/sec

Putting a filesystem on top of that is another slight decrease,
although fsq4 and fsq6 are fairly similar in performance.

     Sequential File Write/Read Benchmark
     OS:      QNX 6.4.2 x86pc
     Filesys: OCZ-VERTEX, UDMA5, SSD, qnx4, 1% full
     Config:  512MiB file, 64KiB record, fd, fsync, malloc
     Create:       0 msec
     Write:     5559 msec,  678 usec/write(),  22% CPU, 94.31 MiB/sec
     Read:      2637 msec,  321 usec/read(),   23% CPU, 198.81 MiB/sec
     Delete:       5 msec

     Sequential File Write/Read Benchmark
     OS:      QNX 6.4.2 x86pc
     Filesys: OCZ-VERTEX, UDMA5, SSD, qnx6, 4% full
     Config:  512MiB file, 64KiB record, fd, fsync, malloc
     Create:       0 msec
     Write:     5226 msec,  637 usec/write(),  31% CPU, 100.32 MiB/sec
     Read:      2879 msec,  351 usec/read(),   23% CPU, 182.10 MiB/sec
     Delete:       9 msec

Repeated runs are +/- 10MB/s, so reasonably consistent.

As an aside, the Intel X25-M seems to read slightly faster (I saw 240 raw),
but artificially limits its write throughput to 80 MB/s (although it seems
able to maintain that rate consistently).  I have read that other
Windows-based benchmarking reached the same conclusion.

Make sure you have the latest drive firmware (this really helps!).
There is some talk that aligning your partition helps (faking the geometry,
using an LBA fdisk, or just not partitioning at all), but I
haven't really noticed that having much effect - the SSD controllers
are pretty smart these days.  TRIM can be useful, but it is hard to
balance the cost/reward of when to do it (most SSD vendor websites
have a Windows tool you can download to erase the device - have
you tried restoring one blank with that?).


 > - If not, what are the opinions on the important parameters to play
 > with? Cache size seems to have most effect so far - smaller cache
 > means faster...
Re: Help with tuning for SSD performance?  
Hi John,

I'm afraid our benchmarking isn't very scientific - basically it's a combination of 'dd' and 'cp -V'. What tool do you 
use? Is it publicly available?

However, we have made a little progress. Our root problem now looks like it may be DMA-related. What we are seeing is 
that immediately after a reboot, the writes are "slow" (8 MByte/s) and during the write we appear to be CPU-bound. This 
would seem to indicate DMA not being used, right? However, after some (apparently random?) amount of time, the writes 
suddenly speed up to around 50-60 MByte/s and the CPU usage drops significantly. 

It's almost like the driver thinks it can't use DMA and then later decides it can. Is this even possible?

The test machine is an x86 with a Core 2 Duo and ICH7 with 4 GB of RAM. The boot drive is an older IDE drive. I am 
currently running without SMP to take some variables out of the equation, but we saw the same problem with SMP enabled.

I have attached the output of "sloginfo" and "pci -v", also the version of devb-eide. Can you have a look at the 
sloginfo to see if anything stands out?

Thanks again,

Rob
Attachment: Text ssdinfo.tgz 3.29 KB
Re: Help with tuning for SSD performance?  
Robert Rutherford wrote:
> I'm afraid our benchmarking isn't very scientific - basically it's a combination of 'dd' and 'cp -V'. What tool do you
 use? Is it publicly available?

The problem with dd and cp is that they are neither a producer nor
consumer of the data, they just move it, so get/put the content from
somewhere else, and so you double-bill the filesystem for the
associated msg-pass/context-switch costs.  And 'cp' uses a 4k buffer,
so you're just showing micro-kernel costs of the overhead of moving
small amounts of data around (although dd at least has bs=).  Try
something like iozone.  If I get less busy I'll post my 'rw' binary,
which has lots of configuration options and reporting.

> However, we have made a little progress. Our root problem now looks like it may be DMA-related. What we are seeing is 
that immediately after a reboot, the writes are "slow" (8 MByte/s) and during the write we appear to be CPU-bound. This 
would seem to indicate DMA not being used, right? However, after some (apparently random?) amount of time, the writes 
suddenly speed up to around 50-60 MByte/s and the CPU usage drops significantly. 

No, that to me sounds like the cost of the memmgr performing 4k mmaps,
which is a well-known issue (it's why you always want to do a few hundred
MB of dcheck or something first, to allocate/warm up the data structures).
Various control structures are needed, and they are allocated by the
memchunk allocator on demand in 4k units; a thousand might be needed,
and the 6.3 VM has some scaling issues here.  You'll notice 'pidin mem'
slowly growing to its full size during this setup phase.  This is
also tied in, I think, with your earlier report of the seemingly
anomalous less-cache-goes-faster behaviour.

This would be one advantage of going to the later releases: the 4k cache
page needs 1/8th of the control structures to begin with, and the new
slab allocator also supports the alloc=upfront option, doing a single
large mmap that is carved up internally, as opposed to taking single
pages from the OS (an optimisation due to the above observation).

> It's almost like the driver thinks it can't use DMA and then later decides it can. Is this even possible?

No.  It can go the other way (DMA timeouts or CRC errors can cause a
fallback to PIO, but from there it never goes back to DMA).  Your sloginfo
seems to indicate UDMA5 is being used.  Our h/w expert would have to
comment on anything else in the pci or slog output.

> The test machine is an x86 with Core 2 Duo and ICH7 with 4GB of RAM. The boot drive is an older IDE drive.

A better setup than the one my SSD figures were obtained from.  And I read
that the Colossus would be expected to be better than my Vertex.  So I
guess it might be worthwhile trying 6.4.1 and/or the headbranch fsys, to
see if that gets you nearer what I measure?  But try 64k-128k IO
operations first.

> I have attached the output of "sloginfo" and "pci -v", also the version of devb-eide.

In your first post you said you were using devb-ahci?  Now devb-eide?