Discussion:
Obtaining high IOPs numbers with SSDs
Ian Collins
2013-08-03 08:44:33 UTC
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.

The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
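
In outline, the test loop is roughly the following (a sketch rather than my exact
code; the file path, span and run time are placeholders, and error handling is
trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   4096                    /* I/O size */
#define SPAN    (1024LL * 1024 * 1024)  /* size of the pre-created test file */
#define SECONDS 60

int main(void)
{
    char buf[BLOCK];
    long long ios = 0;
    int fd = open("/testpool/bench/file", O_WRONLY | O_SYNC);  /* placeholder path */

    if (fd < 0) { perror("open"); return 1; }
    srand48(getpid());
    time_t end = time(NULL) + SECONDS;
    while (time(NULL) < end) {
        /* block-aligned random offset within the file */
        off_t off = (off_t)(lrand48() % (SPAN / BLOCK)) * BLOCK;
        /* random payload so compression can't flatter the result */
        for (size_t i = 0; i < BLOCK; i += sizeof (long)) {
            long r = lrand48();
            memcpy(buf + i, &r, sizeof (r));
        }
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); break; }
        ios++;
    }
    printf("%lld IOs, %.0f IOPS, %.1f MB/s\n", ios,
        (double)ios / SECONDS, (double)ios * BLOCK / SECONDS / 1048576);
    close(fd);
    return 0;
}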

The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?

I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
--
Ian.
Saso Kiselkov
2013-08-03 08:56:19 UTC
Post by Ian Collins
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
How many concurrent threads were you running and what was your queue
depth? SSDs like a lot of parallelism, so if you were running
single-threaded with a shallow queue depth (optimized for hard drives),
then chances are you were not saturating it.
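
For illustration, a multi-threaded variant of such a test could look roughly
like this (an untested sketch; the path, span and thread count are placeholders,
and lrand48() isn't thread-safe, which doesn't matter for a rough number):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK    4096
#define SPAN     (1024LL * 1024 * 1024)
#define NTHREADS 32                      /* writers kept in flight */
#define SECONDS  60

static long long counts[NTHREADS];

static void *writer(void *arg)
{
    long id = (long)arg;
    char buf[BLOCK];
    int fd = open("/testpool/bench/file", O_WRONLY | O_SYNC);  /* placeholder path */

    if (fd < 0)
        return NULL;
    memset(buf, (int)id, BLOCK);         /* payload content doesn't matter here */
    time_t end = time(NULL) + SECONDS;
    while (time(NULL) < end) {
        off_t off = (off_t)(lrand48() % (SPAN / BLOCK)) * BLOCK;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK)
            break;
        counts[id]++;
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    long long total = 0;

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
        total += counts[i];
    }
    printf("%lld IOs, ~%.0f IOPS aggregate\n", total, (double)total / SECONDS);
    return 0;
}

With a device that can actually take parallel writes, the aggregate figure
should climb well past the single-threaded number; if it stays pinned near 5k,
something in the path is serializing the writes.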

Cheers,
--
Saso
Ian Collins
2013-08-03 09:47:20 UTC
Post by Saso Kiselkov
Post by Ian Collins
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
How many concurrent threads were you running and what was your queue
depth? SSDs like a lot of parallelism, so if you were running
single-threaded with a shallow queue depth (optimized for hard drives),
then chances are you were not saturating it.
I guess the queue depth is one as I wait for each write to complete. I
tried multiple threads, but the 5K limit looks like a brick wall. I
also tried striping two log devices, still 5K.
--
Ian.
Ian Collins
2013-08-03 10:00:06 UTC
Post by Ian Collins
Post by Saso Kiselkov
Post by Ian Collins
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
How many concurrent threads were you running and what was your queue
depth? SSDs like a lot of parallelism, so if you were running
single-threaded with a shallow queue depth (optimized for hard drives),
then chances are you were not saturating it.
I guess the queue depth is one as I wait for each write to complete. I
tried multiple threads, but the 5K limit looks like a brick wall. I
also tried striping two log devices, still 5K.
I'll have to take back my comment regarding threads...

I was paying too much attention to the numbers shown by zpool iostat and
not trusting my own counters. I was surprised to see iostat reporting
significantly higher numbers for the log device alone than for the
pool. With 64 threads I see about 650 IOPS per thread, which is about 40K in
total.

Why are the numbers reported by zpool iostat significantly lower? Command queuing?

Thanks,
--
Ian.
Saso Kiselkov
2013-08-05 10:17:44 UTC
Post by Ian Collins
Post by Ian Collins
Post by Saso Kiselkov
Post by Ian Collins
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
How many concurrent threads were you running and what was your queue
depth? SSDs like a lot of parallelism, so if you were running
single-threaded with a shallow queue depth (optimized for hard drives),
then chances are you were not saturating it.
I guess the queue depth is one as I wait for each write to complete. I
tried multiple threads, but the 5K limit looks like a brick wall. I
also tried striping two log devices, still 5K.
I'll have to take back my comment regarding threads...
I was paying too much attention to the numbers shown by zpool iostat and
not trusting my own counters. I was surprised to see iostat reporting
significantly higher numbers for the log device alone than for the
pool. With 64 threads I see about 650 IOPS per thread, which is about 40K in
total.
Why are the numbers reported by zpool iostat significantly lower? Command queuing?
My guess is you didn't modify recordsize to 4k, so you're seeing ZFS
aggregate several 4k writes into a larger block. The ZIL doesn't do that.

Cheers,
--
Saso
Ian Collins
2013-08-07 22:10:07 UTC
Post by Saso Kiselkov
Post by Ian Collins
I'll have to take back my comment regarding threads...
I was paying too much attention to the numbers shown by zpool iostat and
not trusting my own counters. I was surprised to see iostat reporting
significantly higher numbers for the log device alone than for the
pool. With 64 threads I see about 650 IOPS per thread, which is about 40K in
total.
Why are the numbers reported by zpool iostat significantly lower? Command queuing?
My guess is you didn't modify recordsize to 4k, so you're seeing ZFS
aggregate several 4k writes into a larger block. The ZIL doesn't do that.
I considered that, but does ZFS aggregate sync writes?
--
Ian.
Timothy Coalson
2013-08-03 16:06:01 UTC
I've been comparing the differences in throughput and IOP performance for
a number of potential log device SSDs. The numbers I've obtained make me
wonder how the manufacturers obtain their random write numbers.
They don't use a filesystem. Thus, no cache flush commands and no filesystem
code (checksums, etc.) to delay things.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and throughput.
I ran this test on a single device as a pool and with the device as a log
on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000 figure
is a hardware limit of the motherboard. I get 250K IOPs running the test
in /tmp, so I'm pretty sure the test is pushing the drives as hard as it
can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the on-board
SATA on a consumer (Z77) motherboard! On both the systems I tested (a
Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave better numbers
than an LSI 9211 SAS card.
The 3700 is a SATA drive, according to their site. To connect it to SAS,
some protocol translation occurs, which can't be free. It might be
interesting to know how different your numbers are, though.

Tim
Richard Elling
2013-08-05 13:21:31 UTC
I've been comparing the differences in throughput and IOP performance for a number of potential log device SSDs. The numbers I've obtained make me wonder how the manufacturers obtain their random write numbers.
Prior to SSS-PTS, vendors chose the best combination of workload and architecture to get
the best number. SSS-PTS brings two important changes:

1. Preconditioning tests are run until the SSD demonstrates that it is not gaining advantage
by being new -- tests start after the SSD has been completely filled with random data and
shows consistent response times

2. No single number is provided as the result. Rather, a full characterization matrix of
mixed workload types is shown. For example, the number of outstanding I/Os (threads),
read/write ratio, and I/O size are varied. The vendor can then choose to publish the
results based on their choice of thread count.

SSS-PTS version 1.0 was pretty simple and easily coded into vdbench or iometer profiles.
It is evolving into separate client and enterprise versions -- version 1.1 is much more
complicated, IMHO overly complex for little gain. In any case, the method is sound and
the characterization data is useful for systems engineering.
The synthetic benchmark I used was to open a file with O_SYNC and write random blocks of random data for a minute and sum the IOs and throughput. I ran this test on a single device as a pool and with the device as a log on a small (stripe of 4 mirrors) pool.
Did you happen to collect the actual I/O data as seen by the SSD? I'm not sure what OS you
are using, but if you have dtrace, then I find iosnoop (OSX, illumos) or scsi.d to be useful.
For iosnoop, I use the -Dast flags, then post-process in a data analytics tool.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Richard Yao
2013-08-06 05:35:46 UTC
Post by Ian Collins
I've been comparing the differences in throughput and IOP performance
for a number of potential log device SSDs. The numbers I've obtained
make me wonder how the manufacturers obtain their random write numbers.
The synthetic benchmark I used was to open a file with O_SYNC and write
random blocks of random data for a minute and sum the IOs and
throughput. I ran this test on a single device as a pool and with the
device as a log on a small (stripe of 4 mirrors) pool.
The highest IOPs figure I obtained was about 5000 4k writes on an Intel
3700. Intel claim something like 40,000. I'm pretty sure the 5000
figure is a hardware limit of the motherboard. I get 250K IOPs running
the test in /tmp, so I'm pretty sure the test is pushing the drives as
hard as it can. So what do manufacturers use to generate their numbers?
I was also a little surprised that the best numbers came from the
on-board SATA on a consumer (Z77) motherboard! On both the systems I
tested (a Gigabyte Z77 and a Supermicro X9DRH), the on-board SATA gave
better numbers than an LSI 9211 SAS card.
Someone in IRC also reported a 5,000 IOPS barrier a few months ago. He
was using ZFSOnLinux on Fusion-IO hardware, which should have been
capable of far higher IOPS. It would be nice to understand the reason
for the discrepancy.
Bob Friesenhahn
2013-08-07 20:00:21 UTC
Post by Richard Yao
Someone in IRC also reported a 5,000 IOPS barrier a few months ago. He
was using ZFSOnLinux on Fusion-IO hardware, which should have been
capable of far higher IOPS. It would be nice to understand the reason
for the discrepancy.
With a single I/O thread and non-overlapping I/Os, it is likely
determined entirely by the request/response latency, which is some
fixed value unrelated to total I/O throughput capacity. It may still
be an accurate representation of the I/O possible given a single
client application.
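
For a rough sense of the arithmetic (the latency figure here is purely
hypothetical, just to illustrate): if each synchronous 4k write takes
~200 us end to end, then

    queue depth 1:  1 write / 200 us = ~5,000 IOPS
    40,000 IOPS at that same latency needs ~40,000 * 200 us = 8 writes in flight

so the headline numbers are only reachable with several requests outstanding.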

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
jason matthews
2013-08-08 17:24:02 UTC
With a single I/O thread and non-overlapping I/Os, it is likely determined entirely by the request/response latency, which is some fixed value unrelated to total I/O throughput capacity. It may still be an accurate representation of the I/O possible given a single client application.
I agree with this. I did extensive benchmarking of Intel 910 SSDs with filebench and pgbench and came to the conclusion that the 910 ate parallelized queries for breakfast.

https://broken.net/zfs/intel-910-ssd-zfs-benchmark-results-on-openindiana-151a1-for-8k-iops-using-filebench-and-pgbench/

j.

jason matthews
2013-08-07 18:52:20 UTC
Why use an MLC/SLC when you can use DRAM? If the s3700 is in your price range then the DDRdrive X1 might be as well. I use these and they work well. You may want to consider them.


I have also noticed that my benchmark numbers are nowhere near the drive manufacturers' numbers. My 2.5 inch SSD configuration is always an LSI 9207/9211 with Intel or Micron SSDs, depending on the application. I use filebench to generate my numbers and figured that was more portable than making my own.

I have no idea what they use to benchmark their drives; it is a good question though.

j.
Timothy Coalson
2013-08-07 20:41:31 UTC
Post by jason matthews
Why use an MLC/SLC when you can use DRAM? If the s3700 is in your price
range then the DDRdrive X1 might be as well. I use these and they work
well. You may want to consider them.
I have also noticed that my benchmark numbers are nowhere near the drive
manufacturers' numbers. My 2.5 inch SSD configuration is always an LSI 9207/9211 with
Intel or Micron SSDs, depending on the application. I use filebench to
generate my numbers and figured that was more portable than making my own.
I have no idea what they use to benchmark their drives; it is a good question though.
AS-SSD is often used by SSD reviewers at least (though I believe it is
Windows only), but even before SSDs, hard drive benchmarks didn't use a
filesystem; they used the raw block device (filebench is meant to benchmark
filesystems, not raw hardware). Which filesystem is used on top of a block
device is a variable that the hardware manufacturer can't and shouldn't
account for in their hardware performance numbers.

Tim
Ian Collins
2013-08-07 20:43:29 UTC
Post by jason matthews
Why use an MLC/SLC when you can use DRAM? If the s3700 is in your price range then the DDRdrive X1 might be as well. I use these and they work well. You may want to consider them.
Well I guess they only cost an order of magnitude more than a 100G
s3700! Do they fit in a 2U chassis?

The figures they claim for 4K random write (here:
http://www.ddrdrive.com/performance.html) are only double those of the
100G s3700, and similar to the 200G model.
Post by jason matthews
I have also noticed that my benchmark numbers are nowhere near the drive manufacturers' numbers. My 2.5 inch SSD configuration is always an LSI 9207/9211 with Intel or Micron SSDs, depending on the application. I use filebench to generate my numbers and figured that was more portable than making my own.
I have no idea what they use to benchmark their drives; it is a good question though.
According to the s3700 data sheet:

"Performance measured using Iometer* with Queue Depth 32. Measurements
are performed on a full Logical Block Address (LBA) span of the drive."
--
Ian.
Bob Friesenhahn
2013-08-07 22:09:59 UTC
Post by jason matthews
Why use an MLC/SLC when you can use DRAM? If the s3700 is in your price
range then the DDRdrive X1 might be as well. I use these and they work
well. You may want to consider them.
Well I guess they only cost an order of magnitude more than a 100G s3700! Do
they fit in a 2U chassis?
The figures they claim for 4K random write (here:
http://www.ddrdrive.com/performance.html) are only double those of the 100G
s3700, and similar to the 200G model.
It may be that one figure reflects readily achievable performance whereas the
other is "peak" performance only available under specially constructed,
extraordinary conditions.
Post by jason matthews
I have also noticed that my benchmark numbers are nowhere near the drive
manufacturers' numbers. My 2.5 inch SSD configuration is always an LSI 9207/9211 with
Intel or Micron SSDs, depending on the application. I use filebench to
generate my numbers and figured that was more portable than making my own.
I have no idea what they use to benchmark their drives; it is a good question though.
"Performance measured using Iometer* with Queue Depth 32. Measurements are
performed on a full Logical Block Address (LBA) span of the drive."
It is not normal for I/O to be ideally distributed across the whole
LBA span of the drive. This seems like a bogus benchmark for most
purposes.

The DDRdrive X1 is DRAM-based, so it is not likely to be sensitive to
write locations.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Timothy Coalson
2013-08-07 23:36:39 UTC
On Wed, Aug 7, 2013 at 5:09 PM, Bob Friesenhahn <
Post by Ian Collins
Post by jason matthews
I have also noticed that my benchmark numbers are nowhere near the
drive manufacturers' numbers. My 2.5 inch SSD configuration is always
an LSI 9207/9211 with Intel or Micron SSDs, depending on the application. I use
filebench to generate my numbers and figured that was more portable than
making my own.
I have no idea what they use to benchmark their drives; it is a good question though.
"Performance measured using Iometer* with Queue Depth 32. Measurements
are performed on a full Logical Block Address (LBA) span of the drive."
It is not normal for I/O to be ideally distributed across the whole LBA
span of the drive. This seems like a bogus benchmark for most purposes.
Random distribution across all LBAs is generally the worst case for
SSD steady-state IOPS performance; less-random workloads of the same record
size should fare better (they make garbage collection easier). I would not
call it "bogus" to report worst-case performance.

Whether they waited for the drive to reach that steady state is not made
clear in the spec sheet. However, some reviews of the s3700 suggest
that they did wait for steady state before reporting those numbers
(though it appears Intel didn't precondition the drive with random writes
before doing the random read test):

http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/3

Scroll down, and they have a linear-scale graph in the section on steady-state
random write performance.

Tim
Matthew Mattoon
2013-08-08 14:04:25 UTC
What would be the benefit to using DDR in L2ARC instead of in the ARC?

Also, assuming that you are looking to use this as part of the SLOG
instead of the L2ARC, how would the DDRdrive X1 ensure that data that
is written is persistent across a reboot (in the case of a sudden
loss of power)? It seems like one of the main lures of ZFS is data
consistency; if that is compromised then I am not sure what we are doing
here.

Could be that I am just misunderstanding the premise, so if that is the
case please let me know.


Matthew Mattoon
All Angles IT

***@allanglesit.com
http://blog.allanglesit.com
Post by jason matthews
Why use an MLC/SLC when you can use DRAM? If the s3700 is in your price range then the DDRdrive X1 might be as well. I use these and they work well. You may want to consider them.
I have also noticed that my benchmark numbers are nowhere near the drive manufacturers' numbers. My 2.5 inch SSD configuration is always an LSI 9207/9211 with Intel or Micron SSDs, depending on the application. I use filebench to generate my numbers and figured that was more portable than making my own.
I have no idea what they use to benchmark their drives; it is a good question though.
j.
Bob Friesenhahn
2013-08-08 14:22:38 UTC
Post by Matthew Mattoon
What would be the benefit to using DDR in L2ARC instead of in the ARC?
There would be virtually no benefit to using the DDRdrive X1 for L2ARC. This
would be just an extreme waste of money as compared to buying an
equivalent amount of additional system RAM.
Post by Matthew Mattoon
Also, assuming that you are looking to use this as part of the SLOG
instead of the L2ARC, how would the DDRdrive X1 ensure that data that
is written is persistent across a reboot (in the case of a sudden
loss of power)? It seems like one of the main lures of ZFS is data
consistency; if that is compromised then I am not sure what we are doing
here.
An external dedicated UPS + wall-wart or a super-cap is used to
provide enough time for the DDRdrive to persist DRAM content to
flash. Both of these have the appearance of add-ons, although the
super-cap is attached somewhere inside the chassis.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/