Discussion:
all ssd pool
Richard Kojedzinszky
2014-05-06 09:32:08 UTC
Permalink
And also a performance and reliability test would worth it.
(https://github.com/rkojedzinszky/zfsziltest)

I would be interested in comparing it to an Intel SSD DC3700, which has a
very impressive performance, and with Intel's promise, its endurance is
comparable to SLC based SSDs. And the cost is very reasonable.

Kojedzinszky Richard

On Tue, 6 May 2014, Steven Hartland wrote:

> I can't really comment on OI but we have quite a bit of experience of all SSD
> pools under FreeBSD.
>
> The biggest issue is single strength when going though expanders when using
> 6Gbps devices. We've tested a number of chassis with hotswap backplanes
> which have turned out to have bad signal strength which results in unstable
> devices which will drop under load.
>
> Once you have a setup which is confirmed to have good signaling then things
> become a lot easier.
>
> I cant say I've used Seagate SSD's as we mainly use consumer grade disks
> which have served us well for what we do.
>
> One thing that may be an issue is SSD's generally require TRIM support to
> remain performant. Currently OI doesn't have TRIM support for ZFS where
> as FreeBSD does, which myself and other actively maintain so it maybe
> something worth considering.
>
> FW is also very important, particularly when it comes to TRIM support so
> I'd definitely recommend testing a single disk before buying in bulk.
>
> Regards
> Steve
>
>
> ----- Original Message ----- From: "Luke Iggleden" <***@lists.illumos.org>
> To: <***@lists.illumos.org>
> Sent: Tuesday, May 06, 2014 8:45 AM
> Subject: [zfs] all ssd pool
>
>
>> Hi All,
>>
>> We're looking at deploying an all SSD pool with the following hardware:
>>
>> Dual Node
>>
>> Supermicro SSG-2027B-DE2R24L
>> (includes LSI 2308 Controller)
>> 128GB RAM per node
>> 24 x Seagate PRO 600 480GB SSD
>>
>> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
>> RSF-1 High Availability Suite to failover between nodes
>> Open Indiana or Omni OS
>>
>> My question really relates to the issues with SATA on SAS expanders and ZFS
>> and are modern LSI interposers with this combo working ok now with the
>> mpt_sas driver?
>>
>> I've seen some posts on forums which suggest that a couple of interposers
>> have died and have crashed the mpt_sas driver due to resets, but I'm
>> wondering if that is related to the bug in illumos which crashes the
>> mpt_sas driver (illumos bugs 4403, 4682 & 4819)
>>
>> https://www.illumos.org/issues/4403
>> ‹https://www.illumos.org/issues/4682
>> ‹https://www.illumos.org/issues/4819
>>
>> If LSI interposers are a no go, has anyone got these (or other) SATA SSD's
>> running on supermicro SAS2 expanders and getting a reliable platform,
>> specifically when a SSD dies or performance is at max?
>>
>> A few years ago we were burned by putting Hitachi 7200rpm SATA disks on an
>> expander, this was before most of the posts about 'sata on sas DONT!' posts
>> came out. That was 2009/10 then, so things could have changed?
>>
>> Also, there were some other posts suggesting that the WWN for SSD's with
>> LSI interposers were not being passed through, but it was suggested that
>> this was an issue with the SSD and not the interposer.
>>
>> Thanks in advance.
>>
>>
>> Luke Iggleden
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed:
>> https://www.listbox.com/member/archive/rss/182191/24401717-fdfe502b
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/25402478-0858cafa
> Modify Your Subscription:
> https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
Steven Hartland
2014-05-06 09:11:43 UTC
Permalink
I can't really comment on OI but we have quite a bit of experience of all SSD
pools under FreeBSD.

The biggest issue is single strength when going though expanders when using
6Gbps devices. We've tested a number of chassis with hotswap backplanes
which have turned out to have bad signal strength which results in unstable
devices which will drop under load.

Once you have a setup which is confirmed to have good signaling then things
become a lot easier.

I cant say I've used Seagate SSD's as we mainly use consumer grade disks
which have served us well for what we do.

One thing that may be an issue is SSD's generally require TRIM support to
remain performant. Currently OI doesn't have TRIM support for ZFS where
as FreeBSD does, which myself and other actively maintain so it maybe
something worth considering.

FW is also very important, particularly when it comes to TRIM support so
I'd definitely recommend testing a single disk before buying in bulk.

Regards
Steve


----- Original Message -----
From: "Luke Iggleden" <***@lists.illumos.org>
To: <***@lists.illumos.org>
Sent: Tuesday, May 06, 2014 8:45 AM
Subject: [zfs] all ssd pool


> Hi All,
>
> We're looking at deploying an all SSD pool with the following hardware:
>
> Dual Node
>
> Supermicro SSG-2027B-DE2R24L
> (includes LSI 2308 Controller)
> 128GB RAM per node
> 24 x Seagate PRO 600 480GB SSD
>
> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
> RSF-1 High Availability Suite to failover between nodes
> Open Indiana or Omni OS
>
> My question really relates to the issues with SATA on SAS expanders and ZFS and are modern LSI interposers with this combo
> working ok now with the mpt_sas driver?
>
> I've seen some posts on forums which suggest that a couple of interposers have died and have crashed the mpt_sas driver due to
> resets, but I'm wondering if that is related to the bug in illumos which crashes the mpt_sas driver (illumos bugs 4403, 4682 &
> 4819)
>
> https://www.illumos.org/issues/4403
> 
https://www.illumos.org/issues/4682
> 
https://www.illumos.org/issues/4819
>
> If LSI interposers are a no go, has anyone got these (or other) SATA SSD's running on supermicro SAS2 expanders and getting a
> reliable platform, specifically when a SSD dies or performance is at max?
>
> A few years ago we were burned by putting Hitachi 7200rpm SATA disks on an expander, this was before most of the posts about
> 'sata on sas DONT!' posts came out. That was 2009/10 then, so things could have changed?
>
> Also, there were some other posts suggesting that the WWN for SSD's with LSI interposers were not being passed through, but it
> was suggested that this was an issue with the SSD and not the interposer.
>
> Thanks in advance.
>
>
> Luke Iggleden
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/24401717-fdfe502b
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
>
Schweiss, Chip
2014-05-06 12:53:47 UTC
Permalink
On Tue, May 6, 2014 at 2:45 AM, Luke Iggleden <***@lists.illumos.org> wrote:

> Hi All,
>
> We're looking at deploying an all SSD pool with the following hardware:
>
> Dual Node
>
> Supermicro SSG-2027B-DE2R24L
> (includes LSI 2308 Controller)
> 128GB RAM per node
> 24 x Seagate PRO 600 480GB SSD
>
> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
> RSF-1 High Availability Suite to failover between nodes
> Open Indiana or Omni OS
>
>
I would add that the Supermicro JBODs do not do well in an HA
environment. I have frequently had expanders go out to lunch when two
hosts are connected to them. It took full power cycles to bring them back.
Multipath with one host works nicely.

DataON JBODs have worked the best for me with HA setups.

Either way if your using interposers, they are your weak link and building
an HA setup is not going to help. Been there, tried like hell to make it
work. It's much more stable as a single host.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Luke Iggleden
2014-05-06 07:45:16 UTC
Permalink
Hi All,

We're looking at deploying an all SSD pool with the following hardware:

Dual Node

Supermicro SSG-2027B-DE2R24L
(includes LSI 2308 Controller)
128GB RAM per node
24 x Seagate PRO 600 480GB SSD

24 x LSI interposers (sata > sas) ?? (maybe, see post)
RSF-1 High Availability Suite to failover between nodes
Open Indiana or Omni OS

My question really relates to the issues with SATA on SAS expanders and
ZFS and are modern LSI interposers with this combo working ok now with
the mpt_sas driver?

I've seen some posts on forums which suggest that a couple of
interposers have died and have crashed the mpt_sas driver due to resets,
but I'm wondering if that is related to the bug in illumos which crashes
the mpt_sas driver (illumos bugs 4403, 4682 & 4819)

https://www.illumos.org/issues/4403

https://www.illumos.org/issues/4682

https://www.illumos.org/issues/4819

If LSI interposers are a no go, has anyone got these (or other) SATA
SSD's running on supermicro SAS2 expanders and getting a reliable
platform, specifically when a SSD dies or performance is at max?

A few years ago we were burned by putting Hitachi 7200rpm SATA disks on
an expander, this was before most of the posts about 'sata on sas DONT!'
posts came out. That was 2009/10 then, so things could have changed?

Also, there were some other posts suggesting that the WWN for SSD's with
LSI interposers were not being passed through, but it was suggested that
this was an issue with the SSD and not the interposer.

Thanks in advance.


Luke Iggleden
Keith Wesolowski
2014-05-06 16:37:33 UTC
Permalink
On Tue, May 06, 2014 at 05:45:16PM +1000, Luke Iggleden wrote:

> We're looking at deploying an all SSD pool with the following hardware:
>
> Dual Node
>
> Supermicro SSG-2027B-DE2R24L
> (includes LSI 2308 Controller)
> 128GB RAM per node
> 24 x Seagate PRO 600 480GB SSD
>
> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
> RSF-1 High Availability Suite to failover between nodes
> Open Indiana or Omni OS
>
> My question really relates to the issues with SATA on SAS expanders and
> ZFS and are modern LSI interposers with this combo working ok now with
> the mpt_sas driver?

You're nuts. The only thing worse than SATA behind SAS expanders is
SATA with interposers behind SAS expanders. This configuration barely
worked for us at Fishworks with spinning disks; we frequently had
problems getting it to work with SSDs at all (and that was with a
helpful, cooperative, motivated vendor in STEC; we used ZeusIOPS).

If you want to use SAS expanders, you need real SAS end devices. An
alternative is to do SATA (no interposers) with direct-attach. We've
had reasonable success with that configuration at Joyent using Intel
DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
haven't used the Seagate "PRO" model you're considering; the only
Seagate device I've evaluated was the Pulsar.2, which worked. I never
recommend SATA, but if SAS just isn't an option, this is the way to go.

> A few years ago we were burned by putting Hitachi 7200rpm SATA disks on
> an expander, this was before most of the posts about 'sata on sas DONT!'
> posts came out. That was 2009/10 then, so things could have changed?

Nope.

> Also, there were some other posts suggesting that the WWN for SSD's with
> LSI interposers were not being passed through, but it was suggested that
> this was an issue with the SSD and not the interposer.

It could be either; there's a specification (SAT) that covers how mode
pages and VPD are to be translated. If the end device doesn't provide
the necessary data to translate, you won't get the right thing. If the
interposer has a buggy or incomplete SATL, you won't get the right
thing. ISTR having SAS addresses bound to the interposer rather than
the end device itself, but I could be misremembering; our focus was on
getting to an all-SAS solution, not dealing with the madness of
interposers.
Schweiss, Chip
2014-05-06 20:42:39 UTC
Permalink
On Tue, May 6, 2014 at 11:37 AM, Keith Wesolowski <***@lists.illumos.org>wrote:

> On Tue, May 06, 2014 at 05:45:16PM +1000, Luke Iggleden wrote:
>
> You're nuts. The only thing worse than SATA behind SAS expanders is
> SATA with interposers behind SAS expanders. This configuration barely
> worked for us at Fishworks with spinning disks; we frequently had
> problems getting it to work with SSDs at all (and that was with a
> helpful, cooperative, motivated vendor in STEC; we used ZeusIOPS).
>
> If you want to use SAS expanders, you need real SAS end devices. An
> alternative is to do SATA (no interposers) with direct-attach. We've
> had reasonable success with that configuration at Joyent using Intel
> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
> haven't used the Seagate "PRO" model you're considering; the only
> Seagate device I've evaluated was the Pulsar.2, which worked. I never
> recommend SATA, but if SAS just isn't an option, this is the way to go.
>
>
If you think your going to get a highly available system with SATA SSD on
Illumos you may be nuts, but I would argue that SATA SSDs have their
place. Considering you can get some very good performance at 1/4 the
price of a SAS SSD, the effort has its merits.

The best use case I have found thus far is for high speed scratch space.
If there is a storage failure it only means restarting some batch
processing, never will it cause data loss.

With Illumos kernels there seems to be lots of problems with SATA
handling. Even when direct connected, without a SAS expander had the
mpt_sas driver become completely wedged once. I have heard that Oracle has
since fixed this. On other platforms SATA behind a SAS expander is
considered perfectly stable. It more a matter of finding all the problem
cases and gracefully dealing with them.

I've moved nearly 1PB on and off my pool of Samsung 840 Pro SSDs without a
hiccup. I do expect to have hiccups, but the hiccups are much less costly
than SAS SSD.

One approach I am working on is watching system logs at real time and doing
a zpool offline at the first hint that a SATA SSD is having an issue.
This will put the pool in a degraded state but stop ZFS from trying to read
or write from the troubled SSD. This approach works very well on spinning
SAS disk that start having a sector problem. Once I see my next SSD or
interposer problem I will know how well it works. This should already be
better handled in the fault manager, but is seems to take a complete
failure of a disk to give up on it.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Elling via illumos-zfs
2014-05-06 23:59:38 UTC
Permalink
On May 6, 2014, at 1:42 PM, Schweiss, Chip <***@lists.illumos.org> wrote:

> On Tue, May 6, 2014 at 11:37 AM, Keith Wesolowski <***@lists.illumos.org> wrote:
> On Tue, May 06, 2014 at 05:45:16PM +1000, Luke Iggleden wrote:
>
> You're nuts. The only thing worse than SATA behind SAS expanders is
> SATA with interposers behind SAS expanders. This configuration barely
> worked for us at Fishworks with spinning disks; we frequently had
> problems getting it to work with SSDs at all (and that was with a
> helpful, cooperative, motivated vendor in STEC; we used ZeusIOPS).
>
> If you want to use SAS expanders, you need real SAS end devices. An
> alternative is to do SATA (no interposers) with direct-attach. We've
> had reasonable success with that configuration at Joyent using Intel
> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
> haven't used the Seagate "PRO" model you're considering; the only
> Seagate device I've evaluated was the Pulsar.2, which worked. I never
> recommend SATA, but if SAS just isn't an option, this is the way to go.
>
>
> If you think your going to get a highly available system with SATA SSD on Illumos you may be nuts, but I would argue that SATA SSDs have their place. Considering you can get some very good performance at 1/4 the price of a SAS SSD, the effort has its merits.
>
> The best use case I have found thus far is for high speed scratch space. If there is a storage failure it only means restarting some batch processing, never will it cause data loss.
>
> With Illumos kernels there seems to be lots of problems with SATA handling. Even when direct connected, without a SAS expander had the mpt_sas driver become completely wedged once. I have heard that Oracle has since fixed this. On other platforms SATA behind a SAS expander is considered perfectly stable.

Unfortunately, many of the issues we've seen in SATA+SAS+expander cases are due to bugs
in the SATA disks, expander firmware, and HBA firmware -- nothing to do with the OS. Do you
know that it is possible for a disk to fail, take out all of the expanders and prevent the servers
from passing POST? Been there. Seen that. Got the scar and a T-shirt.

The horror stories you often hear from Solaris/illumos folks are because we've been doing these
things for 10+ years and have been scarred many times. Claiming it is all due to an OS or
OS-level drivers is simply not true. Similarly, fixing many of these problems is not possible in an OS.
These can be complicated systems, so proper systems engineering and supplier management
is crucial to delivering reliable services.

What we can say is that more modern systems tend to be better than the older systems.
For example, the 6G SAS parts are a helluva lot better than anything in the 3G genre.
We're cautiously optimistic about the 12G parts, but they bring other constraints that are
not always well managed in the supply chain. To wit, the faster you go, the more susceptible
you are to cabling issues, including noise. The basic SAS/SATA protocols won't help
because they were designed in world where "interconnects are reliable." So when you order
cables, do you ask for the signal integrity specs? If not, why not?

For a good time, google "lsi expander firmware release notes" look at some of the bugs fixed
and be glad you're running ZFS! :-)
-- richard

--

***@RichardElling.com
+1-760-896-4422












-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-05-07 00:08:21 UTC
Permalink
----- Original Message -----
From: "Richard Elling via illumos-zfs" <***@lists.illumos.org>

> For a good time, google "lsi expander firmware release notes" look at some
> of the bugs fixed and be glad you're running ZFS! :-)

I think you may be putting too much stock in correctness of SAS drives / SAS
controller firmware and drivers, I've been there and debugged some horrific
bugs in SAS which weren't present in the cheeper SATA options.

So while there are features of enterprises SAS drivers which are arguably
better than their consumer SATA counterparts its not all roses, so its a
matter of picking which one works for you at the price you can afford.

Regards
Steve
Richard Elling via illumos-zfs
2014-05-07 00:17:39 UTC
Permalink
On May 6, 2014, at 5:08 PM, Steven Hartland <***@multiplay.co.uk> wrote:

> ----- Original Message ----- From: "Richard Elling via illumos-zfs" <***@lists.illumos.org>
>
>> For a good time, google "lsi expander firmware release notes" look at some
>> of the bugs fixed and be glad you're running ZFS! :-)
>
> I think you may be putting too much stock in correctness of SAS drives / SAS
> controller firmware and drivers, I've been there and debugged some horrific
> bugs in SAS which weren't present in the cheeper SATA options.

I don't disagree.

> So while there are features of enterprises SAS drivers which are arguably
> better than their consumer SATA counterparts its not all roses, so its a
> matter of picking which one works for you at the price you can afford.

For the unfortunates who must mix the two (SAS + SATA) they get twice the
complexity plus the inherent mismatch of the two protocols. Use SAS when you
need it. Use SATA if you can. Avoid mixing the two and live a happier life.
-- richard

--

***@RichardElling.com
+1-760-896-4422












-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-07 02:07:41 UTC
Permalink
On 7/05/2014 2:37 am, Keith Wesolowski wrote:
> If you want to use SAS expanders, you need real SAS end devices. An
> alternative is to do SATA (no interposers) with direct-attach. We've
> had reasonable success with that configuration at Joyent using Intel
> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
> haven't used the Seagate "PRO" model you're considering; the only
> Seagate device I've evaluated was the Pulsar.2, which worked. I never
> recommend SATA, but if SAS just isn't an option, this is the way to go.

Do you remember what storage bays you used at Joyent?

Using something like the intel 24 port jbod as Chip suggested, means we
have to use 6 x External SAS cables connected to a single host. Not
ideal, but I suppose we could make that work if we could get a 'yes,
illumos and sata directly connected is fine' Seems that isn't the case
either with others noting that a disk can bring down the whole zpool.

If we use sata direct connect with an external storage bay, then we lose
the ability to provide a fail over mechanism if we need to upgrade oi or
if it crashes? I don't like 3am runs to the DC any more and I don't
really want to be thinking about what ifs before I go to sleep at night ;)

Seems everywhere you turn, there is a gotchya with this. I'd love to be
able to go straight to some SAS SSD's, but the reality is the cost per
GB is Double and the performance of the flash does not scale with the
Dollar.
Ian Collins via illumos-zfs
2014-05-07 02:15:07 UTC
Permalink
Luke Iggleden via illumos-zfs wrote:
> Seems everywhere you turn, there is a gotchya with this. I'd love to be
> able to go straight to some SAS SSD's, but the reality is the cost per
> GB is Double and the performance of the flash does not scale with the
> Dollar.

It appears to scale inversely! If you exclude specialised units, the
best performing "Enterprise" SSDs are SATA.

I've often wondered why there is such a huge difference in price between
SATA and SAS SSDs while the gap is almost noise for spinning rust.

--
Ian.
Richard Elling via illumos-zfs
2014-05-07 17:51:53 UTC
Permalink
On May 6, 2014, at 7:15 PM, Ian Collins via illumos-zfs <***@lists.illumos.org> wrote:

> Luke Iggleden via illumos-zfs wrote:
>> Seems everywhere you turn, there is a gotchya with this. I'd love to be
>> able to go straight to some SAS SSD's, but the reality is the cost per
>> GB is Double and the performance of the flash does not scale with the
>> Dollar.
>
> It appears to scale inversely! If you exclude specialised units, the best performing "Enterprise" SSDs are SATA.

I think you meant to say "exclude specialized, super-fast units (PCIe, NVMe), the best performing "Enterprise" SSDs
are SAS3." :-) Independent tests show SAS3 SSDs are about half the latency of SATA SSDs.
http://www.storagereview.com/hgst_ultrastar_ssd800mm_sas3_enterprise_ssd_review
http://www.storagereview.com/micron_m500dc_enterprise_ssd_review

>
> I've often wondered why there is such a huge difference in price between SATA and SAS SSDs while the gap is almost noise for spinning rust.

Priced as the market will bear. But there is also market segmentation. For example, the products developed
by Intel/Micron are SATA while the SAS equivalent (similar?) products are HGST/Intel. Same flash, different
controller, different market, different performance, different price.

The flash SSD vendors are becoming more consistent in segmenting product lines by endurance. However,
this is not something you can measure easily (at all?), and is difficult to track in your production systems.
-- richard

--

***@RichardElling.com
+1-760-896-4422












-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Ian Collins via illumos-zfs
2014-05-08 00:05:23 UTC
Permalink
Richard Elling via illumos-zfs wrote:
> On May 6, 2014, at 7:15 PM, Ian Collins via illumos-zfs
> <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>
>> Luke Iggleden via illumos-zfs wrote:
>>> Seems everywhere you turn, there is a gotchya with this. I'd love to be
>>> able to go straight to some SAS SSD's, but the reality is the cost per
>>> GB is Double and the performance of the flash does not scale with the
>>> Dollar.
>>
>> It appears to scale inversely! If you exclude specialised units, the
>> best performing "Enterprise" SSDs are SATA.
>
> I think you meant to say "exclude specialized, super-fast units (PCIe,
> NVMe), the best performing "Enterprise" SSDs
> are SAS3." :-) Independent tests show SAS3 SSDs are about half the
> latency of SATA SSDs.
> http://www.storagereview.com/hgst_ultrastar_ssd800mm_sas3_enterprise_ssd_review
> http://www.storagereview.com/micron_m500dc_enterprise_ssd_review
>

Interesting, I guess because I currently only use SSDs for log devices
I've been focusing on write IOPs performance. There is an interesting
comparison of the Intel 3700 with SAS devices towards the end here:

http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

although I guess there are newer and better SAS devices on the market now.

--
Ian.
Luke Iggleden
2014-05-07 23:35:57 UTC
Permalink
If you are running an all SSD pool would you bother about a ZIL? Most of
the blogs / info out there is related to hybrid pools with rust.

At first I thought it probably wouldn't be needed, but if we're running
sync=always on our datasets, we could look at using a ZIL to take the
writes off the vdevs and increase the lifespan of them?

Our individual drives are capable of delivering 36k write iops (4k
random) and 90k read.

The work load is primarily read based, virtual machine block storage,
some SQL transactions for Mysql/Postgres (e-commerce) and a massive
(2tb) MS-MSQL DB that is mainly read hungry.

My concern is a ZIL will slow down the writes with sync on the datasets.
What would you guys roll with or what have you tried (hopefully) and had
success with - in an all SSD pool?
Garrett D'Amore via illumos-zfs
2014-05-08 02:51:29 UTC
Permalink
I can see an argument for an SLOG even with an all SSD pool *if* you had a configuration where the SLOG was something along the lines of a ZeusRAM or a DDRdrive or FusionIO.  In this case the decreased latencies are still going to be a net win, but this is a very high performance and very spendy configuration, I think.

For “normal users”, if you’re already on SSD, there is no benefit to moving your ZIL to an SLOG.

(As an exercise, it would be interesting to see just how much faster a PCIe RAM based flash device would have to be to make such a configuration worthwhile, given “typical” SSD performance numbers.  This is a cache analysis problem, I think. :-)

-- 
Garrett D'Amore
Sent with Airmail

On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:

If you are running an all SSD pool would you bother about a ZIL? Most of
the blogs / info out there is related to hybrid pools with rust.

At first I thought it probably wouldn't be needed, but if we're running
sync=always on our datasets, we could look at using a ZIL to take the
writes off the vdevs and increase the lifespan of them?

Our individual drives are capable of delivering 36k write iops (4k
random) and 90k read.

The work load is primarily read based, virtual machine block storage,
some SQL transactions for Mysql/Postgres (e-commerce) and a massive
(2tb) MS-MSQL DB that is mainly read hungry.

My concern is a ZIL will slow down the writes with sync on the datasets.
What would you guys roll with or what have you tried (hopefully) and had
success with - in an all SSD pool?




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 11:06:44 UTC
Permalink
On Wed, May 7, 2014 at 9:51 PM, Garrett D'Amore via illumos-zfs <
***@lists.illumos.org> wrote:

> I can see an argument for an SLOG even with an all SSD pool *if* you had a
> configuration where the SLOG was something along the lines of a ZeusRAM or
> a DDRdrive or FusionIO. In this case the decreased latencies are still
> going to be a net win, but this is a very high performance and very spendy
> configuration, I think.
>
> For “normal users”, if you’re already on SSD, there is no benefit to
> moving your ZIL to an SLOG.
>

I run a small SSD pool (10 400gb ZeusIOPs in 2 zvols raidz1) for a VMware
datastore, the pool kept having really bad latency issues. Turning off ZIL
I could read/write anything up to the 10Gb nic limit. I added a single
ZeusRAM and I now get the nearly the same performance as without ZIL.

I even made sure the cache was set to non-volatile in sd.conf. That helped
a little but it didn't take much I/O to get the latencies high. Setting
logbias to throughput made things worse. No tuning seemed to help much,
but the log device fixed the problem. Even adding the same ZeusIOPS SSD
as a log device significantly improved throughput and latencies.

-Chip



> (As an exercise, it would be interesting to see just how much faster a
> PCIe RAM based flash device would have to be to make such a configuration
> worthwhile, given “typical” SSD performance numbers. This is a cache
> analysis problem, I think. :-)
>
> --
> Garrett D'Amore
> Sent with Airmail
>
> On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:
>
> If you are running an all SSD pool would you bother about a ZIL? Most of
> the blogs / info out there is related to hybrid pools with rust.
>
> At first I thought it probably wouldn't be needed, but if we're running
> sync=always on our datasets, we could look at using a ZIL to take the
> writes off the vdevs and increase the lifespan of them?
>
> Our individual drives are capable of delivering 36k write iops (4k
> random) and 90k read.
>
> The work load is primarily read based, virtual machine block storage,
> some SQL transactions for Mysql/Postgres (e-commerce) and a massive
> (2tb) MS-MSQL DB that is mainly read hungry.
>
> My concern is a ZIL will slow down the writes with sync on the datasets.
> What would you guys roll with or what have you tried (hopefully) and had
> success with - in an all SSD pool?
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed:
> https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
>
> *illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
> <https://www.listbox.com/member/archive/rss/182191/21878139-69539aca> |
> Modify<https://www.listbox.com/member/?&>Your Subscription
> <http://www.listbox.com>
>
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Sam Zaydel via illumos-zfs
2014-05-08 12:14:21 UTC
Permalink
I have done some of this testing at RackTop, though surely not enough and
without having enough data collected for a scientific side-by-side
comparison, but what I saw agrees completely with what Chip is saying. I
saw similar gains in performance, but I think the key here is that workload
was fully utilizing the slog. I believe this all suggests case-based
specificity. If your consumers are something akin to VMware datastores,
chances are, you will see improvements Chip and I observed with dedicated
low-latency SLOG.

Sam.


On Thu, May 8, 2014 at 4:06 AM, Schweiss, Chip via illumos-zfs <
***@lists.illumos.org> wrote:

>
> On Wed, May 7, 2014 at 9:51 PM, Garrett D'Amore via illumos-zfs <
> ***@lists.illumos.org> wrote:
>
>> I can see an argument for an SLOG even with an all SSD pool *if* you had
>> a configuration where the SLOG was something along the lines of a ZeusRAM
>> or a DDRdrive or FusionIO. In this case the decreased latencies are still
>> going to be a net win, but this is a very high performance and very spendy
>> configuration, I think.
>>
>> For “normal users”, if you’re already on SSD, there is no benefit to
>> moving your ZIL to an SLOG.
>>
>
> I run a small SSD pool (10 400gb ZeusIOPs in 2 zvols raidz1) for a VMware
> datastore, the pool kept having really bad latency issues. Turning off ZIL
> I could read/write anything up to the 10Gb nic limit. I added a single
> ZeusRAM and I now get the nearly the same performance as without ZIL.
>
> I even made sure the cache was set to non-volatile in sd.conf. That
> helped a little but it didn't take much I/O to get the latencies high.
> Setting logbias to throughput made things worse. No tuning seemed to help
> much, but the log device fixed the problem. Even adding the same ZeusIOPS
> SSD as a log device significantly improved throughput and latencies.
>
> -Chip
>
>
>
>> (As an exercise, it would be interesting to see just how much faster a
>> PCIe RAM based flash device would have to be to make such a configuration
>> worthwhile, given “typical” SSD performance numbers. This is a cache
>> analysis problem, I think. :-)
>>
>> --
>> Garrett D'Amore
>> Sent with Airmail
>>
>> On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:
>>
>> If you are running an all SSD pool would you bother about a ZIL? Most of
>> the blogs / info out there is related to hybrid pools with rust.
>>
>> At first I thought it probably wouldn't be needed, but if we're running
>> sync=always on our datasets, we could look at using a ZIL to take the
>> writes off the vdevs and increase the lifespan of them?
>>
>> Our individual drives are capable of delivering 36k write iops (4k
>> random) and 90k read.
>>
>> The work load is primarily read based, virtual machine block storage,
>> some SQL transactions for Mysql/Postgres (e-commerce) and a massive
>> (2tb) MS-MSQL DB that is mainly read hungry.
>>
>> My concern is a ZIL will slow down the writes with sync on the datasets.
>> What would you guys roll with or what have you tried (hopefully) and had
>> success with - in an all SSD pool?
>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed:
>> https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>>
>> *illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
>> <https://www.listbox.com/member/archive/rss/182191/21878139-69539aca> |
>> Modify <https://www.listbox.com/member/?&> Your Subscription
>> <http://www.listbox.com>
>>
>>
> *illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
> <https://www.listbox.com/member/archive/rss/182191/24342081-7731472e> |
> Modify<https://www.listbox.com/member/?&>Your Subscription
> <http://www.listbox.com>
>



--
Join the geek side, we have π!

Please feel free to connect with me on LinkedIn.
http://www.linkedin.com/in/samzaydel



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 14:40:48 UTC
Permalink
This is an interesting, and somewhat (at least thinking naively about it) surprising result.  One would not expect a separate SLOG to have much impact on performance.  And indeed, I’d have guessed that a separate SLOG that performs no better than primary pool vdevs would hurt performance.

The question I’d ask myself is if adding an SLOG brought a benefit that was more akin to adding another write-dedicated spindle rather than just ordinary latency reduction benefits most typically associated with the SLOG.  Also, one starts to wonder if this gets to be a situation where using a tiny amount of the drive for SSD (say 10GB as Keith suggests) means that you are effectively getting much better performance because this massively short-stroked device never has to wait for garbage collection. 

The other thing to consider is whether a single SLOG (or the SLOG configuration you are adding) can keep up with the sustained workload.  It might not be able to.  But if you have an all-up SSD pool, it starts to beg the question as to where the bottlenecks in that pool are.  (Again, see SSD garbage collection as one possible theorized culprit.  There may be others, such as pool configuration, contention for HBA resources, contention with reads, etc.  Perhaps there is even something coming about as a result of a ‘streaming’ workload vs a random workload.   One wouldn’t necessarily expect this to be a big a difference in SSDs, but if we can minimize write amplifications on non-over-provisioned drives, it can make a measurable difference I guess.)

-- 
Garrett D'Amore
Sent with Airmail

On May 8, 2014 at 5:15:24 AM, Sam Zaydel via illumos-zfs (***@lists.illumos.org) wrote:

I have done some of this testing at RackTop, though surely not enough and without having enough data collected for a scientific side-by-side comparison, but what I saw agrees completely with what Chip is saying. I saw similar gains in performance, but I think the key here is that workload was fully utilizing the slog. I believe this all suggests case-based specificity. If your consumers are something akin to VMware datastores, chances are, you will see improvements Chip and I observed with dedicated low-latency SLOG.

Sam.


On Thu, May 8, 2014 at 4:06 AM, Schweiss, Chip via illumos-zfs <***@lists.illumos.org> wrote:

On Wed, May 7, 2014 at 9:51 PM, Garrett D'Amore via illumos-zfs <***@lists.illumos.org> wrote:
I can see an argument for an SLOG even with an all SSD pool *if* you had a configuration where the SLOG was something along the lines of a ZeusRAM or a DDRdrive or FusionIO.  In this case the decreased latencies are still going to be a net win, but this is a very high performance and very spendy configuration, I think.

For “normal users”, if you’re already on SSD, there is no benefit to moving your ZIL to an SLOG.

I run a small SSD pool (10 400gb ZeusIOPs in 2 zvols raidz1) for a VMware datastore, the pool kept having really bad latency issues.  Turning off ZIL I could read/write anything up to the 10Gb nic limit.   I added a single ZeusRAM and I now get the nearly the same performance as without ZIL.

I even made sure the cache was set to non-volatile in sd.conf.  That helped a little but it didn't take much I/O to get the latencies high.  Setting logbias to throughput made things worse.  No tuning seemed to help much, but the log device fixed the problem.   Even adding the same ZeusIOPS SSD as a log device significantly improved throughput and latencies.

-Chip



(As an exercise, it would be interesting to see just how much faster a PCIe RAM based flash device would have to be to make such a configuration worthwhile, given “typical” SSD performance numbers.  This is a cache analysis problem, I think. :-)

-- 
Garrett D'Amore
Sent with Airmail

On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:
If you are running an all SSD pool would you bother about a ZIL? Most of
the blogs / info out there is related to hybrid pools with rust.

At first I thought it probably wouldn't be needed, but if we're running
sync=always on our datasets, we could look at using a ZIL to take the
writes off the vdevs and increase the lifespan of them?

Our individual drives are capable of delivering 36k write iops (4k
random) and 90k read.

The work load is primarily read based, virtual machine block storage,
some SQL transactions for Mysql/Postgres (e-commerce) and a massive
(2tb) MS-MSQL DB that is mainly read hungry.

My concern is a ZIL will slow down the writes with sync on the datasets.
What would you guys roll with or what have you tried (hopefully) and had
success with - in an all SSD pool?




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
illumos-zfs | Archives | Modify Your Subscription
illumos-zfs | Archives | Modify Your Subscription


--
Join the geek side, we have π!

Please feel free to connect with me on LinkedIn. http://www.linkedin.com/in/samzaydel
illumos-zfs | Archives | Modify Your Subscription


-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-05-08 14:37:15 UTC
Permalink
I would ask if these results are before the ZIO queuing rework?

If so they may well be invalid now, so retesting would be required.

Regards
Steve

----- Original Message -----
From: "Garrett D'Amore via illumos-zfs" <***@lists.illumos.org>
To: "Schweiss, Chip" <***@innovates.com>; <***@lists.illumos.org>
Sent: Thursday, May 08, 2014 3:40 PM
Subject: Re: [zfs] all ssd pool


This is an interesting, and somewhat (at least thinking naively about it) surprising result. One would not expect a separate SLOG
to have much impact on performance. And indeed, I’d have guessed that a separate SLOG that performs no better than primary pool
vdevs would hurt performance.

The question I’d ask myself is if adding an SLOG brought a benefit that was more akin to adding another write-dedicated spindle
rather than just ordinary latency reduction benefits most typically associated with the SLOG. Also, one starts to wonder if this
gets to be a situation where using a tiny amount of the drive for SSD (say 10GB as Keith suggests) means that you are effectively
getting much better performance because this massively short-stroked device never has to wait for garbage collection.

The other thing to consider is whether a single SLOG (or the SLOG configuration you are adding) can keep up with the sustained
workload. It might not be able to. But if you have an all-up SSD pool, it starts to beg the question as to where the bottlenecks
in that pool are. (Again, see SSD garbage collection as one possible theorized culprit. There may be others, such as pool
configuration, contention for HBA resources, contention with reads, etc. Perhaps there is even something coming about as a result
of a ‘streaming’ workload vs a random workload. One wouldn’t necessarily expect this to be a big a difference in SSDs, but if we
can minimize write amplifications on non-over-provisioned drives, it can make a measurable difference I guess.)

--
Garrett D'Amore
Sent with Airmail

On May 8, 2014 at 5:15:24 AM, Sam Zaydel via illumos-zfs (***@lists.illumos.org) wrote:

I have done some of this testing at RackTop, though surely not enough and without having enough data collected for a scientific
side-by-side comparison, but what I saw agrees completely with what Chip is saying. I saw similar gains in performance, but I
think the key here is that workload was fully utilizing the slog. I believe this all suggests case-based specificity. If your
consumers are something akin to VMware datastores, chances are, you will see improvements Chip and I observed with dedicated
low-latency SLOG.

Sam.


On Thu, May 8, 2014 at 4:06 AM, Schweiss, Chip via illumos-zfs <***@lists.illumos.org> wrote:

On Wed, May 7, 2014 at 9:51 PM, Garrett D'Amore via illumos-zfs <***@lists.illumos.org> wrote:
I can see an argument for an SLOG even with an all SSD pool *if* you had a configuration where the SLOG was something along the
lines of a ZeusRAM or a DDRdrive or FusionIO. In this case the decreased latencies are still going to be a net win, but this is a
very high performance and very spendy configuration, I think.

For “normal users”, if you’re already on SSD, there is no benefit to moving your ZIL to an SLOG.

I run a small SSD pool (10 400gb ZeusIOPs in 2 zvols raidz1) for a VMware datastore, the pool kept having really bad latency
issues. Turning off ZIL I could read/write anything up to the 10Gb nic limit. I added a single ZeusRAM and I now get the nearly
the same performance as without ZIL.

I even made sure the cache was set to non-volatile in sd.conf. That helped a little but it didn't take much I/O to get the
latencies high. Setting logbias to throughput made things worse. No tuning seemed to help much, but the log device fixed the
problem. Even adding the same ZeusIOPS SSD as a log device significantly improved throughput and latencies.

-Chip



(As an exercise, it would be interesting to see just how much faster a PCIe RAM based flash device would have to be to make such a
configuration worthwhile, given “typical” SSD performance numbers. This is a cache analysis problem, I think. :-)

--
Garrett D'Amore
Sent with Airmail

On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:
If you are running an all SSD pool would you bother about a ZIL? Most of
the blogs / info out there is related to hybrid pools with rust.

At first I thought it probably wouldn't be needed, but if we're running
sync=always on our datasets, we could look at using a ZIL to take the
writes off the vdevs and increase the lifespan of them?

Our individual drives are capable of delivering 36k write iops (4k
random) and 90k read.

The work load is primarily read based, virtual machine block storage,
some SQL transactions for Mysql/Postgres (e-commerce) and a massive
(2tb) MS-MSQL DB that is mainly read hungry.

My concern is a ZIL will slow down the writes with sync on the datasets.
What would you guys roll with or what have you tried (hopefully) and had
success with - in an all SSD pool?




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
illumos-zfs | Archives | Modify Your Subscription
illumos-zfs | Archives | Modify Your Subscription


--
Join the geek side, we have π!

Please feel free to connect with me on LinkedIn. http://www.linkedin.com/in/samzaydel
illumos-zfs | Archives | Modify Your Subscription


-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24401717-fdfe502b
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 14:51:42 UTC
Permalink
On Thu, May 8, 2014 at 9:37 AM, Steven Hartland <***@multiplay.co.uk>wrote:

> I would ask if these results are before the ZIO queuing rework?
>
> If so they may well be invalid now, so retesting would be required.
>
>
Negative. This has been a problem both before and after. I suspected the
same thing and re-tested without a log device after the ZIO queuing was
released.

I just recently replaced the spare ZeusIOPS SSD that was being used as a
log device with a ZeusRAM. Without a log device this pool gets latencies
spikes in the 3K ms range and throughput tanks as well. With a ZeusIOPs
log it got spikes around 20ms and with the ZeusRAM the worst I've seen is
3ms. These measurements are from the VMware side. Similar measurements
were observable from nfssvrtop.

On my experimental scratch SSD pool the same thing happens with sync
workload. It consists of 60 Samsung 840 Pro, 512GB SSDs. With ZIL
turned on and no log device it's performance tanks too. Since this is
purely a scratch pool, ZIL is turned off and it's performance is through
the roof with any workload.

-Chip



> Regards
> Steve
>
> ----- Original Message ----- From: "Garrett D'Amore via illumos-zfs" <
> ***@lists.illumos.org>
> To: "Schweiss, Chip" <***@innovates.com>; <***@lists.illumos.org>
> Sent: Thursday, May 08, 2014 3:40 PM
> Subject: Re: [zfs] all ssd pool
>
>
>
> This is an interesting, and somewhat (at least thinking naively about it)
> surprising result. One would not expect a separate SLOG to have much impact
> on performance. And indeed, I’d have guessed that a separate SLOG that
> performs no better than primary pool vdevs would hurt performance.
>
> The question I’d ask myself is if adding an SLOG brought a benefit that
> was more akin to adding another write-dedicated spindle rather than just
> ordinary latency reduction benefits most typically associated with the
> SLOG. Also, one starts to wonder if this gets to be a situation where using
> a tiny amount of the drive for SSD (say 10GB as Keith suggests) means that
> you are effectively getting much better performance because this massively
> short-stroked device never has to wait for garbage collection.
>
> The other thing to consider is whether a single SLOG (or the SLOG
> configuration you are adding) can keep up with the sustained workload. It
> might not be able to. But if you have an all-up SSD pool, it starts to beg
> the question as to where the bottlenecks in that pool are. (Again, see SSD
> garbage collection as one possible theorized culprit. There may be others,
> such as pool configuration, contention for HBA resources, contention with
> reads, etc. Perhaps there is even something coming about as a result of a
> ‘streaming’ workload vs a random workload. One wouldn’t necessarily expect
> this to be a big a difference in SSDs, but if we can minimize write
> amplifications on non-over-provisioned drives, it can make a measurable
> difference I guess.)
>
> --
> Garrett D'Amore
> Sent with Airmail
>
> On May 8, 2014 at 5:15:24 AM, Sam Zaydel via illumos-zfs (
> ***@lists.illumos.org) wrote:
>
> I have done some of this testing at RackTop, though surely not enough and
> without having enough data collected for a scientific side-by-side
> comparison, but what I saw agrees completely with what Chip is saying. I
> saw similar gains in performance, but I think the key here is that workload
> was fully utilizing the slog. I believe this all suggests case-based
> specificity. If your consumers are something akin to VMware datastores,
> chances are, you will see improvements Chip and I observed with dedicated
> low-latency SLOG.
>
> Sam.
>
>
> On Thu, May 8, 2014 at 4:06 AM, Schweiss, Chip via illumos-zfs <
> ***@lists.illumos.org> wrote:
>
> On Wed, May 7, 2014 at 9:51 PM, Garrett D'Amore via illumos-zfs <
> ***@lists.illumos.org> wrote:
> I can see an argument for an SLOG even with an all SSD pool *if* you had a
> configuration where the SLOG was something along the lines of a ZeusRAM or
> a DDRdrive or FusionIO. In this case the decreased latencies are still
> going to be a net win, but this is a very high performance and very spendy
> configuration, I think.
>
> For “normal users”, if you’re already on SSD, there is no benefit to
> moving your ZIL to an SLOG.
>
> I run a small SSD pool (10 400gb ZeusIOPs in 2 zvols raidz1) for a VMware
> datastore, the pool kept having really bad latency issues. Turning off ZIL
> I could read/write anything up to the 10Gb nic limit. I added a single
> ZeusRAM and I now get the nearly the same performance as without ZIL.
>
> I even made sure the cache was set to non-volatile in sd.conf. That helped
> a little but it didn't take much I/O to get the latencies high. Setting
> logbias to throughput made things worse. No tuning seemed to help much, but
> the log device fixed the problem. Even adding the same ZeusIOPS SSD as a
> log device significantly improved throughput and latencies.
>
> -Chip
>
>
>
> (As an exercise, it would be interesting to see just how much faster a
> PCIe RAM based flash device would have to be to make such a configuration
> worthwhile, given “typical” SSD performance numbers. This is a cache
> analysis problem, I think. :-)
>
> --
> Garrett D'Amore
> Sent with Airmail
>
> On May 7, 2014 at 7:34:22 PM, Luke Iggleden (***@sisgroup.com.au) wrote:
> If you are running an all SSD pool would you bother about a ZIL? Most of
> the blogs / info out there is related to hybrid pools with rust.
>
> At first I thought it probably wouldn't be needed, but if we're running
> sync=always on our datasets, we could look at using a ZIL to take the
> writes off the vdevs and increase the lifespan of them?
>
> Our individual drives are capable of delivering 36k write iops (4k
> random) and 90k read.
>
> The work load is primarily read based, virtual machine block storage,
> some SQL transactions for Mysql/Postgres (e-commerce) and a massive
> (2tb) MS-MSQL DB that is mainly read hungry.
>
> My concern is a ZIL will slow down the writes with sync on the datasets.
> What would you guys roll with or what have you tried (hopefully) and had
> success with - in an all SSD pool?
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
> 22035932-85c5d227
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
> illumos-zfs | Archives | Modify Your Subscription
> illumos-zfs | Archives | Modify Your Subscription
>
>
>
> --
> Join the geek side, we have π!
>
> Please feel free to connect with me on LinkedIn.
> http://www.linkedin.com/in/samzaydel
> illumos-zfs | Archives | Modify Your Subscription
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
> 24401717-fdfe502b
> Modify Your Subscription: https://www.listbox.com/
> member/?&
>
> Powered by Listbox: http://www.listbox.com
>
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-05-08 15:08:45 UTC
Permalink
----- Original Message -----
From: "Schweiss, Chip" <***@innovates.com>
> On Thu, May 8, 2014 at 9:37 AM, Steven Hartland <***@multiplay.co.uk>wrote:
>
> > I would ask if these results are before the ZIO queuing rework?
> >
> > If so they may well be invalid now, so retesting would be required.
> >
> >
> Negative. This has been a problem both before and after. I suspected the
> same thing and re-tested without a log device after the ZIO queuing was
> released.
>
> I just recently replaced the spare ZeusIOPS SSD that was being used as a
> log device with a ZeusRAM. Without a log device this pool gets latencies
> spikes in the 3K ms range and throughput tanks as well. With a ZeusIOPs
> log it got spikes around 20ms and with the ZeusRAM the worst I've seen is
> 3ms. These measurements are from the VMware side. Similar measurements
> were observable from nfssvrtop.
>
> On my experimental scratch SSD pool the same thing happens with sync
> workload. It consists of 60 Samsung 840 Pro, 512GB SSDs. With ZIL
> turned on and no log device it's performance tanks too. Since this is
> purely a scratch pool, ZIL is turned off and it's performance is through
> the roof with any workload.

Interesting a few other questions spring to mind about 840 Pro setup:
1. Have you tried tuning sync_write_max_active and sync_write_min_active?
2. Whats the impact of having a dedicate 840 Pro as SLOG?
3. Does moving SLOG on to a different controller have any impact?
4. What sort of IOP's are you seeing and whats the breakdown of reads / writes?

Regards
Steve
Schweiss, Chip via illumos-zfs
2014-05-08 15:56:49 UTC
Permalink
On Thu, May 8, 2014 at 10:08 AM, Steven Hartland <***@multiplay.co.uk>wrote:

> ----- Original Message ----- From: "Schweiss, Chip" <***@innovates.com>
>
> Negative. This has been a problem both before and after. I suspected
>> the
>> same thing and re-tested without a log device after the ZIO queuing was
>> released.
>>
>> I just recently replaced the spare ZeusIOPS SSD that was being used as a
>> log device with a ZeusRAM. Without a log device this pool gets latencies
>> spikes in the 3K ms range and throughput tanks as well. With a ZeusIOPs
>> log it got spikes around 20ms and with the ZeusRAM the worst I've seen is
>> 3ms. These measurements are from the VMware side. Similar measurements
>> were observable from nfssvrtop.
>>
>> On my experimental scratch SSD pool the same thing happens with sync
>> workload. It consists of 60 Samsung 840 Pro, 512GB SSDs. With ZIL
>> turned on and no log device it's performance tanks too. Since this is
>> purely a scratch pool, ZIL is turned off and it's performance is through
>> the roof with any workload.
>>
>
> Interesting a few other questions spring to mind about 840 Pro setup:
> 1. Have you tried tuning sync_write_max_active and sync_write_min_active?
>

I have not.

2. Whats the impact of having a dedicate 840 Pro as SLOG?

I haven't tried that. Like I said this is a dedicated scratch pool, ZIL
was unimportant to me. I encounter this when I temporarily created a VM
datastore on it to migrate VMs to it temporarily. When performance tanked
shortly after starting to migrate, I simply shut off the ZIL on the
datastore.

3. Does moving SLOG on to a different controller have any impact?
>

Again couldn't tell you on this pool. On the ZeusIOPs pool the ZeusRAM is
on separate controllers on dedicated SAS paths.


> 4. What sort of IOP's are you seeing and whats the breakdown of reads /
> writes?
>

When we've hit it the hardest it was sustaining about 100K nfs ops as
measured by nfssvrtop. At this point our all the bottlenecks appeared to
be external. NFS latencies were fluctuating between 1 and 5 ms. It's a
pretty even workload of read/write at this point at about 7 Gb/s each
way. I've seen spikes both reading and writing hit 20Gb/s across the two
IP multipath'd 10Gb nics.

This system has only a single 4C 3.3Ghz E5-2643 CPU. It appears under
some workloads the CPU becomes the bottleneck servicing NFS.

IOzone run locally streams about 4.4GB/s.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Elling via illumos-zfs
2014-05-09 01:06:04 UTC
Permalink
Good work Chip! Thanks for sharing. Comments below...

On May 8, 2014, at 8:56 AM, Schweiss, Chip via illumos-zfs <***@lists.illumos.org> wrote:
>
> On Thu, May 8, 2014 at 10:08 AM, Steven Hartland <***@multiplay.co.uk> wrote:
> ----- Original Message ----- From: "Schweiss, Chip" <***@innovates.com>
>
> Negative. This has been a problem both before and after. I suspected the
> same thing and re-tested without a log device after the ZIO queuing was
> released.
>
> I just recently replaced the spare ZeusIOPS SSD that was being used as a
> log device with a ZeusRAM. Without a log device this pool gets latencies
> spikes in the 3K ms range and throughput tanks as well. With a ZeusIOPs
> log it got spikes around 20ms and with the ZeusRAM the worst I've seen is
> 3ms. These measurements are from the VMware side. Similar measurements
> were observable from nfssvrtop.
>
> On my experimental scratch SSD pool the same thing happens with sync
> workload. It consists of 60 Samsung 840 Pro, 512GB SSDs. With ZIL
> turned on and no log device it's performance tanks too. Since this is
> purely a scratch pool, ZIL is turned off and it's performance is through
> the roof with any workload.
>
> Interesting a few other questions spring to mind about 840 Pro setup:
> 1. Have you tried tuning sync_write_max_active and sync_write_min_active?
>
> I have not.

Many folks are in this boat: we know they will be interesting to test, but don't
have the spare time. In general, I expect that we'll need to tune sync_write_max_active
higher for SSD pools and lower for HDD pools.

>
> 2. Whats the impact of having a dedicate 840 Pro as SLOG?
>
> I haven't tried that. Like I said this is a dedicated scratch pool, ZIL was unimportant to me. I encounter this when I temporarily created a VM datastore on it to migrate VMs to it temporarily. When performance tanked shortly after starting to migrate, I simply shut off the ZIL on the datastore.

I have some 840 Pro pools handy, but not right now on the same systems as suitable
for measuring slog performance. My workplace sells ZeusRAMs, so most of the current
performance work centers around using them for slogs.

>
> 3. Does moving SLOG on to a different controller have any impact?
>
> Again couldn't tell you on this pool. On the ZeusIOPs pool the ZeusRAM is on separate controllers on dedicated SAS paths.

Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.

>
> 4. What sort of IOP's are you seeing and whats the breakdown of reads / writes?
>
> When we've hit it the hardest it was sustaining about 100K nfs ops as measured by nfssvrtop. At this point our all the bottlenecks appeared to be external. NFS latencies were fluctuating between 1 and 5 ms. It's a pretty even workload of read/write at this point at about 7 Gb/s each way. I've seen spikes both reading and writing hit 20Gb/s across the two IP multipath'd 10Gb nics.

This is very impressive. Well done!

>
> This system has only a single 4C 3.3Ghz E5-2643 CPU. It appears under some workloads the CPU becomes the bottleneck servicing NFS.

Check prefetching. It can consume a lot of CPU for random workloads such as VMs and databases.
If you're not getting any benefit from prefetching, then disabling can be a big win.

>
> IOzone run locally streams about 4.4GB/s.
>

nice.
-- richard

--

***@RichardElling.com
+1-760-896-4422












-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
aurfalien via illumos-zfs
2014-05-09 02:39:48 UTC
Permalink
On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org> wrote:

> Good work Chip! Thanks for sharing. Comments below...
>
> On May 8, 2014, at 8:56 AM, Schweiss, Chip via illumos-zfs <***@lists.illumos.org> wrote:
>>
>> On Thu, May 8, 2014 at 10:08 AM, Steven Hartland <***@multiplay.co.uk> wrote:
>> ----- Original Message ----- From: "Schweiss, Chip" <***@innovates.com>
>>
>> Negative. This has been a problem both before and after. I suspected the
>> same thing and re-tested without a log device after the ZIO queuing was
>> released.
>>
>> I just recently replaced the spare ZeusIOPS SSD that was being used as a
>> log device with a ZeusRAM. Without a log device this pool gets latencies
>> spikes in the 3K ms range and throughput tanks as well. With a ZeusIOPs
>> log it got spikes around 20ms and with the ZeusRAM the worst I've seen is
>> 3ms. These measurements are from the VMware side. Similar measurements
>> were observable from nfssvrtop.
>>
>> On my experimental scratch SSD pool the same thing happens with sync
>> workload. It consists of 60 Samsung 840 Pro, 512GB SSDs. With ZIL
>> turned on and no log device it's performance tanks too. Since this is
>> purely a scratch pool, ZIL is turned off and it's performance is through
>> the roof with any workload.
>>
>> Interesting a few other questions spring to mind about 840 Pro setup:
>> 1. Have you tried tuning sync_write_max_active and sync_write_min_active?
>>
>> I have not.
>
> Many folks are in this boat: we know they will be interesting to test, but don't
> have the spare time. In general, I expect that we'll need to tune sync_write_max_active
> higher for SSD pools and lower for HDD pools.
>
>>
>> 2. Whats the impact of having a dedicate 840 Pro as SLOG?
>>
>> I haven't tried that. Like I said this is a dedicated scratch pool, ZIL was unimportant to me. I encounter this when I temporarily created a VM datastore on it to migrate VMs to it temporarily. When performance tanked shortly after starting to migrate, I simply shut off the ZIL on the datastore.
>
> I have some 840 Pro pools handy, but not right now on the same systems as suitable
> for measuring slog performance. My workplace sells ZeusRAMs, so most of the current
> performance work centers around using them for slogs.
>
>>
>> 3. Does moving SLOG on to a different controller have any impact?
>>
>> Again couldn't tell you on this pool. On the ZeusIOPs pool the ZeusRAM is on separate controllers on dedicated SAS paths.
>
> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.

Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?

I could be consciences and test but this is very interesting :)

I’ve 4 Intel DC3700s 100GB models that I can mess around with. Latency for sequential flows look to be 50micro for reads and 65micro for writes.

- aurf

"Janitorial Services”


-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Ian Collins via illumos-zfs
2014-05-09 07:48:13 UTC
Permalink
aurfalien via illumos-zfs wrote:
> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs
> <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>
>> Important point here: a single ZeusRAM (or mirrored pair) will peak
>> out around 700-750 MB/sec
>> in my measurements (4k block size). To satisfy the needs of 10GbE
>> networks, you need to stripe
>> them as slogs. I have yet to see a single SAS/SATA SSD be able to
>> soak 1GB/sec.
>
> Can you actually see benefits from striping a SLOG across multiple
> SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t.
> Now if you ask me what nature, I could’t tell you other then it being
> a single threaded operation?
>

Yes. They're lust like any other device. You see a benefit from
striping devices in a pool, don't you?

--
Ian.
aurfalien via illumos-zfs
2014-05-09 15:30:00 UTC
Permalink
On May 9, 2014, at 12:48 AM, Ian Collins <***@ianshome.com> wrote:

> aurfalien via illumos-zfs wrote:
>> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>>
>>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
>>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
>>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
>>
>> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
>>
>
> Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?

Can you help interpret this post;

http://www.nexentastor.org/boards/5/topics/6179

- aurf

"Janitorial Services"
Matthew Ahrens via illumos-zfs
2014-05-09 16:32:57 UTC
Permalink
On Fri, May 9, 2014 at 8:30 AM, aurfalien via illumos-zfs <
***@lists.illumos.org> wrote:

> On May 9, 2014, at 12:48 AM, Ian Collins <***@ianshome.com> wrote:
>
> > aurfalien via illumos-zfs wrote:
> >> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <
> ***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
> >>>
> >>> Important point here: a single ZeusRAM (or mirrored pair) will peak
> out around 700-750 MB/sec
> >>> in my measurements (4k block size). To satisfy the needs of 10GbE
> networks, you need to stripe
> >>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak
> 1GB/sec.
> >>
> >> Can you actually see benefits from striping a SLOG across multiple
> SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now
> if you ask me what nature, I could’t tell you other then it being a single
> threaded operation?
> >>
> >
> > Yes. They're lust like any other device. You see a benefit from
> striping devices in a pool, don't you?
>
> Can you help interpret this post;
>
> http://www.nexentastor.org/boards/5/topics/6179
>

If you have one thread doing synchronous writes, adding more (striped) log
devices will not help, because you are bound by latency, not bandwidth or
iops.

If you have multiple threads doing synchronous writes, then as Richard
mentioned you could max the bandwidth of your log device. In this case,
adding more striped log devices will help.

The ZIL is a per-dataset (filesystem or zvol) structure, so concurrent
access to separate datasets proceeds concurrently (resulting in concurrent
writes to all the log devices). Each ZIL can also issue multiple
concurrent writes (although this algorithm is not perfect). So even with
just one dataset, by using multiple threads you could max out the bandwidth
on a single log device and then benefit from adding more striped log
devices.

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Elling via illumos-zfs
2014-05-10 00:53:59 UTC
Permalink
On May 9, 2014, at 9:32 AM, Matthew Ahrens via illumos-zfs <***@lists.illumos.org> wrote:
> On Fri, May 9, 2014 at 8:30 AM, aurfalien via illumos-zfs <***@lists.illumos.org> wrote:
> On May 9, 2014, at 12:48 AM, Ian Collins <***@ianshome.com> wrote:
>
> > aurfalien via illumos-zfs wrote:
> >> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
> >>>
> >>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
> >>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
> >>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
> >>
> >> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
> >>
> >
> > Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?
>
> Can you help interpret this post;
>
> http://www.nexentastor.org/boards/5/topics/6179
>
> If you have one thread doing synchronous writes, adding more (striped) log devices will not help, because you are bound by latency, not bandwidth or iops.
>
> If you have multiple threads doing synchronous writes, then as Richard mentioned you could max the bandwidth of your log device. In this case, adding more striped log devices will help.
>
> The ZIL is a per-dataset (filesystem or zvol) structure, so concurrent access to separate datasets proceeds concurrently (resulting in concurrent writes to all the log devices). Each ZIL can also issue multiple concurrent writes (although this algorithm is not perfect). So even with just one dataset, by using multiple threads you could max out the bandwidth on a single log device and then benefit from adding more striped log devices.

Yep. Examples of multithreaded workloads include:
+ in-kernel NFS servers
+ in-kernel SMB servers (not Samba, it is single-threaded)
+ in-kernel block services (COMSTAR)
+ many databases

When in doubt, test :-)

— richard

--

ZFS and performance consulting
http://www.RichardElling.com
Luke Iggleden via illumos-zfs
2014-05-11 01:59:38 UTC
Permalink
On 10/05/2014 10:53 am, Richard Elling via illumos-zfs wrote:
> On May 9, 2014, at 9:32 AM, Matthew Ahrens via illumos-zfs <***@lists.illumos.org> wrote:
>> On Fri, May 9, 2014 at 8:30 AM, aurfalien via illumos-zfs <***@lists.illumos.org> wrote:
>> On May 9, 2014, at 12:48 AM, Ian Collins <***@ianshome.com> wrote:
>>
>>> aurfalien via illumos-zfs wrote:
>>>> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>>>>
>>>>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
>>>>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
>>>>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
>>>>
>>>> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
>>>>
>>>
>>> Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?
>>


Has anyone tried using a FusionIO Card (iodrive) as a ZIL yet? I've
spoken to their sales team and they suggest they work with open solaris
2009.06.

Are they custom drivers then which probably haven't been updated since
Illumos forked?

Looking at the Iodrive2 duo 1.2TB (2 x 600GB drives) with a stripe for a
ZIL/SLOG.

Got a demo unit coming in end of this month to test it out. Priced
around the same as a pair of ZeusRAM devices, better performance and
latency though. (on paper)
Luke Iggleden via illumos-zfs
2014-05-11 01:29:25 UTC
Permalink
On 10/05/2014 10:53 am, Richard Elling via illumos-zfs wrote:
> On May 9, 2014, at 9:32 AM, Matthew Ahrens via illumos-zfs <***@lists.illumos.org> wrote:
>> On Fri, May 9, 2014 at 8:30 AM, aurfalien via illumos-zfs <***@lists.illumos.org> wrote:
>> On May 9, 2014, at 12:48 AM, Ian Collins <***@ianshome.com> wrote:
>>
>>> aurfalien via illumos-zfs wrote:
>>>> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>>>>
>>>>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
>>>>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
>>>>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
>>>>
>>>> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
>>>>
>>>
>>> Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?
>>


Has anyone tried using a FusionIO Card (iodrive) as a ZIL yet? I've
spoken to their sales team and they suggest they work with open solaris
2009.06.

Are they custom drivers then which probably haven't been updated since
Illumos forked?

Looking at the Iodrive2 duo 1.2TB (2 x 600GB drives) with a stripe for a
ZIL/SLOG.

Got a demo unit coming in end of this month to test it out. Priced
around the same as a pair of ZeusRAM devices, better performance and
latency though. (on paper)
Kirill Davydychev via illumos-zfs
2014-05-10 23:49:53 UTC
Permalink
I may be late to this discussion, but below are rules of thumb that I came up with in the last couple of years of working with various kinds of all-SSD systems:

1. If you’re using it for temporary/scratch/unimportant data, use sync=disabled. You don’t need the overhead of ZIL when ultimately your data is worthless (think VDI stateless desktops, or any other workload that survives a few minutes (worst case) of lost writes with absolutely no financial impact to your organization, except for sysadmin time to bring the systems back online.
2. ZFS free space fragmentation matters. Delphix did a lot of great work on lowering the impact of fragmented free space by optimizing the metaslab allocator, but ultimately, if your workload is highly random and write-intensive, you’re guaranteed to hit a wall at some point as your pool fills and free space gets more fragmented with regular data churn. The more random the workload, the sooner you’re gonna hit this. The reason you may want a dedicated ZIL device on an all-SSD system is because without it, ***unless you run sync=disabled***, your sync writes will go to the main pool drives uncoalesced, will fill random chunks of space everywhere, and will be very hard to rebalance/free if you get to a point of zfs fragmentation. If you have a dedicated ZIL/slog/whateveryouwannacallit, you’re basically eliminating the fragmentation induced by small-block sync writes (<=32k), and if you’ve been adventurous enough to raise your *_immediate_write_sz, the impact may be even worse.
3. We still don’t have block pointer rewrite :)
4. Flash rewrite cycles - if you can, even if you don’t use a DRAM-based ZIL/slog/whatever, after you run your system in prod for a couple years, would you rather replace all of your flash that’s worn out, or just some of the flash that’s worn out? Having a slog to absorb the flash wearout might be cheaper in the long term. Do the math.

Best regards,
Kirill Davydychev
Enterprise Architect
Nexenta Systems, Inc.

On May 9, 2014, at 12:48 AM, Ian Collins via illumos-zfs <***@lists.illumos.org> wrote:

> aurfalien via illumos-zfs wrote:
>> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>>
>>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
>>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
>>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
>>
>> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
>>
>
> Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?
>
> --
> Ian.
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22099383-fefe14de
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
Matthew Ahrens via illumos-zfs
2014-05-12 22:06:01 UTC
Permalink
All good advice, kirill. Thanks for sharing.

--matt

> On May 10, 2014, at 4:49 PM, "Kirill Davydychev via illumos-zfs" <***@lists.illumos.org> wrote:
>
> I may be late to this discussion, but below are rules of thumb that I came up with in the last couple of years of working with various kinds of all-SSD systems:
>
> 1. If you’re using it for temporary/scratch/unimportant data, use sync=disabled. You don’t need the overhead of ZIL when ultimately your data is worthless (think VDI stateless desktops, or any other workload that survives a few minutes (worst case) of lost writes with absolutely no financial impact to your organization, except for sysadmin time to bring the systems back online.
> 2. ZFS free space fragmentation matters. Delphix did a lot of great work on lowering the impact of fragmented free space by optimizing the metaslab allocator, but ultimately, if your workload is highly random and write-intensive, you’re guaranteed to hit a wall at some point as your pool fills and free space gets more fragmented with regular data churn. The more random the workload, the sooner you’re gonna hit this. The reason you may want a dedicated ZIL device on an all-SSD system is because without it, ***unless you run sync=disabled***, your sync writes will go to the main pool drives uncoalesced, will fill random chunks of space everywhere, and will be very hard to rebalance/free if you get to a point of zfs fragmentation. If you have a dedicated ZIL/slog/whateveryouwannacallit, you’re basically eliminating the fragmentation induced by small-block sync writes (<=32k), and if you’ve been adventurous enough to raise your *_immediate_write_sz, the impact may be even worse.
> 3. We still don’t have block pointer rewrite :)
> 4. Flash rewrite cycles - if you can, even if you don’t use a DRAM-based ZIL/slog/whatever, after you run your system in prod for a couple years, would you rather replace all of your flash that’s worn out, or just some of the flash that’s worn out? Having a slog to absorb the flash wearout might be cheaper in the long term. Do the math.
>
> Best regards,
> Kirill Davydychev
> Enterprise Architect
> Nexenta Systems, Inc.
>
>> On May 9, 2014, at 12:48 AM, Ian Collins via illumos-zfs <***@lists.illumos.org> wrote:
>>
>> aurfalien via illumos-zfs wrote:
>>>> On May 8, 2014, at 6:06 PM, Richard Elling via illumos-zfs <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>>>>
>>>> Important point here: a single ZeusRAM (or mirrored pair) will peak out around 700-750 MB/sec
>>>> in my measurements (4k block size). To satisfy the needs of 10GbE networks, you need to stripe
>>>> them as slogs. I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.
>>>
>>> Can you actually see benefits from striping a SLOG across multiple SSDs? I thought that due to the nature of SLOG/ZIL that one wouldn’t. Now if you ask me what nature, I could’t tell you other then it being a single threaded operation?
>>
>> Yes. They're lust like any other device. You see a benefit from striping devices in a pool, don't you?
>>
>> --
>> Ian.
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22099383-fefe14de
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/21635000-ebd1d460
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-12 23:18:51 UTC
Permalink
> 4. Flash rewrite cycles - if you can, even if you don’t use a DRAM-based ZIL/slog/whatever, after you run your system in prod for a couple years, would you rather replace all of your flash that’s worn out, or just some of the flash that’s worn out? Having a slog to absorb the flash wearout might be cheaper in the long term. Do the math. 


To go one better, if you use a DRAM based SLOG (e.g. ZeusRAM, DDRdrive) in addition to the not insubstantial performance benefits, you *also* eliminate nearly *all* of that wear and tear from a random workload.  You still have to write to your backing disks, but the SLOG’s lifetime should be nearly infinite, since the only times you really need to write to the backing NAND are on a power loss event.

You should factor that into your cost analysis when you’re thinking about the high cost of the RAM based SSDs.  I’m not sure its enough savings from the wear reduction to balance the much higher cost of the RAM based SSDs, but it at least *helps*. :-)

 - Garrett





-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-05-09 09:32:19 UTC
Permalink
----- Original Message -----
From: "Richard Elling via illumos-zfs" <***@lists.illumos.org>

>>
>> 3. Does moving SLOG on to a different controller have any impact?
>>
>> Again couldn't tell you on this pool. On the ZeusIOPs pool the
> ZeusRAM is on separate controllers on dedicated SAS paths.
>
> Important point here: a single ZeusRAM (or mirrored pair) will peak
> out around 700-750 MB/sec in my measurements (4k block size). To
> satisfy the needs of 10GbE networks, you need to stripe them as slogs.
> I have yet to see a single SAS/SATA SSD be able to soak 1GB/sec.

Be interesting to see what the new Samsung XS1715, which is a NVMe
disk, can do as thats speced at 4GB/s and 750k IOPs

Regards
Steve
Stefan Ring via illumos-zfs
2014-05-08 14:37:55 UTC
Permalink
On Thu, May 8, 2014 at 4:40 PM, Garrett D'Amore via illumos-zfs
<***@lists.illumos.org> wrote:
>
> This is an interesting, and somewhat (at least thinking naively about it) surprising result. One would not expect a separate SLOG to have much impact on performance. And indeed, I’d have guessed that a separate SLOG that performs no better than primary pool vdevs would hurt performance.
>
> The question I’d ask myself is if adding an SLOG brought a benefit that was more akin to adding another write-dedicated spindle rather than just ordinary latency reduction benefits most typically associated with the SLOG. Also, one starts to wonder if this gets to be a situation where using a tiny amount of the drive for SSD (say 10GB as Keith suggests) means that you are effectively getting much better performance because this massively short-stroked device never has to wait for garbage collection.
>
> The other thing to consider is whether a single SLOG (or the SLOG configuration you are adding) can keep up with the sustained workload. It might not be able to. But if you have an all-up SSD pool, it starts to beg the question as to where the bottlenecks in that pool are. (Again, see SSD garbage collection as one possible theorized culprit. There may be others, such as pool configuration, contention for HBA resources, contention with reads, etc. Perhaps there is even something coming about as a result of a ‘streaming’ workload vs a random workload. One wouldn’t necessarily expect this to be a big a difference in SSDs, but if we can minimize write amplifications on non-over-provisioned drives, it can make a measurable difference I guess.)

A Flash device is capable of handling a few tens or hundreds of
thousands of writes per second. A DRAM module could easily keep up
with tens or hundreds of millions of operations per second. Does this
not explain the difference?
Garrett D'Amore via illumos-zfs
2014-05-08 16:07:24 UTC
Permalink
A Flash device is capable of handling a few tens or hundreds of 
thousands of writes per second. A DRAM module could easily keep up 
with tens or hundreds of millions of operations per second. Does this 
not explain the difference? 
Not if the SLOG is the same flash technology as the main pool vdevs.  (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD.  But that wasn’t what was described — the report was that even with a pool of ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes.  *That* is a surprising result when looked at naively.

- Garrett


-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 16:12:49 UTC
Permalink
On Thu, May 8, 2014 at 11:07 AM, Garrett D'Amore <***@damore.org> wrote:

> Not if the SLOG is the same flash technology as the main pool vdevs.
> (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD. But
> that wasn’t what was described — the report was that even with a pool of
> ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes.
> *That* is a surprising result when looked at naively.
>
> - Garrett
>
Exactly! The ZeusIOPS pool has 11 SSDs, 10 in the pool 1 as a spare.
When I first encounter the performance issue, I found turning off ZIL
completely eliminated the problems and installing the spare a log device
also solved the problem. I wasn't expecting any of this behavior when I
built the pool or a ZeusRAM probably would have been part of it initially.

ZFS really doesn't behave well on SSDs with out a separate log device for
sync workloads. It was quite the sinking feeling when I put close to $15K
or SSDs in service and got horrible performance. It was doing worse than
the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to
replace as our production VM datastore.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 16:22:18 UTC
Permalink
Here's another oddity about ZeusRAM as ZIL, but probably related to the
same SSD pool log issue.

I have 3 ZeusRAM attached to a 240 spindle 7.2k pool. Pushing 10Gb async
writes is very easy to this pool, but pushing past 5Gb/s sync never
happens. If I storage vMotion from this pool to the ZeusIOPs pool with 1
ZeusRAM it will sustain 6Gb/s.

vMotion back to the disk pool maxes out at 4Gb/s. If I turn off ZIL it
goes to 10Gb in either direction.

-Chip


On Thu, May 8, 2014 at 11:12 AM, Schweiss, Chip <***@innovates.com> wrote:

>
>
>
> On Thu, May 8, 2014 at 11:07 AM, Garrett D'Amore <***@damore.org>wrote:
>
>> Not if the SLOG is the same flash technology as the main pool vdevs.
>> (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD. But
>> that wasn’t what was described — the report was that even with a pool of
>> ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes.
>> *That* is a surprising result when looked at naively.
>>
>> - Garrett
>>
> Exactly! The ZeusIOPS pool has 11 SSDs, 10 in the pool 1 as a spare.
> When I first encounter the performance issue, I found turning off ZIL
> completely eliminated the problems and installing the spare a log device
> also solved the problem. I wasn't expecting any of this behavior when I
> built the pool or a ZeusRAM probably would have been part of it initially.
>
> ZFS really doesn't behave well on SSDs with out a separate log device for
> sync workloads. It was quite the sinking feeling when I put close to $15K
> or SSDs in service and got horrible performance. It was doing worse than
> the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to
> replace as our production VM datastore.
>
> -Chip
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 16:45:53 UTC
Permalink
On May 8, 2014 at 9:22:40 AM, Schweiss, Chip (***@innovates.com) wrote:

Here's another oddity about ZeusRAM as ZIL, but probably related to the same SSD pool log issue.

I have 3 ZeusRAM attached to a 240 spindle 7.2k pool.  Pushing 10Gb async writes is very easy to this pool, but pushing past 5Gb/s sync never happens.   If I storage vMotion from this pool to the ZeusIOPs pool with 1 ZeusRAM it will sustain 6Gb/s.
Stop measuring async numbers.  They are utterly devoid of meaning.  (Unless you’re just wanting to measure your network pipes, in which case go right ahead
. but you’re not measuring storage subsystem performance at that point.)  The exception here would be if you’ve got a streaming workload of sufficient size.  But we’ve been talking about random workloads here.

These results are hardly suprising if your ZeusRAM drives are in a mirror.  With a write (even SLOG), the writes have to be completed on all drives.  The ZeusRAM has a 6Gb/s SAS connection.  The combination of SAS pipeline limitations and the extra delays associated with synchronization across three devices means that you’re probably getting the expected performance from this.

If you want to go faster, you need to stripe the drives (at the risk of data loss!), or use devices that have a faster bus connection (e.g. PCIe cards.)

Using DRAM based technology doesn’t magically give you a fatter pipe, and writes to mirrors are not free — a 20% perf. penalty for mirrored writes is not entirely surprising.)



vMotion back to the disk pool maxes out at 4Gb/s.   If I turn off ZIL it goes to 10Gb in either direction.


Turning off ZIL invalidates your numbers, unless you measure them over a sustained duration that is long enough to ensure that all buffers (your entire ARC!  i.e. nearly your entire system memory!!) are filled and you’re actually keeping the drives sustained.  I don’t know of any performance benchmarks that actually do this.  All you’re really doing is testing your system memory and network pipes.

You wouldn’t run in production with the zil disabled, so don’t measure that either.  Measure what you’re going to *run*.

 - Garrett





-Chip


On Thu, May 8, 2014 at 11:12 AM, Schweiss, Chip <***@innovates.com> wrote:



On Thu, May 8, 2014 at 11:07 AM, Garrett D'Amore <***@damore.org> wrote:
Not if the SLOG is the same flash technology as the main pool vdevs.  (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD.  But that wasn’t what was described — the report was that even with a pool of ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes.  *That* is a surprising result when looked at naively.

- Garrett

Exactly!  The ZeusIOPS pool has 11 SSDs,  10 in the pool 1 as a spare.  When I first encounter the performance issue, I found turning off ZIL completely eliminated the problems and installing the spare a log device also solved the problem.   I wasn't expecting any of this behavior when I built the pool or a ZeusRAM probably would have been part of it initially.

ZFS really doesn't behave well on SSDs with out a separate log device for sync workloads.   It was quite the sinking feeling when I put close to $15K or SSDs in service and got horrible performance.  It was doing worse than the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to replace as our production VM datastore.

-Chip





-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 16:55:07 UTC
Permalink
On Thu, May 8, 2014 at 11:45 AM, Garrett D'Amore <***@damore.org> wrote:

> Stop measuring async numbers. They are utterly devoid of meaning.
> (Unless you’re just wanting to measure your network pipes, in which case
> go right ahead
. but you’re not measuring storage subsystem performance at
> that point.) The exception here would be if you’ve got a streaming
> workload of sufficient size. But we’ve been talking about random workloads
> here.
>

No I'm not necessarily talking about random workload, I'm focusing on sync
workloads. ALL VMware writes are sync even when doing vMotion. My point
was that a streaming write coming from a vMotion with all writes being sync
will stream significantly faster to an SSD pool with 1 ZeusRAM than a big
disk pool with 3 ZeusRAMs.

> These results are hardly suprising if your ZeusRAM drives are in a mirror.
> With a write (even SLOG), the writes have to be completed on all drives.
> The ZeusRAM has a 6Gb/s SAS connection. The combination of SAS pipeline
> limitations and the extra delays associated with synchronization across
> three devices means that you’re probably getting the expected performance
> from this.
>
Every one of my ZeusRAMs are multipathed across two controllers and don't
share any of the SAS paths on the SAS expander. They effectively have a
12Gb/s SAS path.

None of the ZeusRAM are mirrored. Again a minor risk in the case of power
failure or system crash. If at that moment the ZeusRAM fails I will have
to do my zpool import without the ZIL and lose the last transactions.
Nothing I can't recover from, but will take some additional work.


> If you want to go faster, you need to stripe the drives (at the risk of
> data loss!), or use devices that have a faster bus connection (e.g. PCIe
> cards.)
>

Yes the 3 ZeusRAM are striped. The steaming sync workload is out performed
by 1 ZeusRAM on the SSD pool.

'zpool iostat -v' clearly show much more total write load on the SSD pool's
ZeusRAM.

Both server heads are identical on the two pools.

Using DRAM based technology doesn’t magically give you a fatter pipe, and
> writes to mirrors are not free — a 20% perf. penalty for mirrored writes is
> not entirely surprising.)
>


Maybe not, but if ZFS got to the point of using NVDIMM based servers and
was able to keep all write caches on these NVDIMMs and recover as if it was
a log device, then the DRAM based technology would really give sync loads
some serious boosts. Probably wishful thinking.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 16:35:48 UTC
Permalink
-- 
Garrett D'Amore
Sent with Airmail

On May 8, 2014 at 9:13:11 AM, Schweiss, Chip (***@innovates.com) wrote:




On Thu, May 8, 2014 at 11:07 AM, Garrett D'Amore <***@damore.org> wrote:
Not if the SLOG is the same flash technology as the main pool vdevs.  (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD.  But that wasn’t what was described — the report was that even with a pool of ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes.  *That* is a surprising result when looked at naively.

- Garrett

Exactly!  The ZeusIOPS pool has 11 SSDs,  10 in the pool 1 as a spare.  When I first encounter the performance issue, I found turning off ZIL completely eliminated the problems and installing the spare a log device also solved the problem.   I wasn't expecting any of this behavior when I built the pool or a ZeusRAM probably would have been part of it initially.


How did you “turn off ZIL completely”?  zfs set sync=disabled ?  That’s a terrible idea for any non-scratch workspace.  Basically, you totally make a sync workload async, which while great for performance, breaks the POSIX write consistency promises.  Its fine as long as you never have a server crash or loss of power on the server or a drive failure.  But once any of those things occurs, you probably wind up losing the data that you’ve “promised” is already committed to stable storage.



ZFS really doesn't behave well on SSDs with out a separate log device for sync workloads.   It was quite the sinking feeling when I put close to $15K or SSDs in service and got horrible performance.  It was doing worse than the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to replace as our production VM datastore.
I suspect that this blanket statement is probably not completely correct.  I suspect that on certain workloads with certain drives, you’ll see some nasty latencies.  But these latencies are likely far less than you’d see with HDDs.  Throwing an SLOG device — especially something write-optimized like a ZeusRAM, totally changes the equation and hardly bears comparison.

- Garrett





-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 16:37:31 UTC
Permalink
On Thu, May 8, 2014 at 11:35 AM, Garrett D'Amore <***@damore.org> wrote:

>
> How did you “turn off ZIL completely”? zfs set sync=disabled ? That’s a
> terrible idea for any non-scratch workspace. Basically, you totally make a
> sync workload async, which while great for performance, breaks the POSIX
> write consistency promises. Its fine as long as you never have a server
> crash or loss of power on the server or a drive failure. But once any of
> those things occurs, you probably wind up losing the data that you’ve
> “promised” is already committed to stable storage.
>
Yes, there is risk involved. However, everything on these pools is backed
up and snapshot regularly. It sits in a well power protected data center
so I could certainly get things synced in case of a power problem.
sync=standard was put in place after these vMotion tests.

Keep in mind we are a research facility even these ZFS servers are part of
our research and risking some downtime is expected. The only data of ours
that is ever a single copy is when is first acquired at an MRI, then
immediately sent to two distinct systems.

We choose to build ZFS systems with some risks to get the maximum storage
space and performance for our dollar.


>
>
> ZFS really doesn't behave well on SSDs with out a separate log device for
> sync workloads. It was quite the sinking feeling when I put close to $15K
> or SSDs in service and got horrible performance. It was doing worse than
> the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to
> replace as our production VM datastore.
>
> I suspect that this blanket statement is probably not completely correct.
> I suspect that on certain workloads with certain drives, you’ll see some
> nasty latencies. But these latencies are likely far less than you’d see
> with HDDs. Throwing an SLOG device — especially something write-optimized
> like a ZeusRAM, totally changes the equation and hardly bears comparison.
>
> Fair enough. Sync workload plays a lot into this. But in every case I
seen thus far a separate log device increases the performance.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 17:01:48 UTC
Permalink
On May 8, 2014 at 9:37:54 AM, Schweiss, Chip (***@innovates.com) wrote:

On Thu, May 8, 2014 at 11:35 AM, Garrett D'Amore <***@damore.org> wrote:

How did you “turn off ZIL completely”?  zfs set sync=disabled ?  That’s a terrible idea for any non-scratch workspace.  Basically, you totally make a sync workload async, which while great for performance, breaks the POSIX write consistency promises.  Its fine as long as you never have a server crash or loss of power on the server or a drive failure.  But once any of those things occurs, you probably wind up losing the data that you’ve “promised” is already committed to stable storage.

Yes, there is risk involved.   However, everything on these pools is backed up and snapshot regularly.   It sits in a well power protected data center so I could certainly get things synced in case of a power problem.   sync=standard was put in place after these vMotion tests.   

Keep in mind we are a research facility even these ZFS servers are part of our research and risking some downtime is expected.   The only data of ours that is ever a single copy is when is first acquired at an MRI, then immediately sent to two distinct systems.

We choose to build ZFS systems with some risks to get the maximum storage space and performance for our dollar. 


Well in that case, if this is really “scratch” data, why bother at all with an SLOG or withe ZIL.  You’ll definitely get better perceived performance without it.   Again, its at a risk, but running a pool or group of datasets this way is perfectly reasonable if you can live with that risk. 




 


ZFS really doesn't behave well on SSDs with out a separate log device for sync workloads.   It was quite the sinking feeling when I put close to $15K or SSDs in service and got horrible performance.  It was doing worse than the 7.2K pool with 2TB of L2ARC and sTec s840z log that it was built to replace as our production VM datastore.
I suspect that this blanket statement is probably not completely correct.  I suspect that on certain workloads with certain drives, you’ll see some nasty latencies.  But these latencies are likely far less than you’d see with HDDs.  Throwing an SLOG device — especially something write-optimized like a ZeusRAM, totally changes the equation and hardly bears comparison.

Fair enough.  Sync workload plays a lot into this.   But in every case I seen thus far a separate log device increases the performance.


Also fair enough.  I guess I’ve never had enough cash on hand that I could afford to configure a high performance pool made up entirely of solid state storage (outside of the trivial pools made up of one or two SSDs in laptop and small server configs.)

I’d love to do some measurements with these larger configurations and see if we can isolate the bottlenecks.  I have some theories which I’ve already espoused here.  Sadly some of the tests I’d like to do, like building a pool entirely of ZeusRAMs, are probably not something I’m going to be able to do in the immediate future. :-)

- Garrett






-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schweiss, Chip via illumos-zfs
2014-05-08 16:59:21 UTC
Permalink
On Thu, May 8, 2014 at 12:01 PM, Garrett D'Amore <***@damore.org> wrote:

> I’d love to do some measurements with these larger configurations and see
> if we can isolate the bottlenecks. I have some theories which I’ve already
> espoused here. Sadly some of the tests I’d like to do, like building a
> pool entirely of ZeusRAMs, are probably not something I’m going to be able
> to do in the immediate future. :-)
>
> We're about to come to a period where our consumer SSD pool will be idle
for a while. I can arrange some access for you to experiment. :-) I'm
sure you can come up with more than I can about optimizing SSD pools.

-Chip



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 19:05:40 UTC
Permalink
I may be super busy. But let me know when it looks like the schedule is coming up and if I can make time for the testing I will. I really want to understand the bottlenecks.

Sent from my iPhone

> On May 8, 2014, at 9:59 AM, "Schweiss, Chip" <***@innovates.com> wrote:
>
>
>
>
>> On Thu, May 8, 2014 at 12:01 PM, Garrett D'Amore <***@damore.org> wrote:
>> I’d love to do some measurements with these larger configurations and see if we can isolate the bottlenecks. I have some theories which I’ve already espoused here. Sadly some of the tests I’d like to do, like building a pool entirely of ZeusRAMs, are probably not something I’m going to be able to do in the immediate future. :-)
> We're about to come to a period where our consumer SSD pool will be idle for a while. I can arrange some access for you to experiment. :-) I'm sure you can come up with more than I can about optimizing SSD pools.
>
> -Chip
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-08 22:23:14 UTC
Permalink
On 9/05/2014 2:59 am, Schweiss, Chip via illumos-zfs wrote:
>
>
>
> On Thu, May 8, 2014 at 12:01 PM, Garrett D'Amore <***@damore.org
> <mailto:***@damore.org>> wrote:
>
> I’d love to do some measurements with these larger configurations
> and see if we can isolate the bottlenecks. I have some theories
> which I’ve already espoused here. Sadly some of the tests I’d like
> to do, like building a pool entirely of ZeusRAMs, are probably not
> something I’m going to be able to do in the immediate future. :-)
>
> We're about to come to a period where our consumer SSD pool will be idle
> for a while. I can arrange some access for you to experiment. :-) I'm
> sure you can come up with more than I can about optimizing SSD pools.
>
> -Chip
>

I'm happy to commit our pre-production pool (24 x SSD's) to some
scientific time as well. Should have everything together running oi in a
couple of weeks.

After reading the posts about using a ZIL with the SSD's I think we will
look at getting a PCI-e based ZIL, don't want to lose a SAS port in the
front of the chassis and the iops on the SAS ZIL will not be quick enough.
Richard Elling via illumos-zfs
2014-05-09 00:50:55 UTC
Permalink
On May 8, 2014, at 9:07 AM, Garrett D'Amore via illumos-zfs <***@lists.illumos.org> wrote:

>>
>> A Flash device is capable of handling a few tens or hundreds of
>> thousands of writes per second. A DRAM module could easily keep up
>> with tens or hundreds of millions of operations per second. Does this
>> not explain the difference?
>
> Not if the SLOG is the same flash technology as the main pool vdevs. (Yes, I’d expect a ZeusRAM to beat the snot out of a traditional SSD. But that wasn’t what was described — the report was that even with a pool of ZeusIOPs drives, adding another one as a SLOG eliminated latency spikes. *That* is a surprising result when looked at naively.
>
>

This makes sense. The rule-of-thumb I use is that the latency of the slog should be
at least an order of magnitude lower than the pool drives. ZeusRAMs are on the order
of 50us. ZeusIOPS, like most Flash SSDs are slower, closer to 1ms.

Important note, if you remember nothing else, latency != 1/IOPS. Flash SSDs in particular
get high IOPS counts through parallelism. Nine women can't produce a baby in 1 month.
Latency rules performance, IOPS is for marketeers (or people who don't remember
Amdahl's Law)
-- richard

--

***@RichardElling.com
+1-760-896-4422












-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
aurfalien via illumos-zfs
2014-05-08 02:56:33 UTC
Permalink
Well this is instructing.

All i would contribute is that SSD is still bad on endurance and that if given a certain budget similar to an all SSD pool, I’d do SAS 15K drives and DDR3 based ZIL.

The folks at DDR Drive have a DDR based drive approach that plugs into a PCIe slot. There is another vendor having a a RAM drive that plugs into a DDR3 memory slot but I forgot who. Its new and either not out just yet or out any day now.

- aurf

"Janitorial Services"

On May 7, 2014, at 4:35 PM, Luke Iggleden <***@sisgroup.com.au> wrote:

> If you are running an all SSD pool would you bother about a ZIL? Most of the blogs / info out there is related to hybrid pools with rust.
>
> At first I thought it probably wouldn't be needed, but if we're running sync=always on our datasets, we could look at using a ZIL to take the writes off the vdevs and increase the lifespan of them?
>
> Our individual drives are capable of delivering 36k write iops (4k random) and 90k read.
>
> The work load is primarily read based, virtual machine block storage, some SQL transactions for Mysql/Postgres (e-commerce) and a massive (2tb) MS-MSQL DB that is mainly read hungry.
>
> My concern is a ZIL will slow down the writes with sync on the datasets. What would you guys roll with or what have you tried (hopefully) and had success with - in an all SSD pool?
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/24758408-e0240379
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 03:22:45 UTC
Permalink
There are things you miss. DDR drive and similar configs are not cluster safe. But if you can deal with that skip the 15k drives. The are expensive and power hungry. A little bit of ssd means you can use 7200 drives without perf probs.

Sent from my iPhone

> On May 7, 2014, at 7:56 PM, "aurfalien via illumos-zfs" <***@lists.illumos.org> wrote:
>
> Well this is instructing.
>
> All i would contribute is that SSD is still bad on endurance and that if given a certain budget similar to an all SSD pool, I’d do SAS 15K drives and DDR3 based ZIL.
>
> The folks at DDR Drive have a DDR based drive approach that plugs into a PCIe slot. There is another vendor having a a RAM drive that plugs into a DDR3 memory slot but I forgot who. Its new and either not out just yet or out any day now.
>
> - aurf
>
> "Janitorial Services"
>
>> On May 7, 2014, at 4:35 PM, Luke Iggleden <***@sisgroup.com.au> wrote:
>>
>> If you are running an all SSD pool would you bother about a ZIL? Most of the blogs / info out there is related to hybrid pools with rust.
>>
>> At first I thought it probably wouldn't be needed, but if we're running sync=always on our datasets, we could look at using a ZIL to take the writes off the vdevs and increase the lifespan of them?
>>
>> Our individual drives are capable of delivering 36k write iops (4k random) and 90k read.
>>
>> The work load is primarily read based, virtual machine block storage, some SQL transactions for Mysql/Postgres (e-commerce) and a massive (2tb) MS-MSQL DB that is mainly read hungry.
>>
>> My concern is a ZIL will slow down the writes with sync on the datasets. What would you guys roll with or what have you tried (hopefully) and had success with - in an all SSD pool?
>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/24758408-e0240379
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
> illumos-zfs | Archives | Modify Your Subscription



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
aurfalien via illumos-zfs
2014-05-08 03:32:40 UTC
Permalink
A bit vague on the reply like;

"A little bit of ssd means you can use 7200 drives without perf probs”

There is no free lunch and your IO is ultimately bound by your disk subsystem. Some think that ZFS will let you cheat this but in the end ZFS is first and foremost about data reliability, then scalability and some were after is performance.

Don’t think a simple SSD will solve this. We can argue all day long about this so no need to expand on it.

Now your other statement;

"DDR drive and similar configs are not cluster safe.”

May or may not be true as I have no ZFS cluster experience. So if you would, expand on this so I can learn something in this area.


- aurf

"Janitorial Services"

On May 7, 2014, at 8:22 PM, Garrett D'Amore <***@damore.org> wrote:

> There are things you miss. DDR drive and similar configs are not cluster safe. But if you can deal with that skip the 15k drives. The are expensive and power hungry. A little bit of ssd means you can use 7200 drives without perf probs.
>
> Sent from my iPhone
>
> On May 7, 2014, at 7:56 PM, "aurfalien via illumos-zfs" <***@lists.illumos.org> wrote:
>
>> Well this is instructing.
>>
>> All i would contribute is that SSD is still bad on endurance and that if given a certain budget similar to an all SSD pool, I’d do SAS 15K drives and DDR3 based ZIL.
>>
>> The folks at DDR Drive have a DDR based drive approach that plugs into a PCIe slot. There is another vendor having a a RAM drive that plugs into a DDR3 memory slot but I forgot who. Its new and either not out just yet or out any day now.
>>
>> - aurf
>>
>> "Janitorial Services"
>>
>> On May 7, 2014, at 4:35 PM, Luke Iggleden <***@sisgroup.com.au> wrote:
>>
>>> If you are running an all SSD pool would you bother about a ZIL? Most of the blogs / info out there is related to hybrid pools with rust.
>>>
>>> At first I thought it probably wouldn't be needed, but if we're running sync=always on our datasets, we could look at using a ZIL to take the writes off the vdevs and increase the lifespan of them?
>>>
>>> Our individual drives are capable of delivering 36k write iops (4k random) and 90k read.
>>>
>>> The work load is primarily read based, virtual machine block storage, some SQL transactions for Mysql/Postgres (e-commerce) and a massive (2tb) MS-MSQL DB that is mainly read hungry.
>>>
>>> My concern is a ZIL will slow down the writes with sync on the datasets. What would you guys roll with or what have you tried (hopefully) and had success with - in an all SSD pool?
>>>
>>>
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/24758408-e0240379
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>
>> illumos-zfs | Archives | Modify Your Subscription




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Ian Collins via illumos-zfs
2014-05-08 04:48:32 UTC
Permalink
aurfalien via illumos-zfs wrote:
> A bit vague on the reply like;
>
> "A little bit of ssd means you can use 7200 drives without perf probs”
>
> There is no free lunch and your IO is ultimately bound by your disk
> subsystem. Some think that ZFS will let you cheat this but in the end
> ZFS is first and foremost about data reliability, then scalability and
> some were after is performance.
>

It will "let you cheat" thanks to write consolidation. If your workload
is heavy on synchronous write IOPs, a fast log (that can mach the write
throughput) will make a huge difference to the pool performance.

--
Ian.
Tim Cook via illumos-zfs
2014-05-08 05:11:07 UTC
Permalink
On Wed, May 7, 2014 at 11:48 PM, Ian Collins via illumos-zfs <
***@lists.illumos.org> wrote:

> aurfalien via illumos-zfs wrote:
>
>> A bit vague on the reply like;
>>
>> "A little bit of ssd means you can use 7200 drives without perf probs”
>>
>> There is no free lunch and your IO is ultimately bound by your disk
>> subsystem. Some think that ZFS will let you cheat this but in the end ZFS
>> is first and foremost about data reliability, then scalability and some
>> were after is performance.
>>
>>
> It will "let you cheat" thanks to write consolidation. If your workload
> is heavy on synchronous write IOPs, a fast log (that can mach the write
> throughput) will make a huge difference to the pool performance.
>
> --
> Ian.
>
>

Everything has to hit a final resting place eventually. An SSD cache will
let you *BURST*, it will not let you cheat on a sustained workload. Too
many people make that mistake, and too many vendors sell it as the panacea
for the masses.

--Tim



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-08 06:03:02 UTC
Permalink
If you are limited by spindles (usually streaming workloads) then you are right. But many workloads can be helped by use of ssd to eliminate latency and turning random write workloads into the streaming work that hard disks excel at.

Read cache is another matter of course. That's a working set size analysis problem. Again utterly useless to a pure streaming workload but very few read workloads have that characteristic.

Sent from my iPhone

> On May 7, 2014, at 10:11 PM, "Tim Cook via illumos-zfs" <***@lists.illumos.org> wrote:
>
>
>
>
>> On Wed, May 7, 2014 at 11:48 PM, Ian Collins via illumos-zfs <***@lists.illumos.org> wrote:
>> aurfalien via illumos-zfs wrote:
>>> A bit vague on the reply like;
>>>
>>> "A little bit of ssd means you can use 7200 drives without perf probs”
>>>
>>> There is no free lunch and your IO is ultimately bound by your disk subsystem. Some think that ZFS will let you cheat this but in the end ZFS is first and foremost about data reliability, then scalability and some were after is performance.
>>
>> It will "let you cheat" thanks to write consolidation. If your workload is heavy on synchronous write IOPs, a fast log (that can mach the write throughput) will make a huge difference to the pool performance.
>>
>> --
>> Ian.
>
>
> Everything has to hit a final resting place eventually. An SSD cache will let you *BURST*, it will not let you cheat on a sustained workload. Too many people make that mistake, and too many vendors sell it as the panacea for the masses.
>
> --Tim
>
> illumos-zfs | Archives | Modify Your Subscription



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Ian Collins via illumos-zfs
2014-05-08 07:24:19 UTC
Permalink
Tim Cook via illumos-zfs wrote:
>
>
>
> On Wed, May 7, 2014 at 11:48 PM, Ian Collins via illumos-zfs
> <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>
> aurfalien via illumos-zfs wrote:
>
> A bit vague on the reply like;
>
> "A little bit of ssd means you can use 7200 drives without
> perf probs”
>
> There is no free lunch and your IO is ultimately bound by your
> disk subsystem. Some think that ZFS will let you cheat this
> but in the end ZFS is first and foremost about data
> reliability, then scalability and some were after is performance.
>
>
> It will "let you cheat" thanks to write consolidation. If your
> workload is heavy on synchronous write IOPs, a fast log (that can
> mach the write throughput) will make a huge difference to the pool
> performance.
>
>
> Everything has to hit a final resting place eventually. An SSD cache
> will let you *BURST*, it will not let you cheat on a sustained
> workload. Too many people make that mistake, and too many vendors
> sell it as the panacea for the masses.
>

Oh I agree with everything having to hit a final resting place. But
even a modest (in throughput) synchronous random write load, such as a
KVM guest, can generate more IOPs than the drives in a small to medium
sized pool can handle. SSD log devices are good at soaking up those IOPs.

--
Ian.
Bob Friesenhahn via illumos-zfs
2014-05-08 14:15:17 UTC
Permalink
On Thu, 8 May 2014, Tim Cook via illumos-zfs wrote:
>
> Everything has to hit a final resting place eventually.  An SSD cache will let you *BURST*, it will not let you cheat on a
> sustained workload.  Too many people make that mistake, and too many vendors sell it as the panacea for the masses.

The effective I/O efficiency of zfs's normal async TXG sync is quite
good given that it is able to eliminate duplicate (overlapping in
time) writes and writes out full blocks. With suitable pool devices,
it is able to write gigabytes per second. A dedicated zil device with
suitable properties will allow the zfs pool to sustain a very high
sync workload since the deferred async writes are so much more
efficient. In fact, it is best described as effectively allowing you
to cheat on a sustained workload.

The typical problem encountered is that the device selected for the
dedicated zil is not capable of handling the sustained load.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
Jim Klimov via illumos-zfs
2014-05-08 08:41:19 UTC
Permalink
8 мая 2014 г. 7:11:07 CEST, Tim Cook via illumos-zfs <***@lists.illumos.org> пишет:
>On Wed, May 7, 2014 at 11:48 PM, Ian Collins via illumos-zfs <
>***@lists.illumos.org> wrote:
>
>> aurfalien via illumos-zfs wrote:
>>
>>> A bit vague on the reply like;
>>>
>>> "A little bit of ssd means you can use 7200 drives without perf
>probs”
>>>
>>> There is no free lunch and your IO is ultimately bound by your disk
>>> subsystem. Some think that ZFS will let you cheat this but in the
>end ZFS
>>> is first and foremost about data reliability, then scalability and
>some
>>> were after is performance.
>>>
>>>
>> It will "let you cheat" thanks to write consolidation. If your
>workload
>> is heavy on synchronous write IOPs, a fast log (that can mach the
>write
>> throughput) will make a huge difference to the pool performance.
>>
>> --
>> Ian.
>>
>>
>
>Everything has to hit a final resting place eventually. An SSD cache
>will
>let you *BURST*, it will not let you cheat on a sustained workload.
>Too
>many people make that mistake, and too many vendors sell it as the
>panacea
>for the masses.
>
>--Tim
>
>
>
>-------------------------------------------
>illumos-zfs
>Archives: https://www.listbox.com/member/archive/182191/=now
>RSS Feed:
>https://www.listbox.com/member/archive/rss/182191/22497542-d75cd9d9
>Modify Your Subscription:
>https://www.listbox.com/member/?&
>Powered by Listbox: http://www.listbox.com

True, but an ssd cache translates 'final' writes into sequential io that performs much faster than random one and does not disrupt other (async) operations. Though this is more of a concern for spinning rust. However, coalesced large writes to the main pool instead of truly random small IOs may also help on performance and reliability of main pool ssds as well.
Jim
--
Typos courtesy of K-9 Mail on my Samsung Android
Bob Friesenhahn via illumos-zfs
2014-05-08 19:52:00 UTC
Permalink
On Thu, 8 May 2014, Jim Klimov via illumos-zfs wrote:
>
> True, but an ssd cache translates 'final' writes into sequential io
> that performs much faster than random one and does not disrupt other
> (async) operations. Though this is more of a concern for spinning
> rust. However, coalesced large writes to the main pool instead of
> truly random small IOs may also help on performance and reliability
> of main pool ssds as well.

This is a statement that I very much agree with. The SLOG device is
written sequentially with each record only containing the data to be
written. The zfs pool is partitioned in blocks based on the zfs block
size, which may be additionally subdivided (at device level) by raidzN
requirements. Zfs uses COW so each block must be fully re-written
even if only part of it needed to be updated. The SLOG reduces write
activity and pool fragmentation.

There is much less thrashing if the dedicated SLOG device takes the
load and then zfs can write nicely ordered data in each TXG.

It is quite common for the same file offsets to be written over and
over and these overwrites are mostly discarded (sometimes entirely
discarded) due to the lazy operation of TXG commits.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Luke Iggleden via illumos-zfs
2014-05-08 04:38:28 UTC
Permalink
The issue with using 15K SAS is that the iops are crap.

I've read with illumos its not recommended to push past 128GB of RAM and
not to use more than 3-4 L2ARC cache devices. (Andrew's Nex7 Blog)

Sure, I could put 4 x 1TB SATA SSD l2arc drives in the case on AHCI,
which would get us close to delivering *most* of the blocks cached from
Arc or L2, but not all. Then there is the issue of warming the cache up
on start-up. Anyone want to guess how long it would take to warm up a
4TB L2ARC, would 128GB of RAM be enough to index this arc? (cant be
bothered looking for the math again)

Durability of enterprise SSD is very good these days, Intel DC3700 is
claiming up to 10 drive writes per day.

The seagate ssd's we're going to be using will do 500TB over the life of
each the drive, 2630TB if written sequentially.

In 18 months the SAS disks in our existing SSD+7200 install, is showing
1.8TB/m written per month. This existing setup is a different workload
and a much larger install. Let's say we go for the whole 60 months
(lifetime) with this, at 1.8 TB /m that = 108TB written to the disk.
Under the product data sheet at least, this means we're really only
using 1/5th of the drive capacity at worst.

Reads aren't a penalty with SSD and we're sure to get massive reads out
of this box.

I'll add this as well, not for a second will I trust a product data
sheet from a vendor when it comes to reliability, we will definitely be
replicating this data set to spinning rust hourly to protect us if in
the event we lose mirror sets due to multiple SSD's popping at once.




On 8/05/2014 12:56 pm, aurfalien via illumos-zfs wrote:
> Well this is instructing.
>
> All i would contribute is that SSD is still bad on endurance and that if
> given a certain budget similar to an all SSD pool, I’d do SAS 15K drives
> and DDR3 based ZIL.
>
> The folks at DDR Drive have a DDR based drive approach that plugs into a
> PCIe slot. There is another vendor having a a RAM drive that plugs
> into a DDR3 memory slot but I forgot who. Its new and either not out
> just yet or out any day now.
>
> - aurf
>
> "Janitorial Services"
>
> On May 7, 2014, at 4:35 PM, Luke Iggleden <***@sisgroup.com.au
> <mailto:***@sisgroup.com.au>> wrote:
>
>> If you are running an all SSD pool would you bother about a ZIL? Most
>> of the blogs / info out there is related to hybrid pools with rust.
>>
>> At first I thought it probably wouldn't be needed, but if we're
>> running sync=always on our datasets, we could look at using a ZIL to
>> take the writes off the vdevs and increase the lifespan of them?
>>
>> Our individual drives are capable of delivering 36k write iops (4k
>> random) and 90k read.
>>
>> The work load is primarily read based, virtual machine block storage,
>> some SQL transactions for Mysql/Postgres (e-commerce) and a massive
>> (2tb) MS-MSQL DB that is mainly read hungry.
>>
>> My concern is a ZIL will slow down the writes with sync on the
>> datasets. What would you guys roll with or what have you tried
>> (hopefully) and had success with - in an all SSD pool?
>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed:
>> https://www.listbox.com/member/archive/rss/182191/24758408-e0240379
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
> *illumos-zfs* | Archives
> <https://www.listbox.com/member/archive/182191/=now>
> <https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097> |
> Modify
> <https://www.listbox.com/member/?&>
> Your Subscription [Powered by Listbox] <http://www.listbox.com>
>
Jim Klimov via illumos-zfs
2014-05-08 08:27:00 UTC
Permalink
8 мая 2014 г. 6:38:28 CEST, Luke Iggleden via illumos-zfs <***@lists.illumos.org> пишет:
>The issue with using 15K SAS is that the iops are crap.
>
>I've read with illumos its not recommended to push past 128GB of RAM
>and
>not to use more than 3-4 L2ARC cache devices. (Andrew's Nex7 Blog)
>
>Sure, I could put 4 x 1TB SATA SSD l2arc drives in the case on AHCI,
>which would get us close to delivering *most* of the blocks cached from
>
>Arc or L2, but not all. Then there is the issue of warming the cache up
>
>on start-up. Anyone want to guess how long it would take to warm up a
>4TB L2ARC, would 128GB of RAM be enough to index this arc? (cant be
>bothered looking for the math again)
>
>Durability of enterprise SSD is very good these days, Intel DC3700 is
>claiming up to 10 drive writes per day.
>
>The seagate ssd's we're going to be using will do 500TB over the life
>of
>each the drive, 2630TB if written sequentially.
>
>In 18 months the SAS disks in our existing SSD+7200 install, is showing
>
>1.8TB/m written per month. This existing setup is a different workload
>and a much larger install. Let's say we go for the whole 60 months
>(lifetime) with this, at 1.8 TB /m that = 108TB written to the disk.
>Under the product data sheet at least, this means we're really only
>using 1/5th of the drive capacity at worst.
>
>Reads aren't a penalty with SSD and we're sure to get massive reads out
>
>of this box.
>
>I'll add this as well, not for a second will I trust a product data
>sheet from a vendor when it comes to reliability, we will definitely be
>
>replicating this data set to spinning rust hourly to protect us if in
>the event we lose mirror sets due to multiple SSD's popping at once.
>
>
>
>
>On 8/05/2014 12:56 pm, aurfalien via illumos-zfs wrote:
>> Well this is instructing.
>>
>> All i would contribute is that SSD is still bad on endurance and that
>if
>> given a certain budget similar to an all SSD pool, I’d do SAS 15K
>drives
>> and DDR3 based ZIL.
>>
>> The folks at DDR Drive have a DDR based drive approach that plugs
>into a
>> PCIe slot. There is another vendor having a a RAM drive that plugs
>> into a DDR3 memory slot but I forgot who. Its new and either not out
>> just yet or out any day now.
>>
>> - aurf
>>
>> "Janitorial Services"
>>
>> On May 7, 2014, at 4:35 PM, Luke Iggleden <***@sisgroup.com.au
>> <mailto:***@sisgroup.com.au>> wrote:
>>
>>> If you are running an all SSD pool would you bother about a ZIL?
>Most
>>> of the blogs / info out there is related to hybrid pools with rust.
>>>
>>> At first I thought it probably wouldn't be needed, but if we're
>>> running sync=always on our datasets, we could look at using a ZIL to
>>> take the writes off the vdevs and increase the lifespan of them?
>>>
>>> Our individual drives are capable of delivering 36k write iops (4k
>>> random) and 90k read.
>>>
>>> The work load is primarily read based, virtual machine block
>storage,
>>> some SQL transactions for Mysql/Postgres (e-commerce) and a massive
>>> (2tb) MS-MSQL DB that is mainly read hungry.
>>>
>>> My concern is a ZIL will slow down the writes with sync on the
>>> datasets. What would you guys roll with or what have you tried
>>> (hopefully) and had success with - in an all SSD pool?
>>>
>>>
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed:
>>> https://www.listbox.com/member/archive/rss/182191/24758408-e0240379
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>
>> *illumos-zfs* | Archives
>> <https://www.listbox.com/member/archive/182191/=now>
>> <https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097>
>|
>> Modify
>> <https://www.listbox.com/member/?&>
>> Your Subscription [Powered by Listbox] <http://www.listbox.com>
>>
>
>
>
>-------------------------------------------
>illumos-zfs
>Archives: https://www.listbox.com/member/archive/182191/=now
>RSS Feed:
>https://www.listbox.com/member/archive/rss/182191/22497542-d75cd9d9
>Modify Your Subscription:
>https://www.listbox.com/member/?&
>Powered by Listbox: http://www.listbox.com

Links to sas3 reviews were posted recently, and the hgst/intel device models in the review claim 2, 10 or 25 full disk rewrites daily over 5 years, translating to 14 or 36PB lifetime writes for the top 800GB models. Also, it seems that the more reliable parts perform faster on random writes. I guess the only downside would be pricing ;)
--
Typos courtesy of K-9 Mail on my Samsung Android
Schlacta, Christ via illumos-zfs
2014-05-08 19:01:41 UTC
Permalink
One point that all the other people have missed is that not all ssds
provide a guarantee on write integrity. To use consumer devices as an
example, a pool of Samsung ssds would provide excellent read and write
iops, but in a power failure situation, provides no guarantee that data
synced to disk has indeed been synced.

In these cases, using a device with super caps or other emergency flush
technology, such as a crucial m4 for slog would provide two benefits:
firstly, the m4 does provide the guarantee that writes, once synced, are
permanent. Secondly, zfs first writes data to the zil, then again to it's
final destination on disk.

Assuming again that you're using the same consumer devices from above, I
conceive that an ideal pool might consist of a dozen or more raidz2 vdevs
of the Samsung ssds to provide longevity, resilience, and failure
protection. Then attached would be a small number, I'm thinking 2-4 per
dozen vdev, mirrors consisting of 2 of the crucial style disks with write
consistency verifications, and soaking up the write amplification.
Using the excess capacity on the crucial or some zram depending on your
budget for an l2arc would also improve read times, especially for streaming
or repetitive work loads, while freeing up the slightly slower vdevs to
handle the less predictable portions of the work load
On May 7, 2014 7:34 PM, "Luke Iggleden" <***@sisgroup.com.au> wrote:

> If you are running an all SSD pool would you bother about a ZIL? Most of
> the blogs / info out there is related to hybrid pools with rust.
>
> At first I thought it probably wouldn't be needed, but if we're running
> sync=always on our datasets, we could look at using a ZIL to take the
> writes off the vdevs and increase the lifespan of them?
>
> Our individual drives are capable of delivering 36k write iops (4k random)
> and 90k read.
>
> The work load is primarily read based, virtual machine block storage, some
> SQL transactions for Mysql/Postgres (e-commerce) and a massive (2tb)
> MS-MSQL DB that is mainly read hungry.
>
> My concern is a ZIL will slow down the writes with sync on the datasets.
> What would you guys roll with or what have you tried (hopefully) and had
> success with - in an all SSD pool?
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
> 23054485-60ad043a
> Modify Your Subscription: https://www.listbox.com/
> member/?&
> Powered by Listbox: http://www.listbox.com
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-07 04:45:03 UTC
Permalink
Yes. illumos and sata directly connected is fine.

Don't mix sata and SAS. Use a sata port (AHCI) for your sata ssd drives.

A bad disk may do bad things in the pool if the pool is not redundantly configured. But if it is you should be fine.

Sent from my iPhone

> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <***@lists.illumos.org> wrote:
>
>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>> If you want to use SAS expanders, you need real SAS end devices. An
>> alternative is to do SATA (no interposers) with direct-attach. We've
>> had reasonable success with that configuration at Joyent using Intel
>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>> haven't used the Seagate "PRO" model you're considering; the only
>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>
> Do you remember what storage bays you used at Joyent?
>
> Using something like the intel 24 port jbod as Chip suggested, means we have to use 6 x External SAS cables connected to a single host. Not ideal, but I suppose we could make that work if we could get a 'yes, illumos and sata directly connected is fine' Seems that isn't the case either with others noting that a disk can bring down the whole zpool.
>
> If we use sata direct connect with an external storage bay, then we lose the ability to provide a fail over mechanism if we need to upgrade oi or if it crashes? I don't like 3am runs to the DC any more and I don't really want to be thinking about what ifs before I go to sleep at night ;)
>
> Seems everywhere you turn, there is a gotchya with this. I'd love to be able to go straight to some SAS SSD's, but the reality is the cost per GB is Double and the performance of the flash does not scale with the Dollar.
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-07 05:41:55 UTC
Permalink
What about directly connected to a HBA like an LSI, with breakout
cables? Each SATA disk has its own lane and will talk sata to the HBA.

No expander, no issue correct?

http://www.supermicro.com/products/system/2U/2027/SSG-2027R-AR24.cfm
Backplane
SAS2 / SATA3 direct attached backplane



On 7/05/2014 2:45 pm, Garrett D'Amore via illumos-zfs wrote:
> Yes. illumos and sata directly connected is fine.
>
> Don't mix sata and SAS. Use a sata port (AHCI) for your sata ssd drives.
>
> A bad disk may do bad things in the pool if the pool is not redundantly configured. But if it is you should be fine.
>
> Sent from my iPhone
>
>> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <***@lists.illumos.org> wrote:
>>
>>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>>> If you want to use SAS expanders, you need real SAS end devices. An
>>> alternative is to do SATA (no interposers) with direct-attach. We've
>>> had reasonable success with that configuration at Joyent using Intel
>>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>>> haven't used the Seagate "PRO" model you're considering; the only
>>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>>
>> Do you remember what storage bays you used at Joyent?
>>
>> Using something like the intel 24 port jbod as Chip suggested, means we have to use 6 x External SAS cables connected to a single host. Not ideal, but I suppose we could make that work if we could get a 'yes, illumos and sata directly connected is fine' Seems that isn't the case either with others noting that a disk can bring down the whole zpool.
>>
>> If we use sata direct connect with an external storage bay, then we lose the ability to provide a fail over mechanism if we need to upgrade oi or if it crashes? I don't like 3am runs to the DC any more and I don't really want to be thinking about what ifs before I go to sleep at night ;)
>>
>> Seems everywhere you turn, there is a gotchya with this. I'd love to be able to go straight to some SAS SSD's, but the reality is the cost per GB is Double and the performance of the flash does not scale with the Dollar.
>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
>
Schweiss, Chip via illumos-zfs
2014-05-07 12:31:52 UTC
Permalink
On Wed, May 7, 2014 at 12:41 AM, Luke Iggleden via illumos-zfs <
***@lists.illumos.org> wrote:

> What about directly connected to a HBA like an LSI, with breakout cables?
> Each SATA disk has its own lane and will talk sata to the HBA.
>
> No expander, no issue correct?
>
> http://www.supermicro.com/products/system/2U/2027/SSG-2027R-AR24.cfm
> Backplane
> SAS2 / SATA3 direct attached backplane
>
>
I thought that would be safe when I designed a storage system that used
direct connection to each drive via LSI HBAs in an Intel JBOD where I
bypassed the SAS expanders.
http://www.bigdatajunkie.com/index.php/11-hardware/jbods/15-intel-jbod-ripe-for-ssds

All the SSDs in here are connected to L2ARC and are on an independent HBA
from the disk pool. When one interposer took a dive the driver kept
resetting it and the driver ended up taking a dive, taking down the whole
pool.

This is definitely an OS/driver issue. The correct thing to do would be
first drop the drive, then then HBA, the driver needs to stay alive at
nearly all cost since it is handling multiple paths to potentially multiple
pools.

This type of failure is what disturbs me the most about ZFS on Illumos.
Granted the failure was triggered by an Interposer but the same problem
occurs if a SAS disk becomes erratic. I've seem the same system log
messages from SAS disks when they were having sector read problems. When
this is occurring the entire pool is unresponsive. I've seen this go on
for as long as 20 minutes with SAS disks in a pure SAS system. It only
stopped when I offlined the offending disk. When a disk is unresponsive
for even a few seconds it needs to be kicked from the pool unless there is
no more redundancy.

-Chip


>
>
> On 7/05/2014 2:45 pm, Garrett D'Amore via illumos-zfs wrote:
>
>> Yes. illumos and sata directly connected is fine.
>>
>> Don't mix sata and SAS. Use a sata port (AHCI) for your sata ssd drives.
>>
>> A bad disk may do bad things in the pool if the pool is not redundantly
>> configured. But if it is you should be fine.
>>
>> Sent from my iPhone
>>
>> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <
>>> ***@lists.illumos.org> wrote:
>>>
>>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>>>> If you want to use SAS expanders, you need real SAS end devices. An
>>>> alternative is to do SATA (no interposers) with direct-attach. We've
>>>> had reasonable success with that configuration at Joyent using Intel
>>>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>>>> haven't used the Seagate "PRO" model you're considering; the only
>>>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>>>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>>>>
>>>
>>> Do you remember what storage bays you used at Joyent?
>>>
>>> Using something like the intel 24 port jbod as Chip suggested, means we
>>> have to use 6 x External SAS cables connected to a single host. Not ideal,
>>> but I suppose we could make that work if we could get a 'yes, illumos and
>>> sata directly connected is fine' Seems that isn't the case either with
>>> others noting that a disk can bring down the whole zpool.
>>>
>>> If we use sata direct connect with an external storage bay, then we lose
>>> the ability to provide a fail over mechanism if we need to upgrade oi or if
>>> it crashes? I don't like 3am runs to the DC any more and I don't really
>>> want to be thinking about what ifs before I go to sleep at night ;)
>>>
>>> Seems everywhere you turn, there is a gotchya with this. I'd love to be
>>> able to go straight to some SAS SSD's, but the reality is the cost per GB
>>> is Double and the performance of the flash does not scale with the Dollar.
>>>
>>>
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>>> 22035932-85c5d227
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>> 26029255-3afb4097
>>
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>>
>>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
> 21878139-69539aca
> Modify Your Subscription: https://www.listbox.com/
> member/?&
> Powered by Listbox: http://www.listbox.com
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Schlacta, Christ via illumos-zfs
2014-05-07 13:58:47 UTC
Permalink
I've been complaining about this data/SAS issue for quite some time. At
this point, sata is effectively deprecated. It's continued existence is
causing significant harm to the hard drive market place. It's only purpose
is to artificially differentiate between two unnecessarily distinct markets
so that drive makers can have some half assed excuse to charge a convenient
markup.

The practical result, however, is that development resources are spent
refining and reimplementing two different interfaces with two nearly
identical feature sets that serve the same purpose. The end result is that
we have two separate, incompatible, half baked interfaces for each drive,
with half the eyes on the source and twice the bugs they should have.

There's no *good* reason that drive makers and board makers and card makers
can't decide to switch to all SAS drives next generation, and differentiate
between enterprise and consumer drives solely by reliability rating,
performance point, and silly firmware compile time settings like
user-modifiable tler, accessible cache, reserved blocks, rait, etc.

The only practical upshots of such a change are that 1) *every* drive would
mandatorily have a wwn (most do now anyway), and 2) cheap knockoff
manufacturers would have a harder time peddling their crap, 3) overall
firmware quality on hbas and drives would skyrocket, and would have a
trickle down effect on things like expanders and raid adapters due to
increased development resources.

Oh, and low end SAS hbas (4-8 drives, raid 0,1,10 only) would come down in
price some due to the increased demand over the first few years
On May 7, 2014 5:32 AM, "Schweiss, Chip via illumos-zfs" <
***@lists.illumos.org> wrote:

> On Wed, May 7, 2014 at 12:41 AM, Luke Iggleden via illumos-zfs <
> ***@lists.illumos.org> wrote:
>
>> What about directly connected to a HBA like an LSI, with breakout cables?
>> Each SATA disk has its own lane and will talk sata to the HBA.
>>
>> No expander, no issue correct?
>>
>> http://www.supermicro.com/products/system/2U/2027/SSG-2027R-AR24.cfm
>> Backplane
>> SAS2 / SATA3 direct attached backplane
>>
>>
> I thought that would be safe when I designed a storage system that used
> direct connection to each drive via LSI HBAs in an Intel JBOD where I
> bypassed the SAS expanders.
> http://www.bigdatajunkie.com/index.php/11-hardware/jbods/15-intel-jbod-ripe-for-ssds
>
> All the SSDs in here are connected to L2ARC and are on an independent HBA
> from the disk pool. When one interposer took a dive the driver kept
> resetting it and the driver ended up taking a dive, taking down the whole
> pool.
>
> This is definitely an OS/driver issue. The correct thing to do would be
> first drop the drive, then then HBA, the driver needs to stay alive at
> nearly all cost since it is handling multiple paths to potentially multiple
> pools.
>
> This type of failure is what disturbs me the most about ZFS on Illumos.
> Granted the failure was triggered by an Interposer but the same problem
> occurs if a SAS disk becomes erratic. I've seem the same system log
> messages from SAS disks when they were having sector read problems. When
> this is occurring the entire pool is unresponsive. I've seen this go on
> for as long as 20 minutes with SAS disks in a pure SAS system. It only
> stopped when I offlined the offending disk. When a disk is unresponsive
> for even a few seconds it needs to be kicked from the pool unless there is
> no more redundancy.
>
> -Chip
>
>
>>
>>
>> On 7/05/2014 2:45 pm, Garrett D'Amore via illumos-zfs wrote:
>>
>>> Yes. illumos and sata directly connected is fine.
>>>
>>> Don't mix sata and SAS. Use a sata port (AHCI) for your sata ssd drives.
>>>
>>> A bad disk may do bad things in the pool if the pool is not redundantly
>>> configured. But if it is you should be fine.
>>>
>>> Sent from my iPhone
>>>
>>> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <
>>>> ***@lists.illumos.org> wrote:
>>>>
>>>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>>>>> If you want to use SAS expanders, you need real SAS end devices. An
>>>>> alternative is to do SATA (no interposers) with direct-attach. We've
>>>>> had reasonable success with that configuration at Joyent using Intel
>>>>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>>>>> haven't used the Seagate "PRO" model you're considering; the only
>>>>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>>>>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>>>>>
>>>>
>>>> Do you remember what storage bays you used at Joyent?
>>>>
>>>> Using something like the intel 24 port jbod as Chip suggested, means we
>>>> have to use 6 x External SAS cables connected to a single host. Not ideal,
>>>> but I suppose we could make that work if we could get a 'yes, illumos and
>>>> sata directly connected is fine' Seems that isn't the case either with
>>>> others noting that a disk can bring down the whole zpool.
>>>>
>>>> If we use sata direct connect with an external storage bay, then we
>>>> lose the ability to provide a fail over mechanism if we need to upgrade oi
>>>> or if it crashes? I don't like 3am runs to the DC any more and I don't
>>>> really want to be thinking about what ifs before I go to sleep at night ;)
>>>>
>>>> Seems everywhere you turn, there is a gotchya with this. I'd love to be
>>>> able to go straight to some SAS SSD's, but the reality is the cost per GB
>>>> is Double and the performance of the flash does not scale with the Dollar.
>>>>
>>>>
>>>>
>>>>
>>>> -------------------------------------------
>>>> illumos-zfs
>>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>>>> 22035932-85c5d227
>>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>>> Powered by Listbox: http://www.listbox.com
>>>>
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>>> 26029255-3afb4097
>>>
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>>
>>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>> 21878139-69539aca
>> Modify Your Subscription: https://www.listbox.com/member/?&id_
>> secret=21878139-61e37d3e <https://www.listbox.com/member/?&>
>> Powered by Listbox: http://www.listbox.com
>>
>
> *illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
> <https://www.listbox.com/member/archive/rss/182191/23054485-60ad043a> |
> Modify<https://www.listbox.com/member/?&>Your Subscription
> <http://www.listbox.com>
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-07 14:36:22 UTC
Permalink
Sounds like you hit an LSI firmware or driver bug to me. Would love to examine further if I had time. I have not heard if people having these problems with AHCI.

Sent from my iPhone

> On May 7, 2014, at 5:31 AM, "Schweiss, Chip via illumos-zfs" <***@lists.illumos.org> wrote:
>
>> On Wed, May 7, 2014 at 12:41 AM, Luke Iggleden via illumos-zfs <***@lists.illumos.org> wrote:
>> What about directly connected to a HBA like an LSI, with breakout cables? Each SATA disk has its own lane and will talk sata to the HBA.
>>
>> No expander, no issue correct?
>>
>> http://www.supermicro.com/products/system/2U/2027/SSG-2027R-AR24.cfm
>> Backplane
>> SAS2 / SATA3 direct attached backplane
>>
>
> I thought that would be safe when I designed a storage system that used direct connection to each drive via LSI HBAs in an Intel JBOD where I bypassed the SAS expanders. http://www.bigdatajunkie.com/index.php/11-hardware/jbods/15-intel-jbod-ripe-for-ssds
>
> All the SSDs in here are connected to L2ARC and are on an independent HBA from the disk pool. When one interposer took a dive the driver kept resetting it and the driver ended up taking a dive, taking down the whole pool.
>
> This is definitely an OS/driver issue. The correct thing to do would be first drop the drive, then then HBA, the driver needs to stay alive at nearly all cost since it is handling multiple paths to potentially multiple pools.
>
> This type of failure is what disturbs me the most about ZFS on Illumos. Granted the failure was triggered by an Interposer but the same problem occurs if a SAS disk becomes erratic. I've seem the same system log messages from SAS disks when they were having sector read problems. When this is occurring the entire pool is unresponsive. I've seen this go on for as long as 20 minutes with SAS disks in a pure SAS system. It only stopped when I offlined the offending disk. When a disk is unresponsive for even a few seconds it needs to be kicked from the pool unless there is no more redundancy.
>
> -Chip
>
>>
>>
>>> On 7/05/2014 2:45 pm, Garrett D'Amore via illumos-zfs wrote:
>>> Yes. illumos and sata directly connected is fine.
>>>
>>> Don't mix sata and SAS. Use a sata port (AHCI) for your sata ssd drives.
>>>
>>> A bad disk may do bad things in the pool if the pool is not redundantly configured. But if it is you should be fine.
>>>
>>> Sent from my iPhone
>>>
>>>>> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <***@lists.illumos.org> wrote:
>>>>>
>>>>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>>>>> If you want to use SAS expanders, you need real SAS end devices. An
>>>>> alternative is to do SATA (no interposers) with direct-attach. We've
>>>>> had reasonable success with that configuration at Joyent using Intel
>>>>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>>>>> haven't used the Seagate "PRO" model you're considering; the only
>>>>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>>>>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>>>>
>>>> Do you remember what storage bays you used at Joyent?
>>>>
>>>> Using something like the intel 24 port jbod as Chip suggested, means we have to use 6 x External SAS cables connected to a single host. Not ideal, but I suppose we could make that work if we could get a 'yes, illumos and sata directly connected is fine' Seems that isn't the case either with others noting that a disk can bring down the whole zpool.
>>>>
>>>> If we use sata direct connect with an external storage bay, then we lose the ability to provide a fail over mechanism if we need to upgrade oi or if it crashes? I don't like 3am runs to the DC any more and I don't really want to be thinking about what ifs before I go to sleep at night ;)
>>>>
>>>> Seems everywhere you turn, there is a gotchya with this. I'd love to be able to go straight to some SAS SSD's, but the reality is the cost per GB is Double and the performance of the flash does not scale with the Dollar.
>>>>
>>>>
>>>>
>>>>
>>>> -------------------------------------------
>>>> illumos-zfs
>>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
>>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>>> Powered by Listbox: http://www.listbox.com
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097
>>>
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/21878139-69539aca
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
> illumos-zfs | Archives | Modify Your Subscription



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-zfs
2014-05-07 04:47:12 UTC
Permalink
Btw. In case my last reply wasn't clear enough. Sata is not an option in a shared storage cluster.

If you want shared storage ha clustering then you need to stop being cheap and pony up the funds for a SAS drive. HA costs more. Get over it.

Sent from my iPhone

> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <***@lists.illumos.org> wrote:
>
>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>> If you want to use SAS expanders, you need real SAS end devices. An
>> alternative is to do SATA (no interposers) with direct-attach. We've
>> had reasonable success with that configuration at Joyent using Intel
>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>> haven't used the Seagate "PRO" model you're considering; the only
>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>
> Do you remember what storage bays you used at Joyent?
>
> Using something like the intel 24 port jbod as Chip suggested, means we have to use 6 x External SAS cables connected to a single host. Not ideal, but I suppose we could make that work if we could get a 'yes, illumos and sata directly connected is fine' Seems that isn't the case either with others noting that a disk can bring down the whole zpool.
>
> If we use sata direct connect with an external storage bay, then we lose the ability to provide a fail over mechanism if we need to upgrade oi or if it crashes? I don't like 3am runs to the DC any more and I don't really want to be thinking about what ifs before I go to sleep at night ;)
>
> Seems everywhere you turn, there is a gotchya with this. I'd love to be able to go straight to some SAS SSD's, but the reality is the cost per GB is Double and the performance of the flash does not scale with the Dollar.
>
>
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-07 05:48:58 UTC
Permalink
Agree totally with what you're saying. Would be nice to be able to use
many more 'cheaper, still enterprise' drives in a larger clustered
storage network, perhaps its time we looked at what ceph can do for us.



On 7/05/2014 2:47 pm, Garrett D'Amore via illumos-zfs wrote:
> Btw. In case my last reply wasn't clear enough. Sata is not an option in a shared storage cluster.
>
> If you want shared storage ha clustering then you need to stop being cheap and pony up the funds for a SAS drive. HA costs more. Get over it.
>
> Sent from my iPhone
>
>> On May 6, 2014, at 7:07 PM, "Luke Iggleden via illumos-zfs" <***@lists.illumos.org> wrote:
>>
>>> On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>>> If you want to use SAS expanders, you need real SAS end devices. An
>>> alternative is to do SATA (no interposers) with direct-attach. We've
>>> had reasonable success with that configuration at Joyent using Intel
>>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>>> haven't used the Seagate "PRO" model you're considering; the only
>>> Seagate device I've evaluated was the Pulsar.2, which worked. I never
>>> recommend SATA, but if SAS just isn't an option, this is the way to go.
>>
>> Do you remember what storage bays you used at Joyent?
>>
>> Using something like the intel 24 port jbod as Chip suggested, means we have to use 6 x External SAS cables connected to a single host. Not ideal, but I suppose we could make that work if we could get a 'yes, illumos and sata directly connected is fine' Seems that isn't the case either with others noting that a disk can bring down the whole zpool.
>>
>> If we use sata direct connect with an external storage bay, then we lose the ability to provide a fail over mechanism if we need to upgrade oi or if it crashes? I don't like 3am runs to the DC any more and I don't really want to be thinking about what ifs before I go to sleep at night ;)
>>
>> Seems everywhere you turn, there is a gotchya with this. I'd love to be able to go straight to some SAS SSD's, but the reality is the cost per GB is Double and the performance of the flash does not scale with the Dollar.
>>
>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/22035932-85c5d227
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
>
Jim Klimov via illumos-zfs
2014-05-07 07:28:56 UTC
Permalink
7 мая 2014 г. 4:07:41 CEST, Luke Iggleden via illumos-zfs <***@lists.illumos.org> пишет:
>On 7/05/2014 2:37 am, Keith Wesolowski wrote:
>> If you want to use SAS expanders, you need real SAS end devices. An
>> alternative is to do SATA (no interposers) with direct-attach. We've
>> had reasonable success with that configuration at Joyent using Intel
>> DCS3700 devices and the same 2308-IT HBA you're likely looking at. I
>> haven't used the Seagate "PRO" model you're considering; the only
>> Seagate device I've evaluated was the Pulsar.2, which worked. I
>never
>> recommend SATA, but if SAS just isn't an option, this is the way to
>go.
>
>Do you remember what storage bays you used at Joyent?
>
>Using something like the intel 24 port jbod as Chip suggested, means we
>
>have to use 6 x External SAS cables connected to a single host. Not
>ideal, but I suppose we could make that work if we could get a 'yes,
>illumos and sata directly connected is fine' Seems that isn't the case
>either with others noting that a disk can bring down the whole zpool.
>
>If we use sata direct connect with an external storage bay, then we
>lose
>the ability to provide a fail over mechanism if we need to upgrade oi
>or
>if it crashes? I don't like 3am runs to the DC any more and I don't
>really want to be thinking about what ifs before I go to sleep at night
>;)
>
>Seems everywhere you turn, there is a gotchya with this. I'd love to be
>
>able to go straight to some SAS SSD's, but the reality is the cost per
>GB is Double and the performance of the flash does not scale with the
>Dollar.
>
>
>
>
>-------------------------------------------
>illumos-zfs
>Archives: https://www.listbox.com/member/archive/182191/=now
>RSS Feed:
>https://www.listbox.com/member/archive/rss/182191/22497542-d75cd9d9
>Modify Your Subscription:
>https://www.listbox.com/member/?&
>Powered by Listbox: http://www.listbox.com

Hi, I want to make sure i haven't lost track of the thread: you'd love a box of cheaper sata disks connected to two hosts in a way that it can failover? Building a few home/soho nases, i'd love that too ;)

Theoretically it might work, with some backplane in the box multiplexing the connectivity from each disk to the two hosts' hba's, but in practice it seems that sata (hardware and protocol) are just not designed and engineered to do that, at least not in a reliable manner. Market segregation - a home user does not have to pay for those extra features he won't ever use, but does get a cheaper and less capable device. And maybe even yet cheaper if the manufacturer cuts even more corners aiming at the average joe ;)

Also, you are aware that on the hardware level the sas port and cabling includes two separate connection lanes, unlike sata, which kinda helps in the matters of multipathing and faster/more reliable connections to each disk, right?

Possibly, this is also another difference between enterprise sata and sas, with otherwise similar (allegedly more reliable) hardware compared to the consumer sata.

Finally, at least to some extent, performance of ssd does scale with the dollar: at least, the bigger they are, the more chips they have to stripe the data over, and single chips have pretty limited performance. Also, while consumer devices might boost size (less overprovisioning) and performance (higher voltage writes, less/simplistic wear-leveling), they might to so at expense of reliability. Also on the hardware level - i.e. the much-discussed power-loss protection. On scratch or l2arc devices this might not matter and you indeed prefer fast shorter-lived devices. On the main pool storage (even on all-ssd pools) you should have a different preference - towards reliability after all ;)

Hth, Jim
--
Typos courtesy of K-9 Mail on my Samsung Android
Keith Wesolowski via illumos-zfs
2014-05-07 16:29:27 UTC
Permalink
On Wed, May 07, 2014 at 12:07:41PM +1000, Luke Iggleden via illumos-zfs wrote:

> Do you remember what storage bays you used at Joyent?

Everything we use (still use) is documented at
http://eng.joyent.com/manufacturing/bom.html.

We neither use nor support JBODs; cabling, management, and overall
reliability make them a poor choice, and our use case never requires
that much storage on one system anyway.

> Using something like the intel 24 port jbod as Chip suggested, means we
> have to use 6 x External SAS cables connected to a single host. Not
> ideal, but I suppose we could make that work if we could get a 'yes,
> illumos and sata directly connected is fine' Seems that isn't the case
> either with others noting that a disk can bring down the whole zpool.

The lessons we've learned have led us away from building centralised
storage systems with gigantic fabrics, regardless of end device
interface. These days we build scale and redundancy higher up the stack
using smaller, simpler storage nodes that have proven very reliable and
easy to manage. The software is complex and difficult, but it's better
to deal with the hard problems ourselves, where we can see and control
them, than in the firmware in disks, expanders, and IOCs.

> If we use sata direct connect with an external storage bay, then we lose
> the ability to provide a fail over mechanism if we need to upgrade oi or
> if it crashes? I don't like 3am runs to the DC any more and I don't
> really want to be thinking about what ifs before I go to sleep at night ;)

You lose this anyway with SATA, because of affiliations. The
interposers nominally give you back that ability, but they don't fucking
work, so it's not a great tradeoff.

> Seems everywhere you turn, there is a gotchya with this. I'd love to be

Architecture is hard. CAP sucks. You get what you pay for.
jason matthews
2014-05-06 17:43:13 UTC
Permalink
On May 6, 2014, at 12:45 AM, "Luke Iggleden" <***@lists.illumos.org> wrote:

> My question really relates to the issues with SATA on SAS expanders and ZFS and are modern LSI interposers with this combo working ok now with the mpt_sas driver?

You are asking for a world of heart ache. If you go with SATA devices then you want to be direct attached.

My GenII environment is built on:

Dell r720 e5-2693v2, 384gb of ram
Dell r820 4650L 576gb of ram
removed the perc cards & the spinning rust
added two LSI 9207-8i cards, direct attach - helpful hint is to use the hidden disk boot selector option in the LSI firmware ALT-B
GenII is using all DC S3700 800GB drives. We have a relatively small number of disks each system and routinely hit 100k+ IOPs running postgres databases.

To recap, if you use SATA go direct attached
If you use expanders, use SAS

In the GenI environment I used Newisys/Sanima NDS-2241 disk shelves which have expanders. First rate, top notch gear. DataOn rebrands them. If you are buying any quantity, Newisys will deal direct.










-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Andrew Gabriel via illumos-zfs
2014-05-08 08:35:02 UTC
Permalink
One other thing I've seen with an all-SSD pool is the nature of wearing
on SSD's being very different from that of HDD's.

Flash-based SSD's have much more predictable lives than HDD's. This also
means that when you have same traffic on SSD's (such as mirrors), they
will wear out at the same time. If you run your pool like you probably
would with HDD's, i.e. replace a drive when it fails, you will likely
find with SSD's that you get multiple device failures in the same vdev
at almost the same time. I've come across multiple wear-out failures
within 2 days on a 2 year old pool. The conventional HDD calculations
for MTTDL don't work at SSD wear-out time.

This requires handling differently than HDDs. Keep an eye on the wear of
the SSD's and replace them before they completely wear out. You will
also require more spares stock as they reach this point than you would
with HDD's, since you expect them all to fail around the same time,
which is not typically the failure pattern of HDD's. Your supplier may
struggle to supply you with a whole pool of replacement SSD's in the
turn-around time you expect for failed replacement drives.

Something to bare in mind if you haven't used all-SSD pools before.

--
Andrew Gabriel
jason matthews via illumos-zfs
2014-05-08 20:59:23 UTC
Permalink
On May 8, 2014, at 1:35 AM, Andrew Gabriel via illumos-zfs <***@lists.illumos.org> wrote:

> Flash-based SSD's have much more predictable lives than HDD's. This also means that when you have same traffic on SSD's (such as mirrors), they will wear out at the same time. If you run your pool like you probably would with HDD's, i.e. replace a drive when it fails, you will likely find with SSD's that you get multiple device failures in the same vdev at almost the same time. I've come across multiple wear-out failures within 2 days on a 2 year old pool. The conventional HDD calculations for MTTDL don't work at SSD wear-out time.


This isn't my experience at all. In three years of running SSDs I have had two Intel 910s spontaneously die where the failure had nothing to do with wearing out. I have seen the 910s take check some errors.

In five months of operating 152 Intel DC S3700s I have seen four disks die. Three just stopped working altogether and the third accumulated massive checksum errors. Another, experiences interface errors but seems to fly under the radar of zfs. Sure, you could blame these on infant mortality but dead is dead.

On the consumer side, I have Crucial M4s that I use in non-critical roles that have exceeded expected cell writes by 2x-3x.

On par, I would say SSDs have approximately the same or perhaps slightly higher failure rate of the 15k Seagate SAS drives I installed in 2011 but they sure pack a bigger bang.


j.






-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-08 22:07:48 UTC
Permalink
Great advice Andrew. Guess we will write a script or something that can
be called via SNMP to graph and alert via nagios once we get to 85% life
on the SSD's. Its then best to just discard the whole pool I'd say.
zfs-send off it, re-deploy fresh flash and zfs-send back.


On 8/05/2014 6:35 pm, Andrew Gabriel via illumos-zfs wrote:
> One other thing I've seen with an all-SSD pool is the nature of wearing
> on SSD's being very different from that of HDD's.
>
> Flash-based SSD's have much more predictable lives than HDD's. This also
> means that when you have same traffic on SSD's (such as mirrors), they
> will wear out at the same time. If you run your pool like you probably
> would with HDD's, i.e. replace a drive when it fails, you will likely
> find with SSD's that you get multiple device failures in the same vdev
> at almost the same time. I've come across multiple wear-out failures
> within 2 days on a 2 year old pool. The conventional HDD calculations
> for MTTDL don't work at SSD wear-out time.
>
> This requires handling differently than HDDs. Keep an eye on the wear of
> the SSD's and replace them before they completely wear out. You will
> also require more spares stock as they reach this point than you would
> with HDD's, since you expect them all to fail around the same time,
> which is not typically the failure pattern of HDD's. Your supplier may
> struggle to supply you with a whole pool of replacement SSD's in the
> turn-around time you expect for failed replacement drives.
>
> Something to bare in mind if you haven't used all-SSD pools before.
>
Robert Milkowski via illumos-zfs
2014-05-09 10:04:10 UTC
Permalink
Interesting... especially in the light of this post:

https://blogs.oracle.com/ahl/entry/mirroring_flash_ssds



--
Robert Milkowski
http://milek.blogspot.com



> -----Original Message-----
> From: Andrew Gabriel via illumos-zfs [mailto:***@lists.illumos.org]
> Sent: 08 May 2014 09:35
> To: ***@lists.illumos.org
> Subject: Re: [zfs] all ssd pool
>
> One other thing I've seen with an all-SSD pool is the nature of wearing
> on SSD's being very different from that of HDD's.
>
> Flash-based SSD's have much more predictable lives than HDD's. This
> also means that when you have same traffic on SSD's (such as mirrors),
> they will wear out at the same time. If you run your pool like you
> probably would with HDD's, i.e. replace a drive when it fails, you will
> likely find with SSD's that you get multiple device failures in the
> same vdev at almost the same time. I've come across multiple wear-out
> failures within 2 days on a 2 year old pool. The conventional HDD
> calculations for MTTDL don't work at SSD wear-out time.
>
> This requires handling differently than HDDs. Keep an eye on the wear
> of the SSD's and replace them before they completely wear out. You will
> also require more spares stock as they reach this point than you would
> with HDD's, since you expect them all to fail around the same time,
> which is not typically the failure pattern of HDD's. Your supplier may
> struggle to supply you with a whole pool of replacement SSD's in the
> turn-around time you expect for failed replacement drives.
>
> Something to bare in mind if you haven't used all-SSD pools before.
>
> --
> Andrew Gabriel
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/24010221-
> f80ab471
> Modify Your Subscription:
> https://www.listbox.com/member/?&
> d63933b0
> Powered by Listbox: http://www.listbox.com
Schweiss, Chip
2014-05-06 12:46:56 UTC
Permalink
I've done a fair amount of experimentation with various SATA SSDs and
interposers. Long story short more SSD don't work than do with an
interposer.

The Intel DC3700 would not fully initialize. I tried there different
Micron's and none of them would finish initializing either. I have had
the best luck with Samsung 840 Pro and EVO.

The lack of trim can be overcome by short stroking the SSD as long as it
has a good garbage collection routine. Some do and and other don't.
Anandtech has a good write up on this:
http://www.anandtech.com/show/6489/playing-with-op

I also have some lengthy write ups about SSDs and interposers on my blog:
http://www.bigdatajunkie.com

The build I have in service has 17TB useable space and had been stable for
many months now. We have processed 100's of TBs of data on this system
and frequently see it saturate both 10Gb network interfaces. The
bottleneck is typically the CPU, in hind sight it should have been built
with a much beefier CPU.

I haven't had an interposer die since the latest Illumos patches were
applied. If and when that happens, I will certainly know if the patches
keep the system alive. Within the first month of the service an
interposer did die and take the system down.

-Chip


On Tue, May 6, 2014 at 4:32 AM, Richard Kojedzinszky
<***@lists.illumos.org>wrote:

>
> And also a performance and reliability test would worth it. (
> https://github.com/rkojedzinszky/zfsziltest)
>
> I would be interested in comparing it to an Intel SSD DC3700, which has a
> very impressive performance, and with Intel's promise, its endurance is
> comparable to SLC based SSDs. And the cost is very reasonable.
>
> Kojedzinszky Richard
>
>
> On Tue, 6 May 2014, Steven Hartland wrote:
>
> I can't really comment on OI but we have quite a bit of experience of all
>> SSD
>> pools under FreeBSD.
>>
>> The biggest issue is single strength when going though expanders when
>> using
>> 6Gbps devices. We've tested a number of chassis with hotswap backplanes
>> which have turned out to have bad signal strength which results in
>> unstable
>> devices which will drop under load.
>>
>> Once you have a setup which is confirmed to have good signaling then
>> things
>> become a lot easier.
>>
>> I cant say I've used Seagate SSD's as we mainly use consumer grade disks
>> which have served us well for what we do.
>>
>> One thing that may be an issue is SSD's generally require TRIM support to
>> remain performant. Currently OI doesn't have TRIM support for ZFS where
>> as FreeBSD does, which myself and other actively maintain so it maybe
>> something worth considering.
>>
>> FW is also very important, particularly when it comes to TRIM support so
>> I'd definitely recommend testing a single disk before buying in bulk.
>>
>> Regards
>> Steve
>>
>>
>> ----- Original Message ----- From: "Luke Iggleden" <***@lists.illumos.org
>> >
>> To: <***@lists.illumos.org>
>> Sent: Tuesday, May 06, 2014 8:45 AM
>> Subject: [zfs] all ssd pool
>>
>>
>> Hi All,
>>>
>>> We're looking at deploying an all SSD pool with the following hardware:
>>>
>>> Dual Node
>>>
>>> Supermicro SSG-2027B-DE2R24L
>>> (includes LSI 2308 Controller)
>>> 128GB RAM per node
>>> 24 x Seagate PRO 600 480GB SSD
>>>
>>> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
>>> RSF-1 High Availability Suite to failover between nodes
>>> Open Indiana or Omni OS
>>>
>>> My question really relates to the issues with SATA on SAS expanders and
>>> ZFS and are modern LSI interposers with this combo working ok now with the
>>> mpt_sas driver?
>>>
>>> I've seen some posts on forums which suggest that a couple of
>>> interposers have died and have crashed the mpt_sas driver due to resets,
>>> but I'm wondering if that is related to the bug in illumos which crashes
>>> the mpt_sas driver (illumos bugs 4403, 4682 & 4819)
>>>
>>> https://www.illumos.org/issues/4403
>>> https://www.illumos.org/issues/4682
>>> https://www.illumos.org/issues/4819
>>>
>>> If LSI interposers are a no go, has anyone got these (or other) SATA
>>> SSD's running on supermicro SAS2 expanders and getting a reliable platform,
>>> specifically when a SSD dies or performance is at max?
>>>
>>> A few years ago we were burned by putting Hitachi 7200rpm SATA disks on
>>> an expander, this was before most of the posts about 'sata on sas DONT!'
>>> posts came out. That was 2009/10 then, so things could have changed?
>>>
>>> Also, there were some other posts suggesting that the WWN for SSD's with
>>> LSI interposers were not being passed through, but it was suggested that
>>> this was an issue with the SSD and not the interposer.
>>>
>>> Thanks in advance.
>>>
>>>
>>> Luke Iggleden
>>>
>>>
>>>
>>> -------------------------------------------
>>> illumos-zfs
>>> Archives: https://www.listbox.com/member/archive/182191/=now
>>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>>> 24401717-fdfe502b
>>> Modify Your Subscription: https://www.listbox.com/member/?&
>>> Powered by Listbox: http://www.listbox.com
>>>
>>>
>>
>>
>> -------------------------------------------
>> illumos-zfs
>> Archives: https://www.listbox.com/member/archive/182191/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
>> 25402478-0858cafa
>>
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>>
>>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182191/
> 21878139-69539aca
> Modify Your Subscription: https://www.listbox.com/
> member/?&
> Powered by Listbox: http://www.listbox.com
>



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Luke Iggleden via illumos-zfs
2014-05-07 00:08:33 UTC
Permalink
We were looking at not using more than 75 - 80% of the space on each of
the SSD drives in the zpool to assist with this issue.

The seagate drives we were looking at using have 28% over provision
built in to them. Would you still short stroke these to 80% or trust the
drive manufacturer. I think I know my answer.

http://www.tomshardware.com/reviews/ssd-dc-s3500-review-6gbps,3529-4.html

Around half of the initial deployment will be for a 2TB MS SQL DB (on a
zvol) with lots of 4k random reads. This is where the Seagate (according
to Toms Hardware) shines, right up there with the DC3700.

I did come across your blog prior which triggered my post to this list,
looks good mate! Your update on the 21/Jan suggests that the SAS bug is
that of the illumos kernel and not directly related to the interposer.
We unfortunately just upgraded our oi installs to find the mpt_sas bug
and we are not taking any chances with removing sas drives from the bus,
having to use hot spares until we can get a stable version of illumos on
oi running with regards to the mpt_sas issues.

Is there any way short of plugging in and out sas disk (or
sas/sata/interposer/) to cause an error on the zpool so it faults so we
can test this?

Also - how did you go with the stec sas ssd's?




On 6/05/2014 10:46 pm, Schweiss, Chip wrote:
> I've done a fair amount of experimentation with various SATA SSDs and
> interposers. Long story short more SSD don't work than do with an
> interposer.
>
> The Intel DC3700 would not fully initialize. I tried there different
> Micron's and none of them would finish initializing either. I have had
> the best luck with Samsung 840 Pro and EVO.
>
> The lack of trim can be overcome by short stroking the SSD as long as it
> has a good garbage collection routine. Some do and and other don't.
> Anandtech has a good write up on this:
> http://www.anandtech.com/show/6489/playing-with-op
>
> I also have some lengthy write ups about SSDs and interposers on my
> blog: http://www.bigdatajunkie.com
>
> The build I have in service has 17TB useable space and had been stable
> for many months now. We have processed 100's of TBs of data on this
> system and frequently see it saturate both 10Gb network interfaces. The
> bottleneck is typically the CPU, in hind sight it should have been built
> with a much beefier CPU.
>
> I haven't had an interposer die since the latest Illumos patches were
> applied. If and when that happens, I will certainly know if the
> patches keep the system alive. Within the first month of the service
> an interposer did die and take the system down.
>
> -Chip
>
>
> On Tue, May 6, 2014 at 4:32 AM, Richard Kojedzinszky
> <***@lists.illumos.org <mailto:***@lists.illumos.org>> wrote:
>
>
> And also a performance and reliability test would worth it.
> (https://github.com/__rkojedzinszky/zfsziltest
> <https://github.com/rkojedzinszky/zfsziltest>)
>
> I would be interested in comparing it to an Intel SSD DC3700, which
> has a very impressive performance, and with Intel's promise, its
> endurance is comparable to SLC based SSDs. And the cost is very
> reasonable.
>
> Kojedzinszky Richard
>
>
> On Tue, 6 May 2014, Steven Hartland wrote:
>
> I can't really comment on OI but we have quite a bit of
> experience of all SSD
> pools under FreeBSD.
>
> The biggest issue is single strength when going though expanders
> when using
> 6Gbps devices. We've tested a number of chassis with hotswap
> backplanes
> which have turned out to have bad signal strength which results
> in unstable
> devices which will drop under load.
>
> Once you have a setup which is confirmed to have good signaling
> then things
> become a lot easier.
>
> I cant say I've used Seagate SSD's as we mainly use consumer
> grade disks
> which have served us well for what we do.
>
> One thing that may be an issue is SSD's generally require TRIM
> support to
> remain performant. Currently OI doesn't have TRIM support for
> ZFS where
> as FreeBSD does, which myself and other actively maintain so it
> maybe
> something worth considering.
>
> FW is also very important, particularly when it comes to TRIM
> support so
> I'd definitely recommend testing a single disk before buying in
> bulk.
>
> Regards
> Steve
>
>
> ----- Original Message ----- From: "Luke Iggleden"
> <***@lists.illumos.org <mailto:***@lists.illumos.org>>
> To: <***@lists.illumos.org <mailto:***@lists.illumos.org>>
> Sent: Tuesday, May 06, 2014 8:45 AM
> Subject: [zfs] all ssd pool
>
>
> Hi All,
>
> We're looking at deploying an all SSD pool with the
> following hardware:
>
> Dual Node
>
> Supermicro SSG-2027B-DE2R24L
> (includes LSI 2308 Controller)
> 128GB RAM per node
> 24 x Seagate PRO 600 480GB SSD
>
> 24 x LSI interposers (sata > sas) ?? (maybe, see post)
> RSF-1 High Availability Suite to failover between nodes
> Open Indiana or Omni OS
>
> My question really relates to the issues with SATA on SAS
> expanders and ZFS and are modern LSI interposers with this
> combo working ok now with the mpt_sas driver?
>
> I've seen some posts on forums which suggest that a couple
> of interposers have died and have crashed the mpt_sas driver
> due to resets, but I'm wondering if that is related to the
> bug in illumos which crashes the mpt_sas driver (illumos
> bugs 4403, 4682 & 4819)
>
> https://www.illumos.org/__issues/4403
> <https://www.illumos.org/issues/4403>
> 
https://www.illumos.org/__issues/4682
> <https://www.illumos.org/issues/4682>
> 
https://www.illumos.org/__issues/4819
> <https://www.illumos.org/issues/4819>
>
> If LSI interposers are a no go, has anyone got these (or
> other) SATA SSD's running on supermicro SAS2 expanders and
> getting a reliable platform, specifically when a SSD dies or
> performance is at max?
>
> A few years ago we were burned by putting Hitachi 7200rpm
> SATA disks on an expander, this was before most of the posts
> about 'sata on sas DONT!' posts came out. That was 2009/10
> then, so things could have changed?
>
> Also, there were some other posts suggesting that the WWN
> for SSD's with LSI interposers were not being passed
> through, but it was suggested that this was an issue with
> the SSD and not the interposer.
>
> Thanks in advance.
>
>
> Luke Iggleden
>
>
>
> ------------------------------__-------------
> illumos-zfs
> Archives:
> https://www.listbox.com/__member/archive/182191/=now
> <https://www.listbox.com/member/archive/182191/=now>
> RSS Feed:
> https://www.listbox.com/__member/archive/rss/182191/__24401717-fdfe502b
> <https://www.listbox.com/member/archive/rss/182191/24401717-fdfe502b>
> Modify Your Subscription:
> https://www.listbox.com/__member/?&
> <https://www.listbox.com/member/?&>
> Powered by Listbox: http://www.listbox.com
>
>
>
>
> ------------------------------__-------------
> illumos-zfs
> Archives: https://www.listbox.com/__member/archive/182191/=now
> <https://www.listbox.com/member/archive/182191/=now>
> RSS Feed:
> https://www.listbox.com/__member/archive/rss/182191/__25402478-0858cafa
> <https://www.listbox.com/member/archive/rss/182191/25402478-0858cafa>
>
> Modify Your Subscription: https://www.listbox.com/__member/?&
> <https://www.listbox.com/member/?&>
> Powered by Listbox: http://www.listbox.com
>
>
>
> ------------------------------__-------------
> illumos-zfs
> Archives: https://www.listbox.com/__member/archive/182191/=now
> <https://www.listbox.com/member/archive/182191/=now>
> RSS Feed:
> https://www.listbox.com/__member/archive/rss/182191/__21878139-69539aca
> <https://www.listbox.com/member/archive/rss/182191/21878139-69539aca>
> Modify Your Subscription:
> https://www.listbox.com/__member/?&id___secret=21878139-61e37d3e
> <https://www.listbox.com/member/?&>
> Powered by Listbox: http://www.listbox.com
>
>
> *illumos-zfs* | Archives
> <https://www.listbox.com/member/archive/182191/=now>
> <https://www.listbox.com/member/archive/rss/182191/26029255-3afb4097> |
> Modify
> <https://www.listbox.com/member/?&>
> Your Subscription [Powered by Listbox] <http://www.listbox.com>
>
Loading...