Discussion:
Petabyte pool?
Marion Hakanson
2013-03-16 01:09:34 UTC
Greetings,

Has anyone out there built a 1-petabyte pool? I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.

We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
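As a rough check on the "eight to ten" estimate, here is a minimal Python sketch. The vdev layout (4 x 11-disk raidz3 plus one spare in the 45 slots) is an assumption chosen for illustration, not something stated in the post, and ZFS metadata and padding overhead are ignored.

```python
import math

TB = 10**12    # decimal terabyte, as drive vendors count
PIB = 2**50    # binary "power-of-two" petabyte

def usable_bytes_per_chassis(vdevs, disks_per_vdev, parity, disk_bytes):
    """Data capacity of one chassis, ignoring ZFS metadata and padding overhead."""
    data_disks = vdevs * (disks_per_vdev - parity)
    return data_disks * disk_bytes

def chassis_needed(target_bytes, per_chassis_bytes):
    """Smallest whole number of chassis that reaches the target capacity."""
    return math.ceil(target_bytes / per_chassis_bytes)

# Hypothetical layout: 4 x 11-disk raidz3 vdevs plus 1 hot spare in 45 slots.
per_chassis = usable_bytes_per_chassis(vdevs=4, disks_per_vdev=11, parity=3,
                                       disk_bytes=4 * TB)

print(per_chassis / TB)                      # 128.0
print(chassis_needed(10**15, per_chassis))   # 8  (decimal "marketing" petabyte)
print(chassis_needed(PIB, per_chassis))      # 9  (power-of-two petabyte)
```

Under these assumptions the answer lands at eight or nine chassis; a more conservative layout or real-world overhead would push it toward ten.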

I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.

So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?

Thanks and regards,

Marion
Ray Van Dolson
2013-03-16 01:17:46 UTC
Post by Marion Hakanson
Greetings,
Has anyone out there built a 1-petabyte pool? I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.
We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Thanks and regards,
Marion
We've come close:

***@mes-str-imgnx-p1:~$ zpool list
NAME      SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
datapool  978T  298T   680T  30%  1.00x  ONLINE  -
syspool   278G  104G   174G  37%  1.00x  ONLINE  -

Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual
pathed to a couple of LSI SAS switches.

Using Nexenta but no reason you couldn't do this w/ $whatever.

We did triple parity and our vdev membership is set up such that we can
lose up to three JBODs and still be functional (one vdev member disk
per JBOD).
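The "one vdev member disk per JBOD" rule can be sketched in Python as below. The JBOD count (10) and the 12 bays per enclosure are assumptions for illustration only; the message does not say how many enclosures are in the pool.

```python
from itertools import combinations

def build_vdevs(num_jbods, bays_per_jbod):
    """One vdev per bay position; each vdev takes exactly one disk from every JBOD."""
    return [[(jbod, bay) for jbod in range(num_jbods)]
            for bay in range(bays_per_jbod)]

def survives(vdevs, failed_jbods, parity):
    """A raidz vdev stays available while it has lost no more than `parity` disks."""
    failed = set(failed_jbods)
    return all(sum(1 for jbod, _ in vdev if jbod in failed) <= parity
               for vdev in vdevs)

# Hypothetical sizes: the thread does not give the real enclosure count.
NUM_JBODS, BAYS, PARITY = 10, 12, 3
vdevs = build_vdevs(NUM_JBODS, BAYS)

any_three_ok = all(survives(vdevs, combo, PARITY)
                   for combo in combinations(range(NUM_JBODS), 3))
any_four_ok = all(survives(vdevs, combo, PARITY)
                  for combo in combinations(range(NUM_JBODS), 4))
print(any_three_ok, any_four_ok)   # True False
```

Because every vdev has at most one disk in any enclosure, losing an enclosure costs each vdev exactly one member, so triple parity covers any three enclosure failures and no more.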

This is with 3TB NL-SAS drives.

Ray
Kristoffer Sheather @ CloudCentral
2013-03-16 01:21:22 UTC
Well, off the top of my head:

2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's

That should fit within 1 rack comfortably and provide 1 PB of storage.

Regards,

Kristoffer Sheather
Cloud Central
Scale Your Data Center In The Cloud
Phone: 1300 144 007 | Mobile: +61 414 573 130 | Email:
***@cloudcentral.com.au
LinkedIn: | Skype: kristoffer.sheather | Twitter:
http://twitter.com/kristofferjon

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed:
https://www.listbox.com/member/archive/rss/182191/23629987-2afa167a
Modify Your Subscription:
https://www.listbox.com/member/?member_id=23629987&id_secret=23629987-c48148
a8
Powered by Listbox: http://www.listbox.com
Bob Friesenhahn
2013-03-16 14:20:56 UTC
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's
That should fit within 1 rack comfortably and provide 1 PB storage..
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between
JBOD chassis? Does the server need to be powered up last so that it
does not time out on the zfs import?

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2013-03-16 19:27:08 UTC
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's
That should fit within 1 rack comfortably and provide 1 PB storage..
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
I guess you can use managed PDUs like those from APC (many models for
varied socket types and counts). They can be scripted at an advanced
level, and at a basic level I think per-socket delays can simply be
configured to stagger startup once power arrives from the wall (UPS),
regardless of what the boxes' individual power supplies can do.
Conveniently, they also allow a remote hard reset of hung boxes
without walking to the server room ;)

My 2c,
//Jim Klimov
Tim Cook
2013-03-16 19:43:08 UTC
Post by Jim Klimov
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's
That should fit within 1 rack comfortably and provide 1 PB storage..
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
I guess you can use managed PDUs like those from APC (many models for
varied socket types and amounts); they can be scripted on an advanced
level, and on a basic level I think delays can be just configured
per-socket to make the staggered startup after giving power from the
wall (UPS) regardless of what the boxes' individual power sources can
do. Conveniently, they also allow to do a remote hard-reset of hung
boxes without walking to the server room ;)
My 2c,
//Jim Klimov
Any modern JBOD should have the intelligence built in to stagger drive
spin-up. I wouldn't spend money on one that didn't. There's really no
need to stagger the JBOD power-up at the PDU.

As for the head, yes it should have a delayed power on which you can
typically set in the BIOS.

--Tim
Jim Klimov
2013-03-16 19:41:20 UTC
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's
That should fit within 1 rack comfortably and provide 1 PB storage..
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
Giving this question a second thought, I think JBODs should spin-up
quickly (i.e. when power is given) while the server head(s) take time
to pass POST, initialize their HBAs and other stuff. Booting 8 JBODs,
one every 15 seconds to complete a typical spin-up power draw, would
take a couple of minutes. It is likely that a server booted along with
the first JBOD won't get to importing the pool this quickly ;)

Anyhow, with such a system attention should be given to redundant power
and cooling, including redundant UPSes preferably fed from different
power lines going into the room.

This does not seem like a fantastic power sucker, however. 480 drives at
15W would consume 7200W; add a bit for processor/RAM heads (perhaps
a kW?) and this would still fit into 8-10kW, so a couple of 15kVA UPSes
(or more smaller ones) should suffice including redundancy. This might
overall exceed a rack in size though. But for power/cooling this seems
like a standard figure for a 42U rack or just a bit more.
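The estimates in this message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the 15 W per drive, roughly 1 kW for the heads, and 15 seconds per JBOD are the figures guessed above, not measurements.

```python
def steady_state_watts(drives, watts_per_drive, head_watts):
    """Total draw once everything is spinning: drives plus server heads."""
    return drives * watts_per_drive + head_watts

def staggered_spinup_seconds(jbods, seconds_between):
    """Time to bring up all JBODs when one is powered on per interval."""
    return jbods * seconds_between

drives = 8 * 60   # 8 JBODs with 60 bays each, as proposed earlier in the thread

print(steady_state_watts(drives, 15, 1000))   # 8200
print(staggered_spinup_seconds(8, 15))        # 120
```

The 8200 W result sits inside the 8-10 kW range quoted above, and 120 seconds matches the "couple of minutes" for a staggered start.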

//Jim
Kristoffer Sheather @ CloudCentral
2013-03-16 01:24:33 UTC
Actually, you could use 3TB drives and with a 6/8 RAIDZ2 stripe achieve
1080 TB usable.
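The 1080 TB figure can be checked with a short Python sketch. It assumes the 480 drives (8 JBODs x 60 bays) are grouped into 8-wide RAIDZ2 vdevs, which is one reading of "6/8 RAIDZ2 stripe", and it ignores filesystem overhead.

```python
def raidz_usable_tb(total_drives, vdev_width, parity, drive_tb):
    """Data capacity in TB for a pool of equal-width raidz vdevs (overhead ignored)."""
    vdevs = total_drives // vdev_width
    return vdevs * (vdev_width - parity) * drive_tb

total = 8 * 60   # 8 JBODs x 60 bays

print(raidz_usable_tb(total, 8, 2, 3))   # 1080
print(raidz_usable_tb(total, 8, 2, 4))   # 1440
```

With one disk from each of the 8 JBODs per vdev, that layout would also tolerate the loss of two whole enclosures.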

You'll also need 8-16 SAS ports available on each storage head to provide
redundant multi-pathed SAS connectivity to the JBODs. I recommend LSI
9207-8e's for those and Intel X520-DA2's for the 10G NICs.

Schlacta, Christ
2013-03-16 01:25:19 UTC
I keep thinking the way to go is to create multiple ZFS raidzN storage
enclosures: 2-3 enclosures, each properly configured with a single zvol
exported over iSCSI or Fibre Channel. Import all those volumes on a head
end, where you can create a new pool as either a stripe or raidzN,
depending on your needs. Not a great performance option.

If you can reasonably sort your data by some category that makes a good
grouping (I'm guessing by date), I'd say just create a filesystem per host
and use NFS or Samba DFS so that, where possible, only the datasets
currently in use need to be spun up. You'll save on power costs.
Trey Palmer
2013-03-16 02:18:57 UTC
I tried using zvols aggregated via COMSTAR on a single head as a way to get recursive RAIDZ, but found performance was disappointing. This was using qlt/qlc with direct-connected QLogic 2562's.

Nowadays I would favor GlusterFS atop ZFSonLinux for the OP's project. Seems tailor-made.

-- Trey
Richard Yao
2013-03-16 12:00:28 UTC
Your zvols likely had a volblocksize of 8K, while the pool on top used
ashift=9. Your poor performance was likely caused by read-modify-write
operations that were incurred by writing 512-byte logical sectors into
8192-byte "physical" sectors. Performance would likely have been
significantly better had you made the iSCSI-backed pool with ashift=13.
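To make the block-size mismatch concrete, here is a small Python sketch. It is a simplified model, assuming a single aligned write and ignoring caching and write aggregation; the helper names are made up for illustration.

```python
def ashift_for(block_bytes):
    """ashift is log2 of the sector size the pool should assume (512 -> 9, 8192 -> 13)."""
    if block_bytes <= 0 or block_bytes & (block_bytes - 1):
        raise ValueError("block size must be a power of two")
    return block_bytes.bit_length() - 1

def rmw_amplification(write_bytes, backing_block_bytes):
    """Bytes the backing store rewrites per byte written, for one aligned write."""
    blocks_touched = -(-write_bytes // backing_block_bytes)   # ceiling division
    return blocks_touched * backing_block_bytes / write_bytes

print(ashift_for(512), ashift_for(8192))   # 9 13
print(rmw_amplification(512, 8192))        # 16.0
print(rmw_amplification(8192, 8192))       # 1.0
```

In this model a 512-byte write into an 8K volblocksize zvol forces the whole 8K block to be read, modified, and rewritten, while matching the upper pool's sector size to the zvol block removes that penalty.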

Marion said that he was told that low performance was fine, so using
COMSTAR as you did would likely be acceptable. However, it should be
possible for Marion to attach the 288 to 289 4TB disks required to get a
traditional petabyte to a single system using SAS expanders.

On that note, building a 1PB system involves enough money that it is
possible for Marion to innovate should he be willing to entertain that
idea. He could have custom 4U SAS expanders made and then connect all of
them to a single system that is mounted in the rack. The 4U SAS
expanders could use a design similar to the Backblaze storage pods (or
the X4540 for those that know that Sun did this first):

http://blog.backblaze.com/category/storage-pod/

They would function similarly to the units here:

http://www.sasexpanders.com/vs/enclosures/

This would effectively turn a 44U rack into a single system. It should
be more cost effective than anything else built to date. If Marion does
do this, a public writeup about it like what Backblaze did would be
wonderful.
Jan Owoc
2013-03-16 01:29:47 UTC
Post by Marion Hakanson
Has anyone out there built a 1-petabyte pool?
I'm not advising against your building/configuring a system yourself,
but I suggest taking a look at the "Petarack":
http://www.aberdeeninc.com/abcatg/petarack.htm

It shows it's been done with ZFS :-).

Jan
Marion Hakanson
2013-03-16 01:31:11 UTC
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
datapool 978T 298T 680T 30% 1.00x ONLINE -
syspool 278G 104G 174G 37% 1.00x ONLINE -
Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed to
a couple of LSI SAS switches.
Thanks Ray,

We've been looking at those too (we've had good luck with our MD1200's).

How many HBA's in the R720?

Thanks and regards,

Marion
Ray Van Dolson
2013-03-16 01:56:10 UTC
Post by Marion Hakanson
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
datapool 978T 298T 680T 30% 1.00x ONLINE -
syspool 278G 104G 174G 37% 1.00x ONLINE -
Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed to
a couple of LSI SAS switches.
Thanks Ray,
We've been looking at those too (we've had good luck with our MD1200's).
How many HBA's in the R720?
Thanks and regards,
Marion
We have qty 2 LSI SAS 9201-16e HBA's (Dell resold[1]).

Ray

[1] http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l=en&s=hied&cs=65&sku=a4614101
Marion Hakanson
2013-03-16 02:35:25 UTC
Post by Ray Van Dolson
Post by Marion Hakanson
Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
to a couple of LSI SAS switches.
How many HBA's in the R720?
We have qty 2 LSI SAS 9201-16e HBA's (Dell resold[1]).
Sounds similar in approach to the Aberdeen product another sender referred to,
with SAS switch layout:
http://www.aberdeeninc.com/images/1-up-petarack2.jpg

One concern I had is that I compared our SuperMicro JBOD with 40x 4TB drives
in it, connected via a dual-port LSI SAS 9200-8e HBA, to the same pool layout
on a 40-slot server with 40x SATA drives in it. But the server uses no SAS
expanders, instead using SAS-to-SATA octopus cables to connect the drives
directly to three internal SAS HBA's (2x 9201-16i's, 1x 9211-8i).

What I found was that the internal pool was significantly faster for both
sequential and random I/O than the pool on the external JBOD.

My conclusion was that I would not want to exceed ~48 drives on a single
8-port SAS HBA. So I thought that running the I/O of all your hundreds
of drives through only two HBA's would be a bottleneck.

LSI's specs say 4800MBytes/sec for an 8-port SAS HBA, but 4000MBytes/sec
for that card in an x8 PCIe-2.0 slot. Sure, the newer 9207-8e is rated
at 8000MBytes/sec in an x8 PCIe-3.0 slot, but it still has only the same
8 SAS ports going at 4800MBytes/sec.

Yes, I know the disks probably can't go that fast. But in my tests
above, the internal 40-disk pool measures 2000MBytes/sec sequential
reads and writes, while the external 40-disk JBOD measures at 1500
to 1700 MBytes/sec. Not a lot slower, but significantly slower, so
I do think the number of HBA's makes a difference.
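The HBA limits quoted above can be put into a short Python sketch. It only models the nominal ceilings; the 600 MB/s per lane is the standard figure for a 6 Gbit/s SAS lane after 8b/10b encoding, and expander or protocol overhead is not modelled.

```python
SAS2_LANE_MB_S = 600   # 6 Gbit/s SAS lane after 8b/10b encoding

def hba_ceiling_mb_s(sas_lanes, pcie_limit_mb_s):
    """An HBA can move no more than the slower of its SAS side and its PCIe slot."""
    return min(sas_lanes * SAS2_LANE_MB_S, pcie_limit_mb_s)

def per_drive_share_mb_s(ceiling_mb_s, drives):
    """Even split of the HBA ceiling across all drives behind it."""
    return ceiling_mb_s / drives

pcie2 = hba_ceiling_mb_s(8, 4000)   # 8-port card in an x8 PCIe 2.0 slot
pcie3 = hba_ceiling_mb_s(8, 8000)   # 9207-8e class card in an x8 PCIe 3.0 slot

print(pcie2, pcie3)                                # 4000 4800
print(round(per_drive_share_mb_s(pcie2, 48), 1))   # 83.3
print(per_drive_share_mb_s(pcie3, 48))             # 100.0
```

At about 48 drives per 8-port HBA, each drive's share of the ceiling is already below what a nearline drive can stream sequentially, which is consistent with the rule of thumb given above.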

At the moment, I'm leaning toward piling six, eight, or ten HBA's into
a server, preferably one with dual IOH's (thus two PCIe busses), and
connecting dual-path JBOD's in that manner.

I hadn't looked into SAS switches much, but they do look more reliable
than daisy-chaining a bunch of JBOD's together. I just haven't seen
how to get more bandwidth through them to a single host.

Regards,

Marion
Trey Palmer
2013-03-16 05:30:41 UTC
I know it's heresy these days, but given the I/O throughput you're looking for and the amount you're going to spend on disks, a T5-2 could make sense when they're released (I think) later this month.

Crucial sells RAM they guarantee for use in SPARC T-series, and since you're at an edu the academic discount is 35%. So a T4-2 with 512GB RAM could be had for under $35K shortly after release, 4-5 months before the E5 Xeon was released. It seemed a surprisingly good deal to me.

The T5-2 has 32x3.6GHz cores, 256 threads and ~150GB/s aggregate memory bandwidth. In my testing a T4-1 can compete with a 12-core E5 box on I/O and memory bandwidth, and this thing is about 5 times bigger than the T4-1. It should have at least 10 PCIe slots and will take 32 DIMMs minimum, maybe 64. It is also likely to cost you less than $50K with aftermarket RAM.

-- Trey
Marion Hakanson
2013-03-16 02:47:33 UTC
Have you looked at GlusterFS or iRODS? These solutions seem tailor made for
you.
Both will give you a single namespace, and both can be run on top of ZFS for
its data integrity.
I'd run them on Linux now using the LLNL port, but there are folks running
both on Illumos.
GlusterFS will give you a single namespace and good POSIX access including
NFS.
iRODS was designed for archiving and gives you sophisticated metadata,
replication rules, better geo-clustering, etc.
Boy, I really want to thank you and all the other responders for sharing
their experiences and ideas. Very helpful.

I've looked at GlusterFS, yes, but not iRODS (yet). The one issue I've
seen with GlusterFS is that there is only a native client for Linux. While
Linux does cover the majority of our client platforms, I'm not a fan of
the Samba alternative for the Windows & Mac clients around here, and
last I heard the only NFS support is for NFS version 3 (the NFSv4 ACL's
are pretty useful in certain cases).

I've also been hearing about OpenAFS recently -- Robert Milkowski has
done a presentation about using that on top of Solaris-11/ZFS for
hosting petabytes of data. Might be worth a look, too.
http://milek.blogspot.de/2012/10/running-openafs-on-solaris-11-x86-zfs.html

Thanks and regards,

Marion
Trey Palmer
2013-03-16 04:39:02 UTC
Marion,

Agreed, Gluster is much better when you are mostly dealing with Linux clients. They say NFSv4 is "coming soon".... :-/

And thank you, this is a great discussion.

BTW I have also gotten somewhat better throughput using direct connections to cheap SATA disks than using SAS expanders with nearline disks. We don't find that it makes any real difference serving NFS via ixgbe in real life though. Also we seem to run into a hard single-stream zfs send/recv throughput limit at 280MB/s.

We have a stack of 36-disk Supermicros built using desktop SATA drives hooked to 3 LSI 9201-16i's via 9 SFF-8087 cables. They perform in the same ballpark you stated, at least local on a new pool, and they have been rock solid thus far despite violating the pain-inspired SAS-only ZFS CW.

However, there has been considerable teeth gnashing because the combination of non-sequential, non-obvious disk locations and no SES makes disk replacement trying.

-- Trey
Richard Elling
2013-03-16 04:57:10 UTC
Permalink
Post by Marion Hakanson
Greetings,
Has anyone out there built a 1-petabyte pool?
Yes, I've done quite a few.
Post by Marion Hakanson
I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.
We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
Yes. NB, for the PHB, using 2^N is found to be less effective than 10^N.
Post by Marion Hakanson
I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.
NFS v4 or DFS (or even clever sysadmin + automount) offers single namespace
without needing the complexity of NFSv4.1, lustre, glusterfs, etc.
Post by Marion Hakanson
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Don't forget about backups :-)
-- richard


--

***@RichardElling.com
+1-760-896-4422
Richard Yao
2013-03-16 12:23:07 UTC
Permalink
Post by Richard Elling
Post by Marion Hakanson
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Don't forget about backups :-)
-- richard
Transferring 1 PB over a 10 gigabit link will take at least 10 days when
overhead is taken into account. The backup system should have a
dedicated 10 gigabit link at the minimum and using incremental send/recv
will be extremely important.
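
The 10-day estimate can be checked with a quick back-of-the-envelope sketch. The 90% efficiency factor below is an assumed stand-in for protocol overhead, not a measured value, and the function name is purely illustrative.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link running at the given efficiency."""
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400  # seconds per day

PB_DECIMAL = 10**15      # "marketing" petabyte
PB_BINARY = 2**50        # power-of-two petabyte
TEN_GBIT = 10 * 10**9    # 10 gigabit per second

# Line rate, no overhead at all: a bit over 9 days.
ideal = transfer_days(PB_DECIMAL, TEN_GBIT)

# Assumed 90% usable throughput: a bit over 10 days.
derated = transfer_days(PB_DECIMAL, TEN_GBIT, efficiency=0.9)

# Same assumption for a power-of-two petabyte: between 11 and 12 days.
binary = transfer_days(PB_BINARY, TEN_GBIT, efficiency=0.9)

print(f"{ideal:.1f} {derated:.1f} {binary:.1f}")
```

Even at full line rate a complete copy takes more than nine days, which is why a full resend is something to avoid and incremental send/recv matters.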




-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Yao
2013-03-16 11:14:33 UTC
Permalink
Memory-wise, ZFS should have no special memory requirements unless you
are doing data deduplication. More RAM is always better, but the system
will likely be fine with even a relatively small amount of RAM, such as 1GB.

With that said, this reminds me about a theoretical scaling issue in the
code about which I vaguely remember reading. The issue is that the sync
thread should limit pool throughput in pools that consist of thousands
of top level vdevs. A 1-petabyte pool built using 80MB hard drives from
the late 1980s / early 1990s would likely suffer from this issue. If you
use modern drives (which I am >99.999% certain that you will), the pool
should be too small to exhibit this problem. Since "low performance" in
your situation is fine, things should be fine even if you were to use
80MB hard drives.
Schweiss, Chip
2013-03-17 01:05:22 UTC
Permalink
I just recently built an OpenIndiana 151a7 system that is currently 1/2 PB
that will be expanded to 1 PB as we collect imaging data for the Human
Connectome Project at Washington University in St. Louis. It is very much
like your use case as this is an offsite backup system that will write once
and read rarely.

It has displaced a BlueArc DR system because their mechanisms for syncing
over distances could not keep up with our data generation rate. The fact
that the BlueArc cost 5x as much per TB as the homebrew system also helped
the decision.

It is currently 180 4TB SAS Seagate Constellations in 4 Supermicro JBODs.
The JBODs currently are in two branches, each cascading only once. When
expanded, 4 JBODs will be on each branch. The pool is configured as 9 vdevs
of 19 drives in raidz3. The remaining disks are configured as hot
spares. Only metadata is cached, in 128GB RAM and 2 480GB Intel 520 SSDs
for L2ARC. Sync (ZIL) is turned off since the worst that would happen is
that we would need to rerun an rsync job.
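
As a rough check on the layout described above (180 4TB drives, nine 19-drive raidz3 groups, the rest as spares), the arithmetic can be sketched as follows. This only counts data drives; it ignores raidz allocation padding, metadata and reserved space, so treat the result as an upper bound rather than what the pool will actually report.

```python
DRIVES_TOTAL = 180     # drives in the four JBODs
GROUPS = 9             # raidz3 groups in the pool
DRIVES_PER_GROUP = 19  # drives in each raidz3 group
PARITY = 3             # raidz3 gives up three drives per group to parity
DRIVE_TB = 4           # decimal terabytes per drive

data_drives = GROUPS * (DRIVES_PER_GROUP - PARITY)    # drives' worth of data capacity
hot_spares = DRIVES_TOTAL - GROUPS * DRIVES_PER_GROUP  # drives left outside the groups

raw_tb = DRIVES_TOTAL * DRIVE_TB
approx_usable_tb = data_drives * DRIVE_TB
approx_usable_tib = approx_usable_tb * 10**12 / 2**40  # same figure in power-of-two units

print(data_drives, hot_spares, raw_tb, approx_usable_tb, round(approx_usable_tib))
```

That gives 144 data drives and 9 spares, or roughly 576 TB (about 524 TiB) before overhead, which matches the "1/2 PB" description.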

Two identical servers were built for a cold standby configuration. Since
it is a DR system the need for a hot standby was ruled out since even
several hours downtime would not be an issue. Each server is fitted with 2
LSI 9207-8e HBAs configured as redundant multipath to the JBODs.

Before putting it into service I ran several iozone tests to benchmark the
pool. Even with really fat vdevs the performance is impressive. If
you're interested in that data let me know. It has many hours of idle
time each day so additional performance tests are not out of the question
either.

Actually I should say I designed and configured the system. The system was
assembled by a colleague at UMINN. If you would like more details on the
hardware I have a very detailed assembly doc I wrote and would be happy to
share.

The system receives daily rsyncs from our production BlueArc system. The
rsyncs are split into 120 parallel rsync jobs. This overcomes the latency
slowdown that TCP suffers from, and we see total throughput between
500-700Mb/s. The BlueArc has 120TB of 15k SAS tiered to NL-SAS. All
metadata is on the SAS pool. The ZFS system outpaces the BlueArc on
metadata when rsync does its tree walk.
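
For anyone wanting to try the parallel rsync approach, here is a minimal sketch of one way to fan the work out. How the 120 jobs are actually split is not described above, so the one-job-per-top-level-directory split is an assumption, as are the function names and the example paths. Only the standard `rsync -a` archive flag is used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath

def build_rsync_commands(source_dirs, dest_root):
    """Return one rsync argv per source directory (assumed split, one job each)."""
    commands = []
    for src in source_dirs:
        name = PurePosixPath(src).name  # last path component, trailing slash ignored
        # A trailing slash on the source copies the directory's contents
        # into the matching directory under dest_root.
        commands.append(
            ["rsync", "-a", f"{src.rstrip('/')}/", f"{dest_root.rstrip('/')}/{name}/"]
        )
    return commands

def run_parallel(commands, workers=120):
    """Run the commands concurrently and return their exit codes in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda argv: subprocess.run(argv).returncode, commands))

# Hypothetical paths, for illustration only:
example = build_rsync_commands(["/mnt/bluearc/projA", "/mnt/bluearc/projB/"], "/tank/backup")
```

Threads are enough here because each worker just waits on an external rsync process; the parallelism comes from having many TCP streams in flight at once.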

Given all the safeguards built into ZFS, I would not hesitate to build a
production system at the multi-petabyte scale. If a channel to the disks is
no longer available it will simply stop writing and data will be safe.
Given the redundant paths, power supplies, etc, the odds of that happening
are very low. The single points of failure left when running a single
server remain at the motherboard, CPU and RAM level. Build a hot standby
server and human error becomes the most likely failure.

-Chip
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Darren Reed
2013-03-17 10:19:14 UTC
Permalink
What if I said I wanted a 1PB pool but better than three nines uptime?

Darren
Schweiss, Chip
2013-03-17 12:48:43 UTC
Permalink
3 nines is a little less than 9 hours a year. If you have well trained
staff nearby at all times who can address issues, this should be
achievable with a cold standby server. If not, you had better build a hot
standby or use commercial clustering software.
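
The downtime budget behind "N nines" is easy to work out. A small sketch, assuming a 365-day year (the function name is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (3 -> 99.9%)."""
    unavailable_fraction = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailable_fraction

for n in (2, 3, 4, 5):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Three nines allows about 8.76 hours a year; four nines leaves under an hour, which is why it is hard to reach without automated fail-over.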

Hot standby and clustering both will require all SSDs to be SAS, not SATA, so
they can be dual connected. This is a significant price jump on those
components, but not so much in the scope of the entire system.

Also never use things like dedupe that will significantly increase the pool
import time in the event of a fail-over.

4 nines or better and you realistically need the ability to fail over to
another system in another facility altogether. ZFS is very good at
keeping systems in sync with frequent snap, send & receive as long as your
network connection between the two locations can keep up with the rate of
data change. Your software accessing the storage needs to be built to handle
fail-over just as well as the storage system.



-Chip
Richard Elling
2013-03-17 15:33:29 UTC
Permalink
Post by Schweiss, Chip
3 nines is a little less than 9 hours a year. If you have well trained staff near by all the times that can address issues this should be achievable with a cold standby server. If not you had better build a hot standby or use commercial clustering software.
A well-managed system can easily achieve 98% annualized uptime. This implies that a
poorly-managed system is not likely to achieve 98% annualized uptime, regardless of
the chosen hardware. Good idea: start with a well-managed system :-)
Post by Schweiss, Chip
Hot standby and clustering both will require all SSDs to be SAS not sata so they can be dual connected. This is a significant price jump on those components, but not so much in the scope of the entire system.
Plan for about $20 per disk. This number has been decreasing over time and some of the
newer disks have almost no price differential.
Post by Schweiss, Chip
Also never use things like dedupe that will significantly increase the pool import time in the event of a fail-over.
Dedup does not significantly impact pool import time. The biggest contributors to import
time are sharing services (NFS, SMB, COMSTAR) and device enumeration (volumes, snapshots).

For some clusters, SCSI reservations are also significant -- and difficult to safely design around.
It is not uncommon for the cluster's SCSI reservation management to take longer than the
pool import.
Post by Schweiss, Chip
4 nines or better and you realistically need the ability to fail over to another system in another facility all together. ZFS is very good at keeping systems in sync with frequent snap, send & receive as long as your network connection between the two locations can keep up with the data rate change. Your software accessing the storage needs to be built to handle fail-over just as well as the storage system.
Yep
-- richard
--
ZFS and performance consulting
http://www.RichardElling.com
Darren Reed
2013-03-18 10:25:08 UTC
Permalink
Post by Richard Elling
...
Dedup does not significantly impact pool import time. The biggest contributors to import
time are sharing services (NFS, SMB, COMSTAR) and device enumeration (volumes, snapshots)
For some clusters, SCSI reservations are also significant -- and difficult to safely design around.
It is not uncommon for the cluster's SCSI reservation management to take longer than the
pool import.
4 nines or better and you realistically need the ability to fail over to another system in another facility all together. ZFS is very good at keeping systems in sync with frequent snap, send & receive as long as your network connection between the two locations can keep up with the data rate change. Your software accessing the storage needs to be built to handle fail-over just as well as the storage system.
Yep
Continuous zfs send/receive does not seem like a winning solution to me, rather a hack. Similarly depending on client smarts is not a realistic option.

Are folks using the Nexenta "HA plugin" to get 4 nines or better?

Darren
Schweiss, Chip
2013-03-18 16:32:39 UTC
Permalink
On Sun, Mar 17, 2013 at 10:33 AM, Richard Elling
Post by Schweiss, Chip
Hot standby and clustering both will require all SSDs to be SAS not sata
so they can be dual connected. This is a significant price jump on those
components, but not so much in the scope of the entire system.
Plan for about $20 per disk. This number has been decreasing over time and some of the
newer disks have almost no price differential.
Yes, for spinning disks when you're talking about enterprise grade disks.
What I was referring to is SSDs for ZIL and L2ARC so they can be part of
forced import in a hot fail-over situation. SAS SSD costs are
significantly higher than SATA SSD costs. If you have SATA SSDs they cannot be
dual connected to two hosts and must be dropped to fail over.

The other option would be to have SATA SSDs on both hosts and drop and
re-add them as part of a fail-over process.

-Chip



Richard Elling
2013-03-18 18:43:19 UTC
Permalink
Post by Richard Elling
Hot standby and clustering both will require all SSDs to be SAS not sata so they can be dual connected. This is a significant price jump on those components, but not so much in the scope of the entire system.
Plan for about $20 per disk. This number has been decreasing over time and some of the
newer disks have almost no price differential.
Yes, for spinning disks when your talking about enterprise grade disks. What I was referring to is SSDs for ZIL and L2ARC so they can be part of forced import in a hot fail-over situation. SAS SSDs costs are significantly higher than SATA SSDs. If you have SATA SSDs they cannot be dual connected to two hosts and must be dropped to fail over.
SATA/SAS interposers cost less than $25. Some "SAS SSDs" simply put the interposer
onboard the disk. But remember, enterprise-class is not the same as consumer-grade
plus bubble gum and baling wire.
Post by Richard Elling
The other option would be to have SATA SSDs on both hosts and drop and re-add them as part of a fail-over process.
Having watched many people try this, I can say conclusively that this method is
penny wise and pound foolish. If you need HA, you need to properly do HA.
-- richard

--

***@RichardElling.com
+1-760-896-4422












Keith Wesolowski
2013-03-18 19:19:32 UTC
Permalink
Post by Richard Elling
Post by Richard Elling
The other option would be to have SATA SSDs on both hosts and drop and re-add them as part of a fail-over process.
Having watched many people try this, I can say conclusively that this method is
penny wise and pound foolish. If you need HA, you need to properly do HA.
That's being kind. This approach is a first-class disaster even at
small scale, and becomes completely unworkable at the petabyte scale.
SATA and multi-head storage clustering are incompatible, period. It's
impossible to address the affiliation problem in any kind of sensible
way; in the best case, takeovers, failbacks, and anything involving
probing the fabric or adding devices to the pool will be extremely slow.
The only option is to replace expanders with DA RAID controllers in
JBODs, which (a) works poorly with ZFS, (b) is expensive, and (c) either
reduces redundancy or serves only to push the problem from observable
open-source software into unobservable closed-source firmware. By the
time you've done that, using SAS disks would have been cheaper anyway.

If you care enough about availability to spend an extra $100k+ in HW and
licensing on the snake oil that is "HA" clustering, you surely care
enough to spend a lot less than that to reduce takeover time and the
risk of serious bugs. At Sun we found that ditching SATA resulted in
over an order of magnitude improvement in the speed of these coordinated
activities, with far fewer bugs. At the scale discussed in this thread,
the difference could easily be 2 orders of magnitude, and the bugs that
we encountered/introduced trying to deal with SATA precluded us entirely
from supporting large-scale fabrics. I covered this pretty thoroughly
in
https://blogs.oracle.com/wesolows/entry/7000_series_takeover_and_failback.

Bottom line: SATA disks are the most expensive disks on the planet.
Tim Cook
2013-03-18 19:40:12 UTC
Permalink
On Mon, Mar 18, 2013 at 2:19 PM, Keith Wesolowski <
Post by Schweiss, Chip
Post by Richard Elling
Post by Schweiss, Chip
The other option would be to have SATA SSDs on both hosts and drop and
re-add them as part of a fail-over process.
Post by Richard Elling
Having watched many people try this, I can say conclusively that this
method is
Post by Richard Elling
penny wise and pound foolish. If you need HA, you need to properly do HA.
That's being kind. This approach is a first-class disaster even at
small scale, and becomes completely unworkable at the petabyte scale.
SATA and multi-head storage clustering are incompatible, period. It's
impossible to address the affiliation problem in any kind of sensible
way; in the best case, takeovers, failbacks, and anything involving
probing the fabric or adding devices to the pool will be extremely slow.
The only option is to replace expanders with DA RAID controllers in
JBODs, which (a) works poorly with ZFS, (b) is expensive, and (c) either
reduces redundancy or serves only to push the problem from observable
open-source software into unobservable closed-source firmware. By the
time you've done that, using SAS disks would have been cheaper anyway.
If you care enough about availability to spend an extra $100k+ in HW and
licensing on the snake oil that is "HA" clustering, you surely care
enough to spend a lot less than that to reduce takeover time and the
risk of serious bugs. At Sun we found that ditching SATA resulted in
over an order of magnitude improvement in the speed of these coordinated
activities, with far fewer bugs. At the scale discussed in this thread,
the difference could easily be 2 orders of magnitude, and the bugs that
we encountered/introduced trying to deal with SATA precluded us entirely
from supporting large-scale fabrics. I covered this pretty thoroughly
in
https://blogs.oracle.com/wesolows/entry/7000_series_takeover_and_failback.
Bottom line: SATA disks are the most expensive disks on the planet.
Sorry, a bit off topic, but has anyone thought about gathering all the blog
posts from your (the collective you ex-Sun employees) days at Sun and
putting them into one searchable archive? I fear that they may be lost
forever if we leave it up to the whims of Oracle to decide what stays and
what goes. And I find the posts both extremely informative and interesting
reading material. If it's already been done, link?

--Tim



Timothy Coalson
2013-03-18 19:52:52 UTC
Permalink
Post by Schweiss, Chip
The other option would be to have SATA SSDs on both hosts and drop and
re-add them as part of a fail-over process.
Having watched many people try this, I can say conclusively that this method is
penny wise and pound foolish. If you need HA, you need to properly do HA.
A second point, perhaps beating a dead horse, but if you do this and your
ZIL is on one of those SATA SSDs, it is useless, because the data written
to it doesn't migrate to the failover machine before pool import. Thus, it
may as well have never been written at all (i.e., equivalent to
sync=disabled, but without the performance benefits), defeating the entire
purpose of the ZIL.

Tim



Schweiss, Chip
2013-03-18 20:01:05 UTC
Permalink
I realized that shortly after posting that idea. This was the reason I
ditched a hot standby for the system I was designing. Since it was a
backup and disaster recovery system, there was not much point in spending
several $K more for SAS SSDs to be dual connected.

If it ever gets escalated to production, SAS SSDs and a dual connected JBOD
are in order.

-Chip