Discussion:
Are separate pools fundamentally ill?
Attila Nagy
2013-09-25 14:11:30 UTC
Hi,

I've had two completely identical machines, both with 12 SATA disks.
They act as NFS servers for write-once (WORM-type) files.

Machine A had 6 zpools, each a two-disk mirror (zpool create fs1 mirror
da0 da1, zpool create fs2 mirror da2 da3 ...), while machine B had 1
zpool made of six two-disk mirror vdevs (zpool create fs mirror da0 da1
mirror da2 da3 mirror da4 da5 ...), with 6 file systems created on it to
match the directory layout.
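Spelled out (assuming, as the ellipses suggest, that the twelve disks run
da0 through da11 and the datasets follow the fs1..fs6 naming; both are
assumptions, not stated in the original), the two layouts would have
looked roughly like this:

  # Machine A: six independent two-disk mirror pools
  zpool create fs1 mirror da0 da1
  zpool create fs2 mirror da2 da3
  zpool create fs3 mirror da4 da5
  zpool create fs4 mirror da6 da7
  zpool create fs5 mirror da8 da9
  zpool create fs6 mirror da10 da11

  # Machine B: one pool of six mirror vdevs, plus six file systems
  zpool create fs mirror da0 da1 mirror da2 da3 mirror da4 da5 \
      mirror da6 da7 mirror da8 da9 mirror da10 da11
  zfs create fs/fs1    # ... through fs/fs6 (dataset names assumed)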

Each machine got a similar (real, not simulated) load. Each of the file
systems was loaded evenly (as far as the application can manage this);
there were no significantly more or less busy disk pairs in the
separate-zpool case on machine A.

Yet machine A, with 6 separate zpools, struggled under the load, while
machine B, with 1 zpool, could easily satisfy the requests.
I'm talking here about 40% average disk usage on machine A (50-70 IOPS
avg on each device) and well under 5% (!) (10-15 IOPS per device) on
machine B.
The difference was in writes; the read IOPS numbers were on par.

Also, every zfs operation (zpool list, zpool status, etc.) took ages to
run on machine A, while running fast on machine B.

The OS is FreeBSD stable/9, r255573.

Any ideas what causes this? Is this a ZFS issue at all, or a FreeBSD one?

ps: I'm off list, so please keep me CC-ed.
Steven Hartland
2013-09-25 19:03:55 UTC
If you're using mirrored zpools you might want to try the following
patch, which adjusts the way vdevs are chosen to take the locality of
requests into account; this has been demonstrated to significantly
improve read throughput on mirrored pools.

It's against a recent HEAD, so it may need tweaking to apply against
stable/9.

http://blog.multiplay.co.uk/dropzone/freebsd/zfs-mirror-load.patch
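
For anyone who wants to try it, a rough sketch of applying such a patch
to a FreeBSD source tree and rebuilding (the -p level and kernel config
name are assumptions, and since the patch targets HEAD, hunks may need
hand-adjustment on stable/9):

  cd /usr/src
  fetch http://blog.multiplay.co.uk/dropzone/freebsd/zfs-mirror-load.patch
  patch -C -p0 < zfs-mirror-load.patch   # dry run first to see what applies
  patch -p0 < zfs-mirror-load.patch
  make buildkernel KERNCONF=GENERIC && make installkernel KERNCONF=GENERIC
  shutdown -r now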

Regards
Steve
George Wilson
2013-09-25 19:22:47 UTC
One thing that happens with multiple pools is that you don't get to
amortize the pool metadata updates like you do on a single pool. This
means that a round of writes touching every pool on machine A results in
6 metadata updates (one per pool), while machine B only needs to do a single update.
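
A hedged way to see this in practice is to watch per-pool write
operations side by side while the workload runs (pool names taken from
the original post; the 5-second interval is arbitrary):

  # Machine A: the six separate pools
  zpool iostat fs1 fs2 fs3 fs4 fs5 fs6 5
  # Machine B: the single pool, broken out per vdev
  zpool iostat -v fs 5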

- George
Andrew Galloway
2013-09-25 20:45:13 UTC
To answer your subject question: in a word, yes.

It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.

In addition to the comments so far and the things you witnessed, most zfs
tunables are (regrettably) global, and cannot be applied on a per-pool
basis, further impacting the supportability and efficiency of multi-pool
boxes. There are many reasons not to multi-pool on the same box, and very
few reasons to do so.

- Andrew
Schlacta, Christ
2013-09-25 21:22:23 UTC
Rpool + separate tank. 'Nuff said.
Richard Elling
2013-09-25 21:44:13 UTC
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs separate pools, but wouldn't also then be better done with each pool running on a different box. I can't recall the last time I actively suggested such a thing to a customer. I am pretty keen on one pool per system.
Strongly disagree. It is quite common to have two or more pools in a system.
Post by Andrew Galloway
In addition to the comments so far and the things you witnessed, most zfs tunables are (regrettably) global, and cannot be applied on a per-pool basis, further impacting the supportability and efficiency of multi-pool boxes. There are many reasons not to multi-pool on the same box, and very few reasons to do so.
There are no global tunables that need to be set here. Out of the box, ZFS works pretty well
for general-purpose workloads with many pools. I can think of no global ZFS tunable that
would apply in this case... more below.
I suspect hardware. First, check that the read and write performance of the disks is
similar on both systems. "iostat -x" is your friend here, though it combines read and write
latency into a single metric: svc_t. You can see the read/write ratio in the stats, too, so you
can infer a ratio. Or run a dtrace equivalent that separates reads from writes.
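
For example, a rough sketch of such a DTrace one-liner (io provider field
names as on illumos; on FreeBSD, gstat or iostat -x -w 1 already reports
reads and writes separately and may be the simpler option):

  dtrace -n '
  io:::start { ts[args[0]->b_edev, args[0]->b_blkno] = timestamp; }
  io:::done /ts[args[0]->b_edev, args[0]->b_blkno]/ {
      /* split latency distributions by direction */
      @[args[0]->b_flags & B_READ ? "read (us)" : "write (us)"] =
          quantize((timestamp - ts[args[0]->b_edev, args[0]->b_blkno]) / 1000);
      ts[args[0]->b_edev, args[0]->b_blkno] = 0;
  }'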

Without knowing more about the hardware in use, be aware that we can see very different
write performance profiles from HBAs with write caches enabled vs disabled.

Other problems that cause inconsistent write performance include:
+ noise in the interconnects, which can be asymmetric and cause retries in one direction (one way to check is sketched after this list)
+ vibration
+ bad power supplies
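
A hedged way to check for the interconnect/retry issue above (assumes
smartmontools is installed; SATA disks behind a SAS HBA may need '-d sat'):

  # A steadily rising UDMA CRC error count usually points at cabling or
  # backplane problems rather than the drive itself.
  smartctl -A /dev/da0 | egrep -i 'CRC|Reallocated|Pending'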

Secondly, the other potential cause that comes to mind is one of the machines being very
low on memory, such that the ARC is less than 500 MB or so. The ARC is used to buffer
writes and can buffer more with more memory.
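
On FreeBSD, a quick way to sanity-check this (sysctl names as commonly
seen on stable/9; the arcstats OID is from memory and may vary slightly
by release):

  sysctl vfs.zfs.arc_min vfs.zfs.arc_max      # configured ARC bounds
  sysctl kstat.zfs.misc.arcstats.size         # current ARC size in bytes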
-- richard

Andrew Galloway
2013-09-25 22:05:20 UTC
Aside from syspool (rpool) and then a data pool, it is not 'quite common'
to have two or more pools in a system, in my experience. In fact, when I
run into it, it is almost always a mistake, often made due to
misconceptions about ZFS, or one that eventually turns out to be a regretful choice.

Are you suggesting that in this case it would be better to have 3 separate
2-disk mirror pools instead of 1 pool with 3 mirror vdevs? I most strongly
disagree with such an assertion unless there are some criteria justifying
it that I have not heard. The single pool of 3 vdevs will be both more
efficient and able to bring more power to bear on any given client than
would 3 separate pools. The only benefit of the 3 separate pools is some
separation of load, such that one client hitting pool A can't starve out
clients hitting pool B; but not only is that rarely, if ever, enough to
justify such separation, Attila specifically stated that the workload was
balanced. Thus, one pool trumps multiple pools, both as a general rule and
in this specific case of a number of clients sending balanced load at the
storage.

My comment about tuneables was a generic one in response to the question
"are multiple pools fundamentally ill". That many of the tuneables /are/
global is a compelling reason to answer yes. Suggesting that the default
tunings are usually alright is not only irrelevant, it's not necessarily
factual. It may be factual when a single data pool is in use on the system,
but when you have 2? 3? 10? Some of them start making less sense, and
certainly in many cases if you did have a need to tune them on just a
single pool, you'd be in trouble.

As for Attila's specific problem and whether it may be hardware: could be. My
response was to the generic question. Don't do multiple pools; stick to one data
pool per box, and what few specific exceptions might exist merely prove the
general rule. :)

- Andrew
Richard Elling
2013-09-25 22:47:05 UTC
Aside from syspool (rpool) and then a data pool, it is not 'quite common' to have two or more pools in a system, in my experience. In fact, when I run into it, it is also almost always a mistake, often done due to misconceptions about ZFS or eventually ended up being a regretful choice.
Are you suggesting that in this case, it would be better to have 3 separate 2-disk mirror pools instead of 1 pool with 3 mirror vdevs? I most strongly disagree with such an assertion unless there's some criteria that would justify it that I have not heard. The single pool of 3 vdevs will be both more efficient and be able to bring more power to bear on any given client than would 3 separate pools. The only benefit of the 3 separate pools is some separation of load, such that one client hitting pool A can't starve out clients hitting pool B, but that's not only rarely if ever a reason to really justify such separation, Attila specifically stated a balanced workload was going on. Thus, one pool trumps multiple pools, both as a general rule and in this specific case of a number of clients passing balanced load at the storage.
AG, I see your point and how it relates to this specific case. Until recently I also advocated a
pool approach based on dependability (RAID), performance, and isolation considerations as the
primary decision points. And then I found a use case where we still needed separate pools
with the same load-balanced dependability and performance characteristics -- root cause was an
app that doesn't scale when presented with petabytes of storage. So, as usual in systems
engineering, YMMV :-)
My comment about tuneables was a generic one in response to the question "are multiple pools fundamentally ill". That many of the tuneables /are/ global is a compelling reason to answer yes. Suggesting that the default tunings are usually alright is not only irrelevant, it's not necessarily factual. It may be factual when a single data pool is in use on the system, but when you have 2? 3? 10? Some of them start making less sense, and certainly in many cases if you did have a need to tune them on just a single pool, you'd be in trouble.
I'll still disagree here, but more from the perspective of striving to remove the need for
global tunables. That work continues, and new developments, such as the improved write
throttle, are examples of good ideas from the community. If you have a list, let's start a
new thread and try to tackle them.
-- richard
Andrew Galloway
2013-09-25 23:52:06 UTC
I am in violent agreement with your last paragraph. Across all
distros/implementations, too. The global tunables should all go away, no
matter which OS you're running. Perhaps an Open-ZFS project, directive,
or.. whatever it would be?

But we're probably a bit off-course now. :)

Attila - I agree with Richard on validating that the boxes are not somehow
dissimilar beyond the pool configs, such as by a hardware fault. Perhaps
swap all the disks from each box to the other, and then re-run?
That would show whether the problem follows the disks or not, helping to
narrow down potential culprits.
Garrett D'Amore
2013-09-26 01:21:46 UTC
I am in violent agreement with your last paragraph. Across all distros/implementations, too. The global tunables should all go away, no matter which OS you're running. Perhaps an Open-ZFS project, directive, or.. whatever it would be?
Actually, there are some tunables that make sense, but most of them are properly either per-dataset (and we have those in properties!) or per-vdev. (The per-vdev tunables that make sense are things like ashift, maxpending, etc.) And ideally for most people they would self-tune to reasonable values.

I see little justification in a per-pool tunable.

- Garrett
Andrew Galloway
2013-09-26 03:04:36 UTC
I wouldn't rule out, as a blanket statement, some of them making the most
sense at a per-pool level; a few I can think of probably have no choice
but to be per-pool. But as I said, 'global tunables should all go away';
whether it makes more sense for them to be per-dataset, per-vdev, or
per-leaf-vdev is a criterion to determine for each one, I would think. I
admit there are probably a few that either do make sense as global, or
might prove too difficult to make less than global in scope, but there's a
slew of them that come up regularly in my world (and that I don't believe
have been made more granular in releases later than the one NexentaStor is
on, as yet).

Off the top of my head, ones that probably make the most sense as per-pool
tunables, but that as far as I know are still global, would be: anything
related to resilver or scrub (though perhaps an argument could be made for
per-vdev for resilver, I can't think offhand why you'd ever want to modify
it on a per-vdev basis in any sane configuration... perhaps for insane
ones, where the vdevs are not equal in some way); write throttling and
limit stuff, as well as the txg tunables (I haven't had time to look at
the new code from Delphix, but as I (perhaps mis-)recall it did still have
some sort of tunables?); prefetch stuff (though again an argument might be
made for more granular, I just can't think of a good reason offhand);
zfs_nocacheflush; zvol_immediate_write_sz; and I'm sure more I'm not
thinking of right now.

- Andrew
Attila Nagy
2013-09-26 12:31:40 UTC
On 09/25/13 23:44, Richard Elling wrote:
Post by Richard Elling
I suspect hardware. First, check that the read and write performance of the
disks is similar on both systems. "iostat -x" is your friend here, though it
combines read and write latency into a single metric: svc_t. You can see the
read/write ratio in the stats, too, so you can infer a ratio. Or run a dtrace
equivalent that separates reads from writes.
The hardware is now the same.
Long story short: we've had some strange performance issues with separate
zpools on quite different machines.
The IOPS requirement for the same workload was sometimes an order of
magnitude higher than on UFS.
Then I installed a machine with one pool instead of several smaller ones,
and that machine had IOPS only somewhat above UFS levels.
I picked two completely identical machines and compared them; you can read
the findings in my original e-mail.
Then I reinstalled the separate-zpool machine as a single-zpool one, and it
started to work like a charm (done yesterday, so I couldn't yet write about
that).
Post by Richard Elling
Secondly, the other potential cause that comes to mind is one of the
machines being very low on memory, such that the ARC is less than 500 MB or
so. The ARC is used to buffer writes and can buffer more with more memory.
The first machine on which I could observe this (I have used ZFS since
around 2005, but never used more than one (busy) pool until recently) had
8 GB of RAM; FreeBSD autotunes to these:
vfs.zfs.arc_min: 899950592
vfs.zfs.arc_max: 7199604736
Since it's an NFS server, it can use all of that.
Garrett D'Amore
2013-09-25 22:10:41 UTC
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs separate pools, but wouldn't also then be better done with each pool running on a different box. I can't recall the last time I actively suggested such a thing to a customer. I am pretty keen on one pool per system.
I disagree. For example, an Oracle database might store transaction logs on an all-flash pool, but leave other data in a more conventional spinning-media pool. Unless all your consumers on a given box are of the same class and have identical storage needs, separate pools are definitely reasonable.

In a bigger box with multitenancy, using separate storage pools can give you a different fault isolation boundary, allowing some services to stay up while others tank. Again, this is more true when you have different data class types: for example, a pool backing a Swift cluster node needs no redundancy (Swift handles redundancy at the application layer), whereas a pool backing NFS needs a bit more care.

Allowing later expansion is another reason to use multiple pools. While a stripe of mirrors is easy to manage, upgrading such a stripe is harder the larger the stripe gets. And there may be performance implications from extremely wide stripes. And RAID-Z is effectively non-upgradeable, but adding more RAID-Z pools is a cheap and easy way to quickly add storage.

This becomes all the more important in the face of SANs, where the number of spindles available to a single chassis may be enormous.
Post by Andrew Galloway
In addition to the comments so far and the things you witnessed, most zfs tunables are (regrettably) global, and cannot be applied on a per-pool basis, further impacting the supportability and efficiency of multi-pool boxes. There are many reasons not to multi-pool on the same box, and very few reasons to do so.
Some of the tunables are indeed global, but not all of them. And there is work underway to further allow more fine grained tuning.

I think you have a very narrow view of how storage is used, and I would be cautious about applying this particular advice without a careful understanding of the problem(s) you are trying to solve. For a simple small configuration a single pool is often (perhaps even usually) best. It's far from true that this advice should be taken universally, though.

Now, that said, the configuration described below (6 pools of two disks mirrored) seems … iffy. I suspect that this is not an ideal configuration, but I won't say for certain without understanding more carefully the underlying reasons for that configuration.

One thing that is true: using small pools means that a bad (high latency) disk is going to be more tragic for I/O than in a larger pool, because in the larger pool ZFS can distribute I/Os based on actual device readiness, and it's more likely to be able to find an idle disk if it has 12 to choose from instead of just 2. The read/write pattern matters as well, btw.

- Garrett
Andrew Galloway
2013-09-25 22:36:24 UTC
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
I disagree. For example, an Oracle database might store transaction logs
on an all flash pool, but leave other data in a more conventional spinning
media pool. Unless all your consumers on a given box are of the same class
and have identical storage needs, then separate pools are definitely
reasonable.
Not only is this in and of itself actually not a common use-case, but it is
a great example of a specific requirement that proves the general rule. You
may indeed find it wise to create a separate, faster disk (or SSD) pool to
handle the logs while putting the rest onto a separate, slower pool. You
might also put that pool on another box, however, if you're at the sort of
scale where such a thing is even worth the effort and expense in the first
place, though I admit not always. A good specific example to prove the
general rule. You'd also still only be left with 2 real data pools on the
system, and hopefully with only one doing a lot of the work; if the other is
super busy, I go back to 'multiple systems'. Not 10 pools, for example.
Post by Andrew Galloway
In a bigger box with multitenancy, using separate storage pools can give
you a different fault isolation boundary, allowing some services to stay
up, while others tank. Again, this is more true when you can indicate you
have different data class types: for example a pool backing a Swift cluster
node needs no redundancy (Swift handles redundancy at the application
layer), whereas a pool backing for NFS needs a bit more care.
In almost every case where I see this cited as a reason for breaking the
pool, I again go to 'and on to another system, too'. Nothing wrong with
having lots of pools -- each on its own system. :) -- We have a working
example here of a university customer with many pools on one storage box,
built in this manner for I believe this specific reason, and to this day (a
multi-year issue) it still causes headaches. Would not do again. :)
Post by Andrew Galloway
Allowing later expansion is another reason to do multipools. While a
stripe of mirrors is easy to manage, upgrading such a strip is harder the
larger the stripe. And, there may be performance implications from
extremely wide stripes. And RAID-Z is effectively non-upgradeable, but
adding more RAID-Z pools is a cheap and easy way to quickly add storage.
Sure, but not really advisable. What is academically plausible and even
technically sane-sounding is not always what production finds to be a good
idea. Growing a storage system with a 24-disk raidz2 pool of 3 x 8-disk
vdevs by adding a whole new pool of 6 disks in a raidz1, and then later
again in another pool with 12 disks in a raidz3, and then again -- yeah. No.
Putting on my zfs sysadmin hat, I say this is insane. If you /must/ grow
that pool, add more 8-disk raidz2 vdevs to the existing pool (see the sketch
after this paragraph), with the understanding that it may take nigh on
forever to actually balance properly - but hey, at least there's a chance it
might... a new pool never will. Plus, if I need to grow, there's some chance
I need to grow an existing dataset... one that might be too big to fit on a
smaller new pool.
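
To make the 'add a vdev instead of a new pool' point concrete, a minimal
sketch (the pool and device names are placeholders, not from this thread):

  # grow the existing pool with another matching 8-disk raidz2 vdev
  zpool add tank raidz2 da24 da25 da26 da27 da28 da29 da30 da31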
Post by Andrew Galloway
This becomes all the more important in the face of SANs, where the number
of spindles available to a single chassis may be enormous.
Often, a single chassis with an enormous number of disks is a mistake. Scale
out, not up. The main area where this is not necessarily true is archival
-- where multiple pools would likely be completely pointless.
Post by Andrew Galloway
In addition to the comments so far and the things you witnessed, most zfs
tunables are (regrettably) global, and cannot be applied on a per-pool
basis, further impacting the supportability and efficiency of multi-pool
boxes. There are many reasons not to multi-pool on the same box, and very
few reasons to do so.
Some of the tunables are indeed global, but not all of them. And there is
work underway to further allow more fine grained tuning.
I know, and that will go a long way toward increasing the list of use-cases
where multiple pools on a single system may be OK.
However, what exists today and what will exist in a year are not the same thing. :)
Post by Andrew Galloway
I think you have a very narrow view of how storage is used, and I would be
cautious about applying this particular advice without a careful
understanding of the problem(s) you are trying to solve. For a simple
small configuration a single pool is often (perhaps even usually) best.
Its far from true that this advice should be taken universally though.
At the risk of responding to what appears to be something close to an ad
hominem attack, I like to think I have an extremely broad view of how
storage is used, as you should know, since you know what I do and where, and
the sort of clients I am commonly engaged with. I have been on many
hundreds, if not thousands, of /production/ ZFS SAN appliances of all
shapes, sizes, and colors.

I can find you lots of exceptions to any general rule, but that doesn't
mean the general rule isn't generally true; and generally speaking,
multiple pools are something to avoid until you're reasonably assured that
your use-case requires them and that they're not insane to do. Such specific
exceptions certainly exist. They're definitely exceptions, however.
Post by Andrew Galloway
Now, that said, the configuration described below (6 pools of two disks
mirrored) seems … iffy. I suspect that this is not an ideal configuration,
but I won't say for certain without understanding more carefully the
underlying reasons for that configuration.
I think I've made it clear that I find 'iffy' too weak a description
for such a thing. I can think of no reason to split at such a small scale,
especially when the split pools are composed of identical drives, hit by
identical clients (as Attila already described). :)
Post by Andrew Galloway
One thing that is true, using small pools means that a bad (high latency)
disk is going to be more tragic for I/O then in a larger pool, because in
the larger pool ZFS can distribute I/Os based on actual device readiness,
and its more likely that it will be able to find an idle disk if it has 12
to choose from instead of just 2. The read/write pattern matters as well,
btw.
Agreed. Another entry in what, if we spent time going over it, would
probably be a fairly extensive list of reasons to go for a single pool
instead of multiple pools within a single storage box, with some limited
exceptions (that, again, just prove the general rule).
Attila Nagy
2013-09-26 12:22:39 UTC
Permalink
Hi,
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most
zfs tunables are (regrettably) global, and cannot be applied on a
per-pool basis, further impacting the supportability and efficiency of
multi-pool boxes. There are many reasons not to multi-pool on the same
box, and very few reasons to do so.
Rationale: with X two disk mirror pools, it's much easier to handle the
loss of two drives in the same mirror, than with one pool with X mirrors
in it.
(mirrors with three disks, or raidz is not an option)
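One practical note on that trade-off: in a single pool, the loss of an
entire top-level mirror vdev takes the whole pool with it, while with split
pools only the affected pool is lost; either way, the realistic defence is
catching a degraded mirror early and replacing the bad disk quickly. A
minimal sketch, with 'fs' standing in for whichever pool is affected and
'da3'/'da12' as hypothetical failed and replacement devices:

  # Report only pools that are degraded or faulted:
  zpool status -x

  # Swap the failed half of a mirror for a spare disk; the pool keeps
  # serving data while it resilvers:
  zpool replace fs da3 da12

  # Watch resilver progress:
  zpool status fs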
Andrew Galloway
2013-09-26 18:45:03 UTC
Permalink
IMHO, that's a scary reason to rationalize split pools instead of a single
pool. I get what you're saying, you're reducing failure domain, but you're
also doing so in a way that introduces a variety of other problems,
inefficiencies, etc. My response would be: keep backups (and no, a snapshot
is not a backup) that meet your retention requirements, and go with a
single pool.
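A minimal sketch of the kind of backup Andrew means, replicating to a
second box; the dataset 'fs/fs1', the snapshot labels, 'backuphost' and
'backuppool' are all hypothetical names:

  # Take a recursive, point-in-time snapshot of the filesystem tree:
  zfs snapshot -r fs/fs1@daily-2013-09-26

  # Replicate it, child datasets and properties included, to another box:
  zfs send -R fs/fs1@daily-2013-09-26 | \
      ssh backuphost zfs receive -d backuppool

  # Later runs only need to send the changes since the previous snapshot:
  zfs send -R -i fs/fs1@daily-2013-09-26 fs/fs1@daily-2013-09-27 | \
      ssh backuphost zfs receive -d backuppool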

- Andrew
Hi,
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most zfs
tunables are (regrettably) global, and cannot be applied on a per-pool
basis, further impacting the supportability and efficiency of multi-pool
boxes. There are many reasons not to multi-pool on the same box, and very
few reasons to do so.
Rationale: with X two disk mirror pools, it's much easier to handle the
loss of two drives in the same mirror, than with one pool with X mirrors in
it.
(mirrors with three disks, or raidz is not an option)
Andrew Galloway
2013-09-26 18:49:27 UTC
Permalink
And before I get jumped on, let me rephrase slightly - that's a scary
reason to rationalize splitting a 6-disk pool down to 3 2-disk pools. It
would be, again IMHO, a potentially valid rationale if you were dealing
with much larger quantities of drives, and a very common and sane rationale
if the splitting of said pools also involved putting them on different
systems.
Post by Andrew Galloway
IMHO, that's a scary reason to rationalize split pools instead of a single
pool. I get what you're saying, you're reducing failure domain, but you're
also doing so in a way that introduces a variety of other problems,
inefficiencies, etc. My response would be: keep backups (and no, a snapshot
is not a backup) that meet your retention requirements, and go with a
single pool.
- Andrew
Hi,
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most
zfs tunables are (regrettably) global, and cannot be applied on a per-pool
basis, further impacting the supportability and efficiency of multi-pool
boxes. There are many reasons not to multi-pool on the same box, and very
few reasons to do so.
Rationale: with X two disk mirror pools, it's much easier to handle the
loss of two drives in the same mirror, than with one pool with X mirrors in
it.
(mirrors with three disks, or raidz is not an option)
Attila Nagy
2013-09-27 08:58:29 UTC
Permalink
Let's abstract from this.

My question is:
why do 6 two-way mirror zpools need an order of magnitude more
(write) IOPS (and feel more sluggish) for the same workload than a
zpool with 6 two-way mirrors?
Is this normal? Could this be better?
Is it possible to achieve this with configuration, or is it by design?

Or is this just FreeBSD (possibly with NFS), or my system?
Is there a known reason which could cause this?
Post by Andrew Galloway
IMHO, that's a scary reason to rationalize split pools instead of a single
pool. I get what you're saying, you're reducing failure domain, but you're
also doing so in a way that introduces a variety of other problems,
inefficiencies, etc. My response would be: keep backups (and no, a snapshot
is not a backup) that meet your retention requirements, and go with a
single pool.
- Andrew
Post by Attila Nagy
Hi,
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most
zfs tunables are (regrettably) global, and cannot be applied on a
per-pool basis, further impacting the supportability and efficiency of
multi-pool boxes. There are many reasons not to multi-pool on the same
box, and very few reasons to do so.
Rationale: with X two disk mirror pools, it's much easier to handle the
loss of two drives in the same mirror, than with one pool with X mirrors
in it.
(mirrors with three disks, or raidz is not an option)
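One way to start narrowing this down is to compare per-device write
behaviour on the two machines directly; a minimal sketch, assuming the pool
names from the layouts described earlier ('fs1'..'fs6' on machine A, 'fs'
on machine B):

  # Machine A: per-vdev I/O for the six small pools, every 5 seconds:
  zpool iostat -v fs1 fs2 fs3 fs4 fs5 fs6 5

  # Machine B: the same view for the single pool:
  zpool iostat -v fs 5

  # Per-disk IOPS, queue depth and latency as GEOM sees them:
  gstat

  # Server-side NFS operation counts, to confirm the offered load
  # really is comparable on both machines:
  nfsstat -s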
Chris Siebenmann
2013-09-26 19:58:49 UTC
Permalink
Andrew Galloway:
| It is exceptionally rare to run into a use-case that validly needs
| separate pools, but wouldn't also then be better done with each pool
| running on a different box. I can't recall the last time I actively
| suggested such a thing to a customer. I am pretty keen on one pool per
| system.

We use multiple pools on single machines here for what we feel are
sensible reasons. The short version is that we divide up space between
different groups by giving them separate pools. Groups have highly
variable space desires, we have a significant number of groups, and we
need to aggregate together both the fileserver hardware and the
number of disks in a single backend enclosure on our iSCSI SAN[*] for
cost reasons.

We feel that using one giant pool on each fileserver in which all
groups resided in (quota-controlled) space would have a number of
serious drawbacks, both technically and administratively. For example,
splitting different pools on different disks creates better performance
isolation between groups (and yes, we've seen a group's activities slow
down their pool noticeably).

I certainly hope that this is considered a legitimate and supported
usage case for Illumos ZFS, not something that is 'use at your own risk,
we don't like it, if it breaks you get to keep all the pieces'.

- cks
[*: to head off any snap reactions: the iSCSI backends export plain disks
to the fileservers, where ZFS handles all mirroring et al. We mirror
vdev disks between backends so that the loss of a single backend will
not offline or destroy any pools.
]
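For readers weighing this against a single pool: the single-pool
counterpart to per-group pools is per-dataset quotas and reservations,
which carve up space but not I/O. A minimal sketch, with 'tank' and the
group dataset names as hypothetical stand-ins:

  # One dataset per group inside a shared pool:
  zfs create tank/groupA
  zfs create tank/groupB

  # Cap group A at 2 TB and guarantee group B at least 1 TB:
  zfs set quota=2T tank/groupA
  zfs set reservation=1T tank/groupB

That covers the space side; the performance isolation Chris describes is
exactly what quotas do not give you, which is his point.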
Andrew Galloway
2013-09-26 20:52:56 UTC
Permalink
Post by Chris Siebenmann
| It is exceptionally rare to run into a use-case that validly needs
| separate pools, but wouldn't also then be better done with each pool
| running on a different box. I can't recall the last time I actively
| suggested such a thing to a customer. I am pretty keen on one pool per
| system.
We use multiple pools on single machines here for what we feel are
sensible reasons.
If you have reasons you find sensible, then you might be one of the
exceptions in my 'exceptionally rare'. :)
Post by Chris Siebenmann
The short version is that we divide up space between
different groups by giving them separate pools. Groups have highly
variable space desires, we have a significant number of groups, and we
need to aggregate together both the fileserver hardware and the the
number of disks in a single backend enclosure on our iSCSI SAN[*] for
cost reasons.
This sounds more and more legitimate, due to environmental and business
constraints, two factors that have to go into any architectural choice.
You'd have to get more specific for me to be certain of my stance on your
build - what constitutes 'number of disks', what is the iSCSI SAN enclosure
behind it, and so on (and this may not be the proper venue for it, I'd add).

Though I'd add that when you say 'highly variable space desires', I go back
to thinking single pool, simply because if you build a pool for one group
and their space usage ends up being 10%, while another group eats up 75%,
it would have been, purely from a space usage perspective, better to house
both groups on one pool of those combined drives. But it is only one factor
to consider.
Post by Chris Siebenmann
We feel that using one giant pool on each fileserver in which all
groups resided in (quota-controlled) space would have a number of
serious drawbacks, both technically and administratively. For example,
splitting different pools on different disks creates better performance
isolation between groups (and yes, we've seen a group's activities slow
down their pool noticably).
This is another one of the reasons I generally argue against this. You had
a pool slow down noticeably due to load. Were all the other pools on that
same box similarly slow at the same time? If not, then a single pool (or
fewer pools) of all those disks would have been able to better absorb the
client requests, balancing the load across the greater number of spindles
available instead of burning up a subset of them while the rest remained
underutilized. If you had idle disks while others were hot in a single
chassis, I personally find that inefficient; but again, this is only one
reason for not splitting, and it's hardly the deciding or only factor.
Post by Chris Siebenmann
I certainly hope that this is considered a legitimate and supported
usage case for Illumos ZFS, not something that is 'use at your own risk,
we don't like it, if it breaks you get to keep all the pieces'.
I'm not sure what you mean by 'supported'. Absolutely nothing, AFAIK, is
'supported' by illumos ZFS in the sense of not being 'use at your own risk'. I'm
not aware of any claims by illumos to provide assistance to you if
something goes wrong with your environment, beyond things like this mailing
list, which are answered completely voluntarily with no obligation. There's
certainly no /legal/ obligation to help you if something goes wrong. So in
that regard, regardless of what you've built, like-able or not,
'legitimate' or not, it /is/ 'use at your own risk, if it breaks you get to
keep all the pieces'. I don't mean to say you won't find help available
should you ask here, or on IRC, and so on - many people are more than
willing to take time to help out, but certainly if you mean a contractually
obligated support agreement, you'll need to go to a corporation that sells
such a thing, and if such an entity has agreed to support what you've got,
then one can hope they will do that. :)
Post by Chris Siebenmann
- cks
[*: to head off any snap reactions: the iSCSI backends export plain disks
to the fileservers, where ZFS handles all mirroring et al. We mirror
vdev disks between backends so that the loss of a single backend will
not offline or destroy any pools.
]
Chris Siebenmann
2013-09-26 21:13:48 UTC
Permalink
(I'm reordering bits of the message to put my most interesting replies
first.)

| Though I'd add that when you say 'highly variable space desires', I
| go back to thinking single pool, simply because if you build a pool
| for one group and their space usage ends up being 10%, while another
| group eats up 75%, it would have been, purely from a space usage
| perspective, better to house both groups on one pool of those combined
| drives. But it is only one factor to consider.

The pool sizes themselves vary based on what the groups want (and are
willing to fund). Groups usually use most of the space that they ask for
but how much they ask for varies tremendously (probably by a factor of
at least five). For obvious reasons we try to get groups to fund more
space when they start closing in on the size of their pool.
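A minimal sketch of the sort of check that makes 'closing in on the size of
their pool' visible, assuming one pool per group as described ('groupA' is
a hypothetical pool name):

  # Capacity summary for every pool on the fileserver:
  zpool list

  # Space breakdown (data, snapshots, reservations) within one pool:
  zfs list -o space -r groupA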

| > We feel that using one giant pool on each fileserver in which all
| > groups resided in (quota-controlled) space would have a number
| > of serious drawbacks, both technically and administratively. For
| > example, splitting different pools on different disks creates better
| > performance isolation between groups (and yes, we've seen a group's
| > activities slow down their pool noticably).
|
| This is another one of the reasons I generally argue against this. You
| had a pool slow down noticeably due to load. Were all the other pools
| on that same box similarly slow at the same time? If not, then a
| single pool (or fewer pools) of all those disks would have been able
| to better absorb the client requests, balancing the load across the
| greater number of spindles available instead of burning up a subset of
| them while the rest remained underutilized. [...]

Whether this particular tradeoff makes sense depends on what your
priorities are. In our environment it is more important to preserve pool
and system responsiveness for other groups rather than to allow one
group to saturate every last IOP theoretically available in the overall
system. So it is an active feature that those other spindles are sitting
under-utilized and delivering fast performance to other groups in this
situation.

(We would actually like *more* separation than we currently get, but
again costs come into the picture.)

| > I certainly hope that this is considered a legitimate and supported
| > usage case for Illumos ZFS, not something that is 'use at your
| > own risk, we don't like it, if it breaks you get to keep all the
| > pieces'.
|
| I'm not sure what you mean by 'supported'. Absolutely nothing, AFAIK,
| is supported by 'illumos ZFS', in that it isn't 'use at your own
| risk'. [...]

I meant 'supported' in the open source sense, as in 'the developers
will not tell you that you are crazy and will try to make the general
usage case work'. My impression is that there are definitely usage cases
that the ZFS developers consider 'not supported' in this sense and that
they aren't devoting any particular effort to make really work (for
example, using ZFS in small or very small amounts of memory). My hope
is that 'many pools' is not such a case.

| This sounds more and more legitimate, due to environmental and
| business constraints, two factors that have to go into any
| architectural choice. You'd have to get more specific for me to be
| certain of my stance on your build - what constitutes 'number of
| disks', what is the iSCSI SAN enclosure behind it, so on (and this may
| not be the proper venue for it, I'd add).

People who are interested in reading more details about our (current)
environment can see some writeups of bits of its size and scope (and
design goals, partly through links in these pages):

http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFileserverSetup
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurScaleII
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurCommodityFileservers

We're currently in the process of turning over the hardware and parts of
the software environment (eg moving from Solaris 10 to Illumos) and likely
fiddling with bits around the edges, such as:

http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSLocalL2ARCTrick

- cks
Andrew Galloway
2013-09-26 21:29:45 UTC
Permalink
I'll take a look at the links when time allows, but I wanted to point out
again that an 'exceptionally rare' comment means there are exceptions.
The reason I'm pointing this out, and the reason I've argued this, is that
what I absolutely do not want is for someone new to ZFS, designing their
build, to find a thread where everyone seems completely OK with tons of
pools on a system, with no particular context as to why, and to think it's
a common architecture. It isn't. It just isn't. I'm not saying that Chris'
or anyone else's current or upcoming build that does so is /wrong/; I'm
saying that /in general/ you should avoid multiple pools in the same
system, and treat that as a design constraint you break only for specific
reasons. Don't just opt for lots of pools because it looks like fun.

If you are knowledgeable in ZFS-fu, or are in communication with someone
who is while building out your design, and multiple pools are part of it,
you're probably OK as you or they know what you're getting into and you
hopefully know why you're splitting it up and why it's necessary. If you're
brand new and don't yet have any significant ZFS-fu, you should be taking
away from this thread that by default you should be building a single data
pool on the system, and only splitting it up if you've got significant and
sensible reasons why, and should perhaps even solicit advice from more
knowledgeable ZFS folks before committing to splitting.

I also want to add -- this doesn't mean NOT splitting is always a great
idea, either, /especially/ if the split involves putting the two pools on
different heads. Scale out, not up. Go with smaller pools (up to 96 disks
or so is my /general/ advice) and build more actual systems to put them in,
as opposed to putting 1000 disks in one pool on one box (almost never a
good idea).

- Andrew
Post by Chris Siebenmann
(I'm reordering bits of the message to put my most interesting replies
first.)
| Though I'd add that when you say 'highly variable space desires', I
| go back to thinking single pool, simply because if you build a pool
| for one group and their space usage ends up being 10%, while another
| group eats up 75%, it would have been, purely from a space usage
| perspective, better to house both groups on one pool of those combined
| drives. But it is only one factor to consider.
The pool sizes themselves vary based on what the groups want (and are
willing to fund). Groups usually use most of the space that they ask for
but how much they ask for varies tremendously (probably by a factor of
at least five). For obvious reasons we try to get groups to fund more
space when they start closing in on the size of their pool.
| > We feel that using one giant pool on each fileserver in which all
| > groups resided in (quota-controlled) space would have a number
| > of serious drawbacks, both technically and administratively. For
| > example, splitting different pools on different disks creates better
| > performance isolation between groups (and yes, we've seen a group's
| > activities slow down their pool noticably).
|
| This is another one of the reasons I generally argue against this. You
| had a pool slow down noticeably due to load. Were all the other pools
| on that same box similarly slow at the same time? If not, then a
| single pool (or fewer pools) of all those disks would have been able
| to better absorb the client requests, balancing the load across the
| greater number of spindles available instead of burning up a subset of
| them while the rest remained underutilized. [...]
Whether this particular tradeoff makes sense depends on what your
priorities are. In our environment it is more important to preserve pool
and system responsiveness for other groups rather than to allow one
group to saturate every last IOP theoretically available in the overall
system. So it is an active feature that those other spindles are sitting
under-utilized and delivering fast performance to other groups in this
situation.
(We would actually like *more* separation than we currently get, but
again costs come into the picture.)
| > I certainly hope that this is considered a legitimate and supported
| > usage case for Illumos ZFS, not something that is 'use at your
| > own risk, we don't like it, if it breaks you get to keep all the
| > pieces'.
|
| I'm not sure what you mean by 'supported'. Absolutely nothing, AFAIK,
| is supported by 'illumos ZFS', in that it isn't 'use at your own
| risk'. [...]
I meant 'supported' in the open source sense, as in 'the developers
will not tell you that you are crazy and will try to make the general
usage case work'. My impression is that there are definitely usage cases
that the ZFS developers consider 'not supported' in this sense and that
they aren't devoting any particular effort to make really work (for
example, using ZFS in small or very small amounts of memory). My hope
is that 'many pools' is not such a case.
| This sounds more and more legitimate, due to environmental and
| business constraints, two factors that have to go into any
| architectural choice. You'd have to get more specific for me to be
| certain of my stance on your build - what constitutes 'number of
| disks', what is the iSCSI SAN enclosure behind it, so on (and this may
| not be the proper venue for it, I'd add).
People who are interested in reading more details about our (current)
environment can see some writeups of bits of its size and scope (and
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFileserverSetup
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurScaleII
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurCommodityFileservers
We're currently in the process of turning over the hardware and parts of
the software environment (eg moving from Solaris 10 to Illumos) and likely
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSLocalL2ARCTrick
- cks
Garrett D'Amore
2013-09-26 22:44:07 UTC
Permalink
Really if you don't know what you're doing then you should hire someone or acquire a complete and supported solution from someone who does.

Putting huge numbers of disks in a pool (over 20 or so) is IMO very often NOT the right answer. But it also often IS the right config.

The only right answer is to get an expert to help you design your system.

If you've only got a dozen or so drives then probably a single pool is what you want. Probably.

Sent from my iPhone
I'll take a look at the links when time allows, but I wanted to point out that again -- an 'exceptionally rare' comment means there are exceptions. The reason I'm pointing this out, and the reason I've argued this, is what I absolutely do not want to happen is someone new to ZFS and designing their build to find a thread where everyone seems completely OK with tons of pools on a system with no particular context as to why and think it's a common architecture. It isn't. It just isn't. I'm not saying that Chris' or anyone else's current or upcoming build that does so is /wrong/, I'm saying that /in general/, you should avoiding multiple pools in the same system, and thinking of it as a design constraint that to break you need specific reasons for doing so.. don't just opt for lots of pools because it's fun looking.
If you are knowledgeable in ZFS-fu, or are in communication with someone who is while building out your design, and multiple pools are part of it, you're probably OK as you or they know what you're getting into and you hopefully know why you're splitting it up and why it's necessary. If you're brand new and don't yet have any significant ZFS-fu, you should be taking away from this thread that by default you should be building a single data pool on the system, and only splitting it up if you've got significant and sensible reasons why, and should perhaps even solicit advice from more knowledgeable ZFS folks before committing to splitting.
I also want to add -- this doesn't mean NOT splitting is always a great idea, either, /especially/ if the split involves putting the two pools on different heads. Scale out, not up. Go with smaller pools (up to 96 disks or so is my /general/ advice) and build more actual systems to put them in, as opposed to putting 1000 disks in one pool on one box (almost never a good idea).
- Andrew
Post by Chris Siebenmann
(I'm reordering bits of the message to put my most interesting replies
first.)
| Though I'd add that when you say 'highly variable space desires', I
| go back to thinking single pool, simply because if you build a pool
| for one group and their space usage ends up being 10%, while another
| group eats up 75%, it would have been, purely from a space usage
| perspective, better to house both groups on one pool of those combined
| drives. But it is only one factor to consider.
The pool sizes themselves vary based on what the groups want (and are
willing to fund). Groups usually use most of the space that they ask for
but how much they ask for varies tremendously (probably by a factor of
at least five). For obvious reasons we try to get groups to fund more
space when they start closing in on the size of their pool.
| > We feel that using one giant pool on each fileserver in which all
| > groups resided in (quota-controlled) space would have a number
| > of serious drawbacks, both technically and administratively. For
| > example, splitting different pools on different disks creates better
| > performance isolation between groups (and yes, we've seen a group's
| > activities slow down their pool noticably).
|
| This is another one of the reasons I generally argue against this. You
| had a pool slow down noticeably due to load. Were all the other pools
| on that same box similarly slow at the same time? If not, then a
| single pool (or fewer pools) of all those disks would have been able
| to better absorb the client requests, balancing the load across the
| greater number of spindles available instead of burning up a subset of
| them while the rest remained underutilized. [...]
Whether this particular tradeoff makes sense depends on what your
priorities are. In our environment it is more important to preserve pool
and system responsiveness for other groups rather than to allow one
group to saturate every last IOP theoretically available in the overall
system. So it is an active feature that those other spindles are sitting
under-utilized and delivering fast performance to other groups in this
situation.
(We would actually like *more* separation than we currently get, but
again costs come into the picture.)
| > I certainly hope that this is considered a legitimate and supported
| > usage case for Illumos ZFS, not something that is 'use at your
| > own risk, we don't like it, if it breaks you get to keep all the
| > pieces'.
|
| I'm not sure what you mean by 'supported'. Absolutely nothing, AFAIK,
| is supported by 'illumos ZFS', in that it isn't 'use at your own
| risk'. [...]
I meant 'supported' in the open source sense, as in 'the developers
will not tell you that you are crazy and will try to make the general
usage case work'. My impression is that there are definitely usage cases
that the ZFS developers consider 'not supported' in this sense and that
they aren't devoting any particular effort to make really work (for
example, using ZFS in small or very small amounts of memory). My hope
is that 'many pools' is not such a case.
| This sounds more and more legitimate, due to environmental and
| business constraints, two factors that have to go into any
| architectural choice. You'd have to get more specific for me to be
| certain of my stance on your build - what constitutes 'number of
| disks', what is the iSCSI SAN enclosure behind it, so on (and this may
| not be the proper venue for it, I'd add).
People who are interested in reading more details about our (current)
environment can see some writeups of bits of its size and scope (and
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFileserverSetup
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurScaleII
http://utcc.utoronto.ca/~cks/space/blog/sysadmin/OurCommodityFileservers
We're currently in the process of turning over the hardware and parts of
the software environment (eg moving from Solaris 10 to Illumos) and likely
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSLocalL2ARCTrick
- cks
Darren Reed
2013-09-27 00:08:09 UTC
Permalink
Post by Garrett D'Amore
Really if you don't know what you're doing then you should hire
someone or acquire a complete and supported solution from someone who
does.
Putting huge numbers of disks in a pool (over 20 or so) is IMO very
often NOT the right answer. But it also often IS the right config.
The only right answer is to get an expert to help you design your system.
If you've only got a dozen or so drives then probably a single pool is
what you want. Probably.
Yup.

Getting storage right in an environment larger than your home server
needs care, and more care than you'll likely get from a mailing list.

As much as people would like to make it easy, the devil is always in the
details.

Darren




Bob Friesenhahn
2013-09-26 21:26:51 UTC
Permalink
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most zfs tunables are (regrettably) global, and cannot be applied on a
per-pool basis, further impacting the supportability and efficiency of multi-pool boxes. There are many reasons not to multi-pool on the same
box, and very few reasons to do so.
There are many valid reasons to do so. Some significant ones were
listed today by Chris Siebenmann.

If the storage chassis is physically/logically removable from the host
system, then this is a reason to consider it to be an independent pool.

If the storage system uses a different power system than other pools
used by the host system, then that is a reason to consider it to be an
independent pool.

If data may need to be moved to a different location, then that is a
good reason for it to not be commingled in the same zfs pool with
other data. Being able to move the data seconds or minutes vs days or
weeks is often a significant consideration.
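A minimal sketch of the contrast Bob is drawing; 'tank', 'bigpool/projectX',
and 'newhost' are hypothetical names:

  # A self-contained pool moves with its disks in minutes:
  zpool export tank
  #   (physically move or re-present the disks to the other host)
  zpool import tank

  # Data commingled in a shared pool has to be copied out instead, which
  # for many terabytes can take days:
  zfs snapshot bigpool/projectX@move
  zfs send bigpool/projectX@move | ssh newhost zfs receive tank/projectX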

If I/O performance needs to be isolated between users, then that is a
reason to consider putting it in an independent pool.

If the data constitutes the OS image, then that is a good reason for
it to be in its own pool.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
Andrew Galloway
2013-09-26 21:38:53 UTC
Permalink
All true, at least some of the time (well, except the OS image one, that's
just a given; when I speak of single pool, I mean single data pool;
rpool/syspool is what it is and if you're using ZFS for the OS, I would
agree it SHOULD be its own small pool separate from data pool in almost all
cases -- I don't consider a box with a data pool and an OS pool to BE a
multi-pool system, I ignore syspool in such terms).

But see email I just sent. You can all argue any number of convincing
reasons why you might split and have multiple pools in one box, with
justifiable logic as to why. I won't disagree that you can do that. You
certainly can. What some of you perhaps have not had to do is explain to
user after user, as I have, that their system with 20 pools, one disk in
each, isn't how you set up ZFS. Or had to explain to user after user, as I
have, that their 5 pools on the same box, one raidz2 vdev each, built with
absolutely no constraint leading to that design and now hitting all sorts
of problems because of it (like running out of space on one pool while
others sit empty, and so on), were again not the way to go about it.
Or, as an example of multiple pools being good but on the same system being
bad, explain to a user, as I have, why their system with a decent data pool
on spinning disks and another data pool made up of a large number of SSDs,
serving incredibly different use-cases and not managing to perform very
well, will now be more difficult to deal with than if they'd put them on
separate systems. You're all seemingly ZFS-knowledgeable people - please
bear in mind that people with absolutely zero ZFS knowledge are reading this
mailing list, and finding hits to it in Google searches.

By default, assume your design should have one data pool per system. Then
only modify that to 2+ pools on the same system as sensible design
constraints require. And again I'd suggest if you find yourself doing that
that you be sure it is in fact sensible, and if in doubt, ask someone who's
been there and done that a few times before. :)


On Thu, Sep 26, 2013 at 2:26 PM, Bob Friesenhahn <
Post by Andrew Galloway
To answer your subject question: in a word, yes.
It is exceptionally rare to run into a use-case that validly needs
separate pools, but wouldn't also then be better done with each pool
running on a different box. I can't recall the last time I actively
suggested such a thing to a customer. I am pretty keen on one pool per
system.
In addition to the comments so far and the things you witnessed, most zfs
tunables are (regrettably) global, and cannot be applied on a
per-pool basis, further impacting the supportability and efficiency of
multi-pool boxes. There are many reasons not to multi-pool on the same
box, and very few reasons to do so.
There are many valid reasons to do so. Some significant ones were listed
today by Chris Siebenmann.
If the storage chassis is physically/logically removable from the host
system, then this is a reason to consider it to be an independent pool.
If the storage system uses a different power system than other pools used
by the host system, then that is a reason to consider it to be an
independent pool.
If data may need to be moved to a different location, then that is a good
reason for it to not be commingled in the same zfs pool with other data.
Being able to move the data seconds or minutes vs days or weeks is often a
significant consideration.
If I/O performance needs to be isolated between users, then that is a
reason to consider putting it in an independent pool.
If the data constitutes the OS image, then that is a good reason for it to
be in its own pool.
Bob
--
Bob Friesenhahn
users/bfriesen/ <http://www.simplesystems.org/users/bfriesen/>
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/