Discussion:
[developer] New ZFS throttle and txg sync times
Bryan Cantrill
2013-12-18 01:02:09 UTC
We recently had a production machine on which provisions were failing.
Investigating the issue, ZFS dataset creation was taking many minutes --
exceeding the timeout we have in place for a successful provision (five
minutes).

The issue appears to be that we were taking a very long time to sync out
transactions (over a minute in some cases), which was in turn due to the fact
that our I/O rate was being significantly throttled by the limit returned
from vdev_queue_max_async_writes(). An important qualifier: this is one of
our older systems, so it has hardware RAID (no comment!) -- and therefore a
single vdev. Further, because it's multi-tenant, it would be quite unusual to
have dirty data that exceeds zfs_vdev_async_write_active_min_dirty_percent of
the zfs_dirty_data_max_percent of DRAM. (Given the defaults, any machine with
more than 40G of DRAM will end up hitting the zfs_dirty_data_max_max cap of
4G -- resulting in a dirty minimum of 1.2G before vdev_queue_max_async_writes()
starts to increment the maximum number of writes from the default minimum
maximum of 1.) And indeed, what we observed on the machine was that the rate
of asynchronous writes is enough to result in extraordinarily long
transactions, but not so much as to dirty enough memory to increase the cap on
the number of async write operations per vdev. The attached graphs (also
available at https://us-east.manta.joyent.com/bcantrill/public/OS-2659.pdf)
show this. In the first, you can see dirty megabytes versus transaction sync
time over a period of about two hours; in the second, you can see the maximum
async writes responding to the amount of dirty data as designed (and the
amount of dirty data falling as a result) -- but it's responding too late.
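
For reference, the logic in question looks roughly like the following -- a
paraphrase rather than the exact illumos source, but it captures the shape of
the interpolation and the tunables involved:

#include <sys/types.h>

extern uint64_t zfs_dirty_data_max;            /* 10% of DRAM, capped at 4G */
extern int zfs_vdev_async_write_min_active;    /* default 1 */
extern int zfs_vdev_async_write_max_active;    /* default 10 */
extern int zfs_vdev_async_write_active_min_dirty_percent;  /* default 30 */
extern int zfs_vdev_async_write_active_max_dirty_percent;  /* default 60 */

/*
 * Rough paraphrase of vdev_queue_max_async_writes(): below the "min dirty"
 * threshold the scheduler issues only the minimum number of concurrent
 * async writes; above the "max dirty" threshold it issues the maximum; in
 * between it interpolates linearly.
 */
static int
vdev_queue_max_async_writes(uint64_t dirty)
{
        uint64_t min_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_min_dirty_percent / 100;
        uint64_t max_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_max_dirty_percent / 100;

        if (dirty < min_bytes)
                return (zfs_vdev_async_write_min_active);
        if (dirty > max_bytes)
                return (zfs_vdev_async_write_max_active);

        /* linear interpolation between the two limits */
        return (zfs_vdev_async_write_min_active +
            (int)((dirty - min_bytes) *
            (zfs_vdev_async_write_max_active -
            zfs_vdev_async_write_min_active) /
            (max_bytes - min_bytes)));
}

With zfs_dirty_data_max capped at 4G, nothing above the minimum of 1 kicks in
until dirty data crosses 30% of 4G = 1.2G -- which is the band you can see in
the first graph.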

So, a couple of thoughts. First, it seems to me that a missing input with
respect to the throttle is transaction time: because operations like ZFS
dataset creation actually block on transactions going out, it seems that we
want transaction time to be part of the feedback loop for the throttle. That
is, in addition to looking at the amount of dirty data (and using that to
throttle up our outstanding I/Os), it seems we should also have a "target"
transaction time (5 seconds? 10 seconds?) such that as transaction times
exceed our target time, we start preferring throughput over latency -- and that
we increase that preference as/if transaction times continue to grow. This
would (I think) largely solve the (admittedly pathological) single vdev case
without forcing those that have single vdev pools to pick different tunables
(which, if it needs to be said, is a galactic pain in the ass for anyone like
us who has thousands of heterogeneous machines).
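
To make that concrete, here is one possible shape for it -- purely a sketch of
the idea, not existing code; zfs_txg_target_sync_ms and the txg-age plumbing
are invented names for illustration:

#include <sys/types.h>

extern int zfs_vdev_async_write_max_active;    /* default 10 */

/* hypothetical tunable: target txg sync time, in milliseconds */
int zfs_txg_target_sync_ms = 5000;

/*
 * Sketch: take the limit the dirty-data interpolation produced ("writes")
 * and ramp it toward the maximum as the syncing txg's age exceeds the
 * target, reaching the maximum once the txg is 4x older than the target.
 * txg_age_ms would come from timestamping the start of the sync.
 */
static int
vdev_queue_adjust_for_txg_age(int writes, uint64_t txg_age_ms)
{
        uint64_t over, target = zfs_txg_target_sync_ms;
        int span, bump;

        if (txg_age_ms <= target)
                return (writes);

        over = txg_age_ms - target;
        span = zfs_vdev_async_write_max_active - writes;
        bump = (int)(span * over / (3 * target));

        return (writes + (bump > span ? span : bump));
}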

Second, it seems that the minimum maximum on asynchronous writes per vdev is
conservatively low (namely, 1). I appreciate that the single vdev case is an
outlier, but what do folks think about having this number be slightly larger
-- like 3? This would largely solve our problem (it's a 3X increase in
bandwidth -- very significant) without introducing overly pathological
latency.

Finally, it seems that we might be being too conservative with the default
value of zfs_vdev_async_write_active_min_dirty_percent (namely, 30%); it seems
to me that this should perhaps be kicking in with dirty data less than 1.2G,
and it seems to me that it should be kicking in harder and sticking (i.e., it
needs potentially non-linear response and hysteresis -- you can see clear
porpoising in the second graph). That said, it's clear that the mechanism
_is_ working -- in the first graph, you can clearly see a band around that
1.2G threshold -- I just wonder if it should be turning on at lower levels
of dirty data.
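
To illustrate the hysteresis part of that, something like the following is
what I have in mind -- illustrative only, with the threshold names and values
invented:

#include <sys/types.h>

extern uint64_t zfs_dirty_data_max;

/* invented thresholds: engage at one level, release at a lower one */
static int vdev_async_engage_pct = 20;
static int vdev_async_release_pct = 10;

/*
 * Sketch of hysteresis on the dirty-data trigger: once the throttle has
 * opened up ("engaged"), keep it open until dirty data falls well below
 * the level that opened it, rather than flapping (porpoising) around a
 * single threshold.
 */
static boolean_t
vdev_async_boost_engaged(uint64_t dirty, boolean_t currently_engaged)
{
        uint64_t engage = zfs_dirty_data_max * vdev_async_engage_pct / 100;
        uint64_t release = zfs_dirty_data_max * vdev_async_release_pct / 100;

        if (currently_engaged)
                return (dirty > release);       /* stick until well below */
        return (dirty > engage);                /* engage only above the higher bar */
}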

Matt, Adam, Eric, others: thoughts here? I'm curious in particular as to the
modelling behind the current values (namely, minimum of 1, max of 10, min
dirty percent of 30); these values were presumably not randomly selected, and
I would particularly like to understand how changing some of them (in
particular, increasing the zfs_vdev_async_write_min_active from 1 to (say) 3
and/or decreasing the zfs_vdev_async_write_active_min_dirty_percent from 30 to
(say) 5) would affect that modelling. I do believe that factoring in
transaction time is ultimately the right approach here, but changing the
tunables seems to be a much quicker fix -- and one that will largely solve the
single vdev async write problem (at least in this manifestation).

- Bryan



Adam Leventhal
2013-12-18 04:56:17 UTC
Hey Bryan,

Let me first address the issue of long administrative tasks
(synctasks) -- zfs create in your case. Longer transactions mean longer
potential delays for these operations. When developing the new IO
scheduler, Matt and I discussed using the presence of a pending synctask
as part of the vdev_queue_max_async_writes() algorithm, which would
effectively push it to the max. This would at least ameliorate the
situation you're seeing: a pile of asynchronous work being slowly
addressed while synchronous work (the zfs create) is blocked.
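
To sketch what that adjustment would look like (not existing code -- the
spa_has_pending_synctask() predicate and the helper it falls back to are
assumed names for illustration):

#include <sys/types.h>

struct spa;                                     /* opaque here; spa_t in the real code */
extern int zfs_vdev_async_write_max_active;     /* default 10 */

/* assumed helpers for this sketch; neither exists today */
extern int spa_has_pending_synctask(struct spa *);
extern int vdev_queue_async_writes_from_dirty(uint64_t);

/*
 * Sketch of the adjustment: if an administrative synctask (e.g. the
 * zfs create) is waiting on the syncing txg, stop metering async writes
 * and go straight to the maximum so the txg finishes as quickly as
 * possible; otherwise fall back to the usual dirty-data interpolation.
 */
static int
vdev_queue_max_async_writes(struct spa *spa, uint64_t dirty)
{
        if (spa_has_pending_synctask(spa))
                return (zfs_vdev_async_write_max_active);

        return (vdev_queue_async_writes_from_dirty(dirty));
}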

I agree that time could also factor into the calculation in
vdev_queue_max_async_writes() -- increase the number of concurrent IOs
as a txg gets older. I think it's worth trying the adjustment above to
see if that solves the problem in practice first, though another
benefit of a time-based "urgency" would be to put a soft limit on the
amount of async data lost in an outage (panic, power loss, etc.).


In terms of specific tunables, we took a swing based on the limited
data we had, and we tried to use numbers that replicated behavior as
closely as possible to the previous write throttle while still
delivering consistent latency and a consistent IO workload on the
backend. Once we -- Delphix and the entire ZFS community -- had more
data, we'd figure out better values.

zfs_vdev_async_write_min_active - we chose 1 as the default because it
was the lowest feasible limit; in practice it has worked for our
customers who tend to have a fast SAN on the backend.

zfs_vdev_async_write_max_active - a default of 10 to roughly match the old
write throttle.

zfs_vdev_async_write_active_min_dirty_percent - 30% was a little more
finger-in-the-wind, but roughly 3 active txgs with 10% left over for
slop (i.e. not deep reasoning).

I'm supportive of changing the defaults in illumos; however, we at
Delphix will stay with the current defaults until we have more
feedback from customer systems. We don't want to make changes until we
have more data -- things are apparently working well for our
customers now that we've fixed the most acute problems with the old
write throttle.


Bryan, on your system I'd suggest tuning zfs_dirty_data_max down much
lower. You could also turn
zfs_vdev_async_write_active_min_dirty_percent way down. We wanted to
give the system a chance to amortize metadata -- not writing the same
metadata in every txg -- so let some data build up; but 30% sounds high
for your system, and may be high in general. Also note that these
variables were designed to be reasonably tunable independently; for
example, changing zfs_vdev_async_write_min_active to 3 wouldn't
require any changes to other tunables.


Matt and I will look into increasing IO throughput with an outstanding
sync task (and try to remember why we didn't include that). I've also
been assembling a multi-part blog post on the old write throttle, the
new IO scheduler, and how to turn the various (hopefully more
comprehensible) knobs.

You mentioned zfs_vdev_async_write_active_min_dirty_percent as a value
that could benefit from dynamic tuning; zfs_dirty_data_max is another.
We intentionally eliminated any statistical values or computed
hysteresis from the initial cut. We wanted to keep it as simple as
possible, and learn from manual tuning before we tried to apply some
automation.

Adam
--
Adam Leventhal
CTO, Delphix
http://blog.delphix.com/ahl


Matthew Ahrens
2013-12-18 18:29:41 UTC
Post by Bryan Cantrill
Second, it seems that the minimum maximum on asynchronous writes per vdev is
conservatively low (namely, 1). I appreciate that the single vdev case is an
outlier, but what do folks think about having this number be slightly larger
-- like 3? This would largely solve our problem (it's a 3X increase in
bandwidth -- very significant) without introducing overly pathological
latency.
For systems where there is one rotating disk behind each vdev, increasing
zfs_vdev_async_write_min_active above 1 will have an impact on synchronous
reads, which will potentially have to wait for e.g. 3 async writes before
the hardware services them, vs. 1. This increases the latency and
variability of sync reads unnecessarily, and does not have a dramatic
impact on write performance (in this case of 1 disk per vdev).

Given that you want to have one set of tunables for a variety of hardware
configurations, you will need to make the trade off of whether to increase
zfs_vdev_async_write_min_active, but I could see it making sense for you to
increase it to 2 or 3. I think that the defaults in illumos should provide
good behavior for the unsophisticated user who is not going to go to the
trouble of doing a thorough performance investigation like you have.
Typically such unsophisticated users are in the single disk per vdev
scenario, so I'd like to be a little cautious about changing the default.

I think this is an area that could benefit from some automatic tuning as
well. The system could measure performance at varying queue depths and see
how that impacts throughput and latency. This could be done when the vdev
is added/reconfigured, or dynamically as the system is running. If
throughput is nearly linear with queue depth, and latency is nearly
unchanged, then we should automatically increase *_min_active.
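
As a strawman for the decision rule (all names and thresholds here are
invented, just to make the idea concrete):

#include <sys/types.h>

typedef struct vq_probe {
        uint64_t        vqp_bytes_per_sec;      /* measured throughput */
        uint64_t        vqp_avg_latency_us;     /* measured average latency */
} vq_probe_t;

/*
 * Strawman: compare measurements taken at queue depth N ("lo") and
 * 2N ("hi").  If throughput scaled to at least 80% of the ideal 2x while
 * average latency grew by less than 20%, the deeper queue looks close to
 * free and *_min_active could be raised.
 */
static boolean_t
vq_should_raise_min_active(const vq_probe_t *lo, const vq_probe_t *hi)
{
        boolean_t tput_nearly_linear =
            hi->vqp_bytes_per_sec * 10 >= lo->vqp_bytes_per_sec * 18;
        boolean_t latency_nearly_flat =
            hi->vqp_avg_latency_us * 10 <= lo->vqp_avg_latency_us * 12;

        return (tput_nearly_linear && latency_nearly_flat);
}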

--matt



Steven Hartland
2013-12-18 21:55:53 UTC
Post by Matthew Ahrens
For systems where there is one rotating disk behind each vdev, increasing
zfs_vdev_async_write_min_active above 1 will have an impact on synchronous
reads, which will potentially have to wait for e.g. 3 async writes before
the hardware services them, vs. 1. This increases the latency and
variability of sync reads unnecessarily, and does not have a dramatic
impact on write performance (in this case of 1 disk per vdev).
Given that you want to have one set of tunables for a variety of hardware
configurations, you will need to make the trade off of whether to increase
zfs_vdev_async_write_min_active, but I could see it making sense for you to
increase it to 2 or 3. I think that the defaults in illumos should provide
good behavior for the unsophisticated user who is not going to go to the
trouble of doing a thorough performance investigation like you have.
Typically such unsophisticated users are in the single disk per vdev
scenario, so I'd like to be a little cautious about changing the default.
I think this is an area that could benefit from some automatic tuning as
well. The system could measure performance at varying queue depths and see
how that impacts throughput and latency. This could be done when the vdev
is added/reconfigured, or dynamically as the system is running. If
throughput is nearly linear with queue depth, and latency is nearly
unchanged, then we should automatically increase *_min_active.
With this assumption, are you ignoring the benefit of HW-level queuing and
reordering? If so, and queued requests are limited at the ZFS layer, this
could easily result in significantly reduced performance in all
configurations, particularly those involving rotating media.

Regards
Steve

Matthew Ahrens
2013-12-18 22:58:56 UTC
Post by Steven Hartland
With this assumption, are you ignoring the benefit of HW-level queuing and
reordering? If so, and queued requests are limited at the ZFS layer, this
could easily result in significantly reduced performance in all
configurations, particularly those involving rotating media.
By changing the number of outstanding async writes, we are dynamically
trading off between low latency for synchronous operations and high
throughput for async writes. How exactly do you propose we improve on that?

Also note that when each vdev is a single rotating spindle, there isn't
much of a trade off. We are already feeding it i/os in LBA order, so
increasing # outstanding i/os doesn't provide much increased throughput
anyway.

--matt



Garrett D'Amore
2013-12-19 01:15:27 UTC
Turns out that disksort based on LBAs doesn't really work the way you think
it does -- even with spinning disks. LBAs are not necessarily located in
convenient sequential order. Drive firmware will remap the drive using
algorithms that are proprietary and not known by mere mortals. We hope
that sequential accesses to the LBAs will mostly be sequential accesses on
the hardware, but it's not guaranteed.

For non-spinning disks, or disks in a hardware array, you really don't want
to use disksort at all.
Post by Matthew Ahrens
Also note that when each vdev is a single rotating spindle, there isn't
much of a trade off. We are already feeding it i/os in LBA order, so
increasing # outstanding i/os doesn't provide much increased throughput
anyway.

Steven Hartland
2013-12-19 01:30:33 UTC
Indeed Garrett, I've experienced first hand that sorting I/Os can seriously
hurt performance on media which didn't need it, purely due to the
sort overhead. This is why FreeBSD's CAM layer now avoids BIO sorting
by default if it detects non-rotating media.

Post by Garrett D'Amore
Turns out that disksort based on LBAs doesn't really work the way you think
it does -- even with spinning disks. LBAs are not necessarily located in
convenient sequential order. Drive firmware will remap the drive using
algorithms that are proprietary and not known by mere mortals. We hope
that sequential accesses to the LBAs will mostly be sequential accesses on
the hardware, but it's not guaranteed.
For non-spinning disks, or disks in a hardware array, you really don't want
to use disksort at all.

Garrett D'Amore
2013-12-19 16:02:13 UTC
I had planned a simple change to do the same -- using the same SSD check
that we just integrated into illumos. I don't remember if that change ever
got implemented outside of my private workspace. I can check into it.
(It's a tiny piece of code to add.)
Post by Steven Hartland
Indeed Garrett, I've experienced first hand that sorting I/Os can seriously
hurt performance on media which didn't need it, purely due to the
sort overhead. This is why FreeBSD's CAM layer now avoids BIO sorting
by default if it detects non-rotating media.

Andrew Gabriel
2013-12-22 14:05:06 UTC
I played with disk sorting in the past on spinning rust (although not on
Solaris).

Some things I observed...

Don't start sorting at all whilst your queue size is less than the
disk's internal queue size, as the disk may do it better, but no worse
than you.

I used a 2-way elevator sort, and I'm skeptical of the justification for
not doing this in sd.c. However, for the backward sweep, you don't want
to hit the disk with completely inverted block order - you want to
divide the disk into about 100 ranges of blocks and only invert the
order between ranges and not within ranges. (Back when we knew where
cylinder boundaries were, you'd divide the disk on cylinder boundaries,
but that's no longer possible.)
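
To make the backward-sweep ordering concrete (illustrative only; the range
count is the ~100 mentioned above, and the names are mine):

#include <sys/types.h>

#define VQ_NRANGES      100     /* ~100 LBA ranges, as described above */

/* which range a block falls into, given the device size in blocks */
static int
vq_range_of(uint64_t lba, uint64_t disk_blocks)
{
        uint64_t range_size = disk_blocks / VQ_NRANGES + 1;

        return ((int)(lba / range_size));
}

/*
 * Comparator for the backward sweep: visit ranges in descending order,
 * but keep ascending LBA order within each range, so the disk never sees
 * a completely inverted block sequence.
 */
static int
vq_backward_sweep_cmp(uint64_t a, uint64_t b, uint64_t disk_blocks)
{
        int ra = vq_range_of(a, disk_blocks);
        int rb = vq_range_of(b, disk_blocks);

        if (ra != rb)
                return (ra > rb ? -1 : 1);      /* descending by range */
        if (a != b)
                return (a < b ? -1 : 1);        /* ascending within a range */
        return (0);
}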

The current sd.c code adds new requests into the current sweep, and that
can have some pathologically bad outliers. If I issue a request to read
a block at the end of the disk, and someone else is reading sequentially
from the beginning of the disk, my request can wait one hell of a long
time, because all their requests are being inserted ahead of mine. One
way around this is to say don't insert new requests into the current
sweep. You can choose to do this selectively, e.g. sync requests go into
current sweep, but async ones don't, and/or stop putting requests in
current sweep when oldest request reaches a certain age, etc. This may
be a tradeoff between throughput and outliers (although I didn't see
throughput change).

Having done this, I was achieving continuously pretty much the bit rate
under the heads, with nothing measurable lost in seek time. Disk
dynamics have changed a bit since I did this (much bigger disks could
make the outlier problem much worse), bigger disk caches, and ZFS's use
of disks is significantly different from most other filesystems', but at
least here's some food for thought.
--
Andrew Gabriel


Richard Elling
2013-12-23 17:58:54 UTC
Post by Andrew Gabriel
I played with disk sorting in the past on spinning rust (although not on
Solaris). Some things I observed...
Don't start sorting at all whilst your queue size is less than the disk's
internal queue size, as the disk may do it better, but no worse than you.
Anecdotally, with the newer disks (HDDs) having 128MB of cache, the
behaviour we’re seeing suggests they have also significantly improved
their algorithms.

IMHO, it is time to turn off sd disksort by default.
— richard
Garrett D'Amore
2013-12-24 16:51:03 UTC
I tend to agree... disksort is possibly destructive on these disks. I'd be
happy if we made this default dependent on the bus. Older parallel SCSI
(does anyone have them?) or IDE (not SATA) drives could use disksort, but
anything on a more modern bus probably has a deep enough internal queue to
benefit.

HOWEVER, note that whether disksort can meaningfully do anything (either on
the disk or in the driver) is going to rely upon its getting enough inbound
requests, i.e. the value of whatever tunable has replaced
zfs_vdev_maxpending.

The other thing to check: is it possible with any of these devices that the
on-device disksort can lead to very high latencies or even starvation? Old
sd.c's disksort had this problem, and reducing zfs_vdev_maxpending was
really a measure taken to mitigate this behavior, IMO.
Post by Richard Elling
Anecdotally, with the newer disks (HDDs) having 128MB of cache, the
behaviour we’re seeing suggests they have also significantly improved
their algorithms.
IMHO, it is time to turn off sd disksort by default.
