Bryan Cantrill
2013-12-18 01:02:09 UTC
We recently had a production machine on which provisions were failing.
Investigating the issue, we found that ZFS dataset creation was taking
many minutes -- exceeding the timeout we have in place for a successful
provision (five minutes).
The issue appears to be that we were taking a very long time to sync out
transactions (over a minute in some cases), which was in turn because our
I/O was being significantly throttled by the maximum number of outstanding
async writes returned from vdev_queue_max_async_writes() (sketched below).
An important qualifier: this is one of our older systems, so it has
hardware RAID (no comment!) -- and therefore a single vdev. Further,
because it's multi-tenant, it would be quite unusual to have dirty data
that exceeds zfs_vdev_async_write_active_min_dirty_percent of the
zfs_dirty_data_max_percent of DRAM. (Given the defaults -- in particular,
zfs_dirty_data_max_percent of 10% -- any machine with more than 40G of
DRAM will end up hitting the zfs_dirty_data_max_max cap of 4G, resulting
in a dirty minimum of 1.2G before vdev_queue_max_async_writes() starts to
increment the maximum number of writes from the default minimum maximum
of 1.) And indeed, what we observed on the machine was that the rate of
asynchronous writes was enough to result in extraordinarily long
transactions, but not so much as to dirty enough memory to increase the
cap on the number of async write operations per vdev. The attached graphs
(also available at
https://us-east.manta.joyent.com/bcantrill/public/OS-2659.pdf) show this.
In the first, you can see dirty megabytes versus transaction sync time
over a period of about two hours; in the second, you can see the maximum
async writes responding to the amount of dirty data as designed (and the
amount of dirty data falling as a result) -- but it's responding too late.
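For reference, the logic in question looks roughly like this (a paraphrase
of vdev_queue_max_async_writes() from vdev_queue.c -- the shape of it, not
the verbatim source):

    static int
    vdev_queue_max_async_writes(uint64_t dirty)
    {
            uint64_t min_bytes = zfs_dirty_data_max *
                zfs_vdev_async_write_active_min_dirty_percent / 100;
            uint64_t max_bytes = zfs_dirty_data_max *
                zfs_vdev_async_write_active_max_dirty_percent / 100;

            if (dirty < min_bytes)
                    return (zfs_vdev_async_write_min_active);   /* 1 */
            if (dirty > max_bytes)
                    return (zfs_vdev_async_write_max_active);   /* 10 */

            /*
             * Linearly interpolate between the min and the max as
             * dirty data climbs from min_bytes to max_bytes.
             */
            return ((dirty - min_bytes) *
                (zfs_vdev_async_write_max_active -
                zfs_vdev_async_write_min_active) /
                (max_bytes - min_bytes) +
                zfs_vdev_async_write_min_active);
    }

On this box zfs_dirty_data_max is pinned at the 4G cap, so min_bytes works
out to 30% of 4G = 1.2G: below 1.2G of dirty data we get exactly one
outstanding async write per vdev, no matter how long transactions take.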
So, a couple of thoughts. First, it seems to me that a missing input with
respect to the throttle is transaction time: because operations like ZFS
dataset creation actually block on transactions going out, it seems that we
want transaction time to be part of the feedback loop for the throttle. That
is, in addition to looking at the amount of dirty data (and using that to
throttle up our outstanding I/Os), it seems we should also have a "target"
transaction time (5 seconds? 10 seconds?) such that as transaction times
exceed our target time, we start preferring throughput over latency -- and that
we increase that preference as/if transaction times continue to grow. This
would (I think) largely solve the (admittedly pathological) single vdev case
without forcing those that have single vdev pools to pick different tunables
(which, if it needs to be said, is a galactic pain in the ass for anyone like
us who has thousands of heterogeneous machines).
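To make that concrete, a strawman (the names zfs_txg_target_ms and
last_txg_sync_ms -- and the arithmetic -- are invented purely for
illustration; this is the shape of the feedback, not a patch):

    /*
     * Hypothetical: raise the effective floor on outstanding async
     * writes as the last txg sync time exceeds a target, preferring
     * throughput over latency the further past the target we get.
     */
    int floor = zfs_vdev_async_write_min_active;

    if (last_txg_sync_ms > zfs_txg_target_ms) {
            floor += (int)(last_txg_sync_ms / zfs_txg_target_ms);
            floor = MIN(floor, zfs_vdev_async_write_max_active);
    }

    max_writes = MAX(max_writes, floor);

That way the dirty-data interpolation still governs in the common case,
but a pool that can't keep up with the transaction rate gets pushed toward
throughput even when dirty data never crosses the 1.2G threshold.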
Second, it seems that the minimum maximum on asynchronous writes per vdev is
conservatively low (namely, 1). I appreciate that the single vdev case is an
outlier, but what do folks think about having this number be slightly larger
-- like 3? This would largely solve our problem (it's a 3X increase in
bandwidth -- very significant) without introducing overly pathological
latency.
Finally, it seems that we might be too conservative with the default value
of zfs_vdev_async_write_active_min_dirty_percent (namely, 30%): it seems to
me that this should perhaps kick in with dirty data less than 1.2G, and
that it should kick in harder and stick (i.e., it
needs potentially non-linear response and hysteresis -- you can see clear
porpoising in the second graph). That said, it's clear that the mechanism
_is_ working -- in the first graph, you can clearly see a band around that
1.2G threshold -- I just wonder if it should be turning on at lower levels
of dirty data.
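By "hysteresis" I mean something along these lines (entirely hypothetical
-- throttle_open is an invented state variable and the half-threshold is
arbitrary): once the throttle has opened up, don't drop back to the floor
until dirty data has fallen well below the threshold that opened it:

    if (dirty >= min_bytes)
            throttle_open = B_TRUE;     /* open at the threshold... */
    else if (dirty < min_bytes / 2)
            throttle_open = B_FALSE;    /* ...don't close until well below it */

    if (!throttle_open)
            return (zfs_vdev_async_write_min_active);

Something like that would avoid the oscillation around the threshold that
the second graph shows.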
Matt, Adam, Eric, others: thoughts here? I'm curious in particular as to the
modelling behind the current values (namely, minimum of 1, max of 10, min
dirty percent of 30); these values were presumably not randomly selected, and
I would particularly like to understand how changing some of them (in
particular, increasing the zfs_vdev_async_write_min_active from 1 to (say) 3
and/or decreasing the zfs_vdev_async_write_active_min_dirty_percent from 30 to
(say) 5) would affect that modelling. I do believe that factoring in
transaction time is ultimately the right approach here, but changing the
tunables seems to be a much quicker fix -- and one that will largely solve the
single vdev async write problem (at least in this manifestation).
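For anyone who wants to experiment in the meantime, the obvious knobs
(assuming the stock illumos tunable mechanisms; the values are just the
ones floated above). Persistently, in /etc/system:

    set zfs:zfs_vdev_async_write_min_active = 3
    set zfs:zfs_vdev_async_write_active_min_dirty_percent = 5

Or on a live system via mdb:

    echo "zfs_vdev_async_write_min_active/W 0t3" | mdb -kw
    echo "zfs_vdev_async_write_active_min_dirty_percent/W 0t5" | mdb -kw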
- Bryan