Discussion:
On dirty write buffers and interaction with UMA
Karl Denninger via illumos-zfs
2014-09-26 17:45:18 UTC
Permalink
Postulate for the day: The dirty write buffer sizing and handling in
ZFS is poorly executed and, in some circumstances, causes pathological
behavior by the underlying operating system, particularly when UMA is in
use.

ZFS "out of the box" with the rebuilt I/O scheduler (as described here:
http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/) permits up to
10% of system RAM to be used for dirty write buffer caching on a
per-pool basis, with a 4GB maximum.
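That default sizing rule can be sketched as follows. This is a paraphrase of the tunable's documented behavior, not the actual illumos source; the function name and constant are illustrative:

```c
#include <stdint.h>

/* Illustrative sketch of the default dirty_data_max sizing rule:
 * 10% of physical RAM, capped at 4 GB.  Not the actual illumos code. */
#define DIRTY_DATA_MAX_CAP (4ULL << 30)   /* 4 GB hard cap */

uint64_t
default_dirty_data_max(uint64_t physmem_bytes)
{
	uint64_t limit = physmem_bytes / 10;  /* 10% of system RAM */

	if (limit > DIRTY_DATA_MAX_CAP)
		limit = DIRTY_DATA_MAX_CAP;
	return (limit);
}
```

On a 16GB machine this yields about 1.6GB; any machine with more than 40GB of RAM hits the 4GB cap.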

Unfortunately the correct sizing of a dirty write buffer pool has little
to do with system memory size other than imposing a maximum reasonable
limit on what you're willing to allow to be consumed for this purpose.
Rather, that size is more-appropriately determined by I/O subsystem
performance.

Since I/O subsystem performance may vary materially by pool (and
potentially even by vdev, though we hope not), a "one size fits all"
design is inappropriate. For example, you may have older spinning rust
on your system in a mirrored configuration that can ping-pong reads and
is thus reasonably fast, but is much slower for writes (because all
mirrored copies must be updated), or you may have a raidz2 pool that
must update both parity stripes in addition to the data itself. At the
same time there may be an SSD pool on the same machine that is
blindingly fast by comparison -- perhaps 5-10x as fast or even more.

During times when the machine is overcommitted on writes (that is, it
cannot drain them as fast as you generate them), there is no benefit and
a fair bit of cost to having a very large write buffer for the spinning
rust, yet a large buffer is essential for having at least one full
buffer set "ready" when the SSDs can accept a new transfer. In addition
there is the economy of write-grouping, where contiguous writes take
place and can thus avoid, on spinning drives, rotational latency that
would otherwise be incurred.
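One way to tie a pool's dirty limit to its drain rate could look like the following. This is entirely hypothetical: the throughput-sampling input, the target-seconds constant, and the floor are assumptions, not existing tunables:

```c
#include <stdint.h>

/* Hypothetical per-pool dirty-data sizing: allow a few seconds' worth
 * of the pool's measured write throughput to sit dirty, clamped by a
 * floor and by whatever remains of a system-wide ceiling.  All names
 * and constants here are illustrative assumptions. */
#define DIRTY_TARGET_SECONDS	5		/* seconds of buffered writes */
#define DIRTY_FLOOR		(64ULL << 20)	/* never below 64 MB */

uint64_t
pool_dirty_data_max(uint64_t measured_write_bps, uint64_t global_remaining)
{
	uint64_t want = measured_write_bps * DIRTY_TARGET_SECONDS;

	if (want < DIRTY_FLOOR)
		want = DIRTY_FLOOR;
	if (want > global_remaining)	/* respect the system-wide limit */
		want = global_remaining;
	return (want);
}
```

Under this scheme a degraded pool draining at 50MB/s would get a 250MB dirty limit, while an SSD pool draining 1GB/s could claim several gigabytes, subject to the global ceiling.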

This would only result in higher-than-reasonable latency in the
circumstance where the buffer size is inappropriately large, were it not
for the use of UMA. The UMA allocator, once it grabs physical memory,
will "hold" freed allocations in the expectation that you will ask for a
block of the same size again. This is good provided that your demand
for allocations of size "X" is reasonably constant.

It is very bad if your allocation pattern, say for dirty txg buffers,
looks like this:

1. Ask for 10,000 64KB txg buffers, use them and then release them.
2. Ask for 10,000 62KB txg buffers, use them and then release them.
3. Ask for 10,000 48KB txg buffers, use them and then release them.

In that case there are 30,000 blocks, 10,000 each of 64, 62 and 48KB,
sitting allocated yet unused and unavailable for any other purpose. ZFS
sees this as memory pressure but has no way to know whether it is real,
since whether a new request will actually allocate fresh RAM or reuse an
old block depends on the size distribution of the old ones, and that
knowledge is opaque to ZFS' memory routines. If you ask for a 32KB txg
buffer you will force another allocation from the system, even though
there are 30,000 buffers already allocated and sitting idle, simply
because the new request is of the wrong size.
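The effect can be demonstrated with a toy model of a per-size free-list cache. This is a deliberate simplification of UMA's zone behavior, counting bytes rather than managing real memory; none of it is actual kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a per-size allocator cache: freed blocks are kept on a
 * free list keyed by their size, so a request for a size with no
 * cached blocks forces a fresh allocation from the OS even while
 * blocks of other sizes sit idle.  Counts bytes only; illustrative. */
#define NSIZES 8

struct size_cache {
	size_t   size[NSIZES];    /* block size served by this slot */
	unsigned cached[NSIZES];  /* freed blocks held for reuse */
	uint64_t footprint;       /* total bytes ever taken from the OS */
};

static int
slot_for(struct size_cache *c, size_t sz)
{
	for (int i = 0; i < NSIZES; i++) {
		if (c->size[i] == sz)
			return (i);
		if (c->size[i] == 0) {	/* claim an empty slot */
			c->size[i] = sz;
			return (i);
		}
	}
	return (-1);
}

void
cache_alloc(struct size_cache *c, size_t sz, unsigned n)
{
	int i = slot_for(c, sz);

	assert(i >= 0);
	while (n-- > 0) {
		if (c->cached[i] > 0)
			c->cached[i]--;		/* reuse a cached block */
		else
			c->footprint += sz;	/* must grow the footprint */
	}
}

void
cache_free(struct size_cache *c, size_t sz, unsigned n)
{
	int i = slot_for(c, sz);

	assert(i >= 0);
	c->cached[i] += n;	/* held for reuse, not returned to the OS */
}
```

Running the three-step pattern above through this model leaves all 30,000 blocks cached, yet a single subsequent 32KB request still grows the footprint, because none of the idle blocks are the right size.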

The consequence is that, in the face of variable-txg-size demand for
dirty write buffers that materially exceeds the I/O system's capacity
to drain them, the UMA system will potentially allocate RAM for a
*multiple* of the dirty_data_max number of buffers, with the multiple
defined by the number of different buffer sizes requested. While the
ARC maintenance thread does clear these unused allocations under heavy
memory pressure (where aggressive reclaim is initiated), that is a
sledgehammer approach: it results in "breathing" of the system memory
state and ARC cache that is both unnecessary and contributes to
undesirable performance characteristics as a whole.

This can, and under some circumstances does, result in severe pathology
in system operation when the I/O channel becomes overcommitted, and the
current code is incapable of adjusting to this condition despite the I/O
system having a nominal and expected means of doing so. While I have
managed to code up a means of detecting the pathology on an incipient
basis and preventing it from causing system "pauses" and other bad
behavior (such as unwarranted swapping that can and does block zio_
threads), that is a mitigation strategy rather than a true fix.

It would appear that the following changes should be considered:

1. Modify the code so that the dirty_data_max buffer size maximum is
computed on a per-pool basis, predicated on dynamic I/O performance,
with a system-wide limit on the RAM that can be committed to this use.
In this fashion a fast SSD pool will receive a greater (and usable)
allocation, while a pool of slower spinning rust will receive an
allocation large enough to maximize performance but not one so excessive
that it only increases latency while pressuring system RAM. In
addition, a degraded pool, which has materially lower performance, will
have its dirty data maximum reduced automatically for the duration of
the resilver operation.

2. Remove the dirty data buffer allocation and management code from the
common ARC memory mechanism into its own management routine that
allocates up to a space of "X" (the configured maximum) on demand and
manages it internally to ZFS, treating the entire aggregate as a
collective pool of RAM from which allocations are drawn and to which
they are released, predicated on the pools' current I/O performance
levels. This prevents UMA from multiplying the allocation by the number
of different write txg sizes that are queued.
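The second suggestion could be sketched as a single byte-budgeted pool from which variable-size txg buffers are carved, so the resident footprint is bounded by the budget rather than multiplied by the number of distinct sizes. This is illustrative only; malloc stands in for whatever backing allocator the real code would use, and all names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical shared dirty-buffer pool: all txg buffers, whatever
 * their size, draw from one byte budget.  Freed bytes return to the
 * budget immediately, so mixed sizes cannot multiply the footprint. */
struct dirty_pool {
	uint64_t budget;	/* maximum bytes resident at once */
	uint64_t in_use;	/* bytes currently handed out */
};

void *
dirty_alloc(struct dirty_pool *p, size_t sz)
{
	if (p->in_use + sz > p->budget)
		return (NULL);	/* over budget: caller must throttle writers */
	p->in_use += sz;
	return (malloc(sz));
}

void
dirty_free(struct dirty_pool *p, void *buf, size_t sz)
{
	free(buf);
	p->in_use -= sz;	/* bytes immediately reusable at any size */
}
```

With this shape, a 32KB request made while 64KB buffers are in flight competes for the same byte budget instead of forcing a fresh allocation from the OS, which is precisely the behavior the per-size caching denies.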
--
Karl Denninger
***@denninger.net <mailto:***@denninger.net>
/The Market Ticker/


