Discussion:
Review request #3995: Memory leak of compressed buffers in l2arc_write_done
Saso Kiselkov
2013-08-05 21:18:18 UTC
Permalink
I'd like to ask for a quick review of a fix for an issue discovered by
jimmyH of the ZFS on Linux project. There is a race condition in
l2arc_write_done which can result in the leaking of an L2ARC buffer.

Webrev: http://cr.illumos.org/~webrev/skiselkov/3995/
Issue: https://www.illumos.org/issues/3995

Cheers,
--
Saso
Matthew Ahrens
2013-08-05 21:42:28 UTC
Permalink
Looks good to me.

--matt
Post by Saso Kiselkov
I'd like to ask for a quick review of a fix for an issue discovered by
jimmyH of the ZFS on Linux project. There is a race condition in
l2arc_write_done which can result in the leaking of an L2ARC buffer.
Webrev: http://cr.illumos.org/~webrev/skiselkov/3995/
Issue: https://www.illumos.org/issues/3995
Cheers,
--
Saso
George Wilson
2013-08-05 22:42:26 UTC
Permalink
Saso,

LGTM. I was thinking through the problem and it seems like a similar
thing could occur here (although less likely):

l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}

Presumably if the I/O got an error and we can't grab the hash_lock, then we
will not clean up the b_l2hdr. Is that correct?

Thanks,
George
Post by Saso Kiselkov
I'd like to ask for a quick review of a fix for an issue discovered by
jimmyH of the ZFS on Linux project. There is a race condition in
l2arc_write_done which can result in the leaking of an L2ARC buffer.
Webrev: http://cr.illumos.org/~webrev/skiselkov/3995/
Issue: https://www.illumos.org/issues/3995
Cheers,
Saso Kiselkov
2013-08-05 23:52:34 UTC
Permalink
Post by George Wilson
Saso,
LGTM. I was thinking through the problem and it seems like a similar
Thanks for the review.
Post by George Wilson
l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
Presumably if the io got an error and we can't grab the hash_lock then
will not clean up the b_l2hdr. Is that correct?
Nope, this bit looks leak-free to me. "abl2" is the same as ab->b_l2hdr,
so the assignment just clears the pointer and we subsequently use abl2
to free it. Unfortunately the L2ARC code base uses two names to refer to
the L2-header: 'abl2' (arc buffer level 2) and l2hdr (level 2 header -
which I personally prefer). I've been trying to make the variable naming
more consistent, but I probably missed a few bits. I can roll the name
change for occurrences of 'abl2' into this fix.

Cheers,
--
Saso
George Wilson
2013-08-06 12:50:03 UTC
Permalink
Post by Saso Kiselkov
Post by George Wilson
Saso,
LGTM. I was thinking through the problem and it seems like a similar
Thanks for the review.
Post by George Wilson
l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
Presumably if the io got an error and we can't grab the hash_lock then
will not clean up the b_l2hdr. Is that correct?
Nope, this bit looks leak-free to me. "abl2" is the same as ab->b_l2hdr,
so the initial set just clears the pointer and we subsequently use abl2
to free it. Unfortunately the L2ARC code base uses two names to refer to
the L2-header: 'abl2' (arc buffer level 2) and l2hdr (level 2 header -
which I personally prefer). I've been trying to make the variable naming
more consistent, but I probably missed a few bits. I can roll the name
change for occurences of 'abl2' into this fix.
Cheers,
Let me ask this a different way. Assume a case where the I/O gets an
error and the arc_buf_hdr_t is unable to get the hash_lock. Who does the
kmem_free on the b_l2hdr for that arc buf?

- George
Saso Kiselkov
2013-08-06 13:05:17 UTC
Permalink
Post by George Wilson
Post by Saso Kiselkov
Post by George Wilson
Saso,
LGTM. I was thinking through the problem and it seems like a similar
Thanks for the review.
Post by George Wilson
l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
Presumably if the io got an error and we can't grab the hash_lock then
will not clean up the b_l2hdr. Is that correct?
Nope, this bit looks leak-free to me. "abl2" is the same as ab->b_l2hdr,
so the initial set just clears the pointer and we subsequently use abl2
to free it. Unfortunately the L2ARC code base uses two names to refer to
the L2-header: 'abl2' (arc buffer level 2) and l2hdr (level 2 header -
which I personally prefer). I've been trying to make the variable naming
more consistent, but I probably missed a few bits. I can roll the name
change for occurences of 'abl2' into this fix.
Cheers,
Let me ask this a different way. Assume a case where the I/O gets an
error and the arc_buf_hdr_t is unable to get the hash_lock. Who does the
kmem_free on the b_l2hdr for that arc buf?
So the sequence of conditions here is: *first* we try to acquire the hash
lock, and only then do we check for I/O errors. If we miss the hash lock,
the buffer never leaves the ARC_L2_WRITING state, meaning reads to it will
not be allowed. If the write failed it won't matter, since we'll never read
it back. The b_l2hdr struct will be freed once l2arc_evict comes around to
evict this block from L2ARC.
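
Roughly, the ordering in l2arc_write_done looks like this (a simplified
sketch, not the verbatim illumos source):

	/*
	 * Simplified walk over the just-written buffers: the hash lock is
	 * attempted before any per-buffer work, including the error check
	 * quoted earlier in this thread.
	 */
	for (ab = list_tail(buflist); ab != NULL; ab = ab_prev) {
		ab_prev = list_prev(buflist, ab);
		hash_lock = HDR_LOCK(ab);
		if (!mutex_tryenter(hash_lock)) {
			/*
			 * Missed the lock: skip this buffer entirely. It
			 * keeps ARC_L2_WRITING set, so it will never be
			 * read back; l2arc_evict() frees its b_l2hdr later.
			 */
			continue;
		}
		if (zio->io_error != 0) {
			/* error path quoted above: drop the L2ARC entry */
		}
		ab->b_flags &= ~ARC_L2_WRITING;
		mutex_exit(hash_lock);
	}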

Cheers,
--
Saso
George Wilson
2013-08-06 13:40:27 UTC
Permalink
Post by Saso Kiselkov
Post by George Wilson
Post by Saso Kiselkov
Post by George Wilson
Saso,
LGTM. I was thinking through the problem and it seems like a similar
Thanks for the review.
Post by George Wilson
l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
Presumably if the io got an error and we can't grab the hash_lock then
will not clean up the b_l2hdr. Is that correct?
Nope, this bit looks leak-free to me. "abl2" is the same as ab->b_l2hdr,
so the initial set just clears the pointer and we subsequently use abl2
to free it. Unfortunately the L2ARC code base uses two names to refer to
the L2-header: 'abl2' (arc buffer level 2) and l2hdr (level 2 header -
which I personally prefer). I've been trying to make the variable naming
more consistent, but I probably missed a few bits. I can roll the name
change for occurences of 'abl2' into this fix.
Cheers,
Let me ask this a different way. Assume a case where the I/O gets an
error and the arc_buf_hdr_t is unable to get the hash_lock. Who does the
kmem_free on the b_l2hdr for that arc buf?
So the sequence of conditions here is *first* we try to acquire the hash
lock and only then we attempt to check for IO errors. If the hash lock
misses out, the buffer will never leave the ARC_L2_WRITING state,
meaning, reads to it will not be allowed. If the write failed it won't
matter, since we'll never read it back. The b_l2hdr struct will be freed
once l2arc_evict comes around to try to evict this block from L2ARC.
Cheers,
Thanks for the clarification. With that in mind, should the freeing of
b_tmp_cdata follow the same logic? It seems like it would make more
sense for all things that fail to grab the hash_lock to get cleaned up
in a similar place.

- George
Saso Kiselkov
2013-08-06 14:37:01 UTC
Permalink
Post by George Wilson
Post by Saso Kiselkov
Post by George Wilson
Post by Saso Kiselkov
Post by George Wilson
Saso,
LGTM. I was thinking through the problem and it seems like a similar
Thanks for the review.
Post by George Wilson
l2arc_write_done()
{
<snip>
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
ARCSTAT_INCR(arcstat_l2_asize,
-abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
Presumably if the io got an error and we can't grab the hash_lock then
will not clean up the b_l2hdr. Is that correct?
Nope, this bit looks leak-free to me. "abl2" is the same as
ab->b_l2hdr,
so the initial set just clears the pointer and we subsequently use abl2
to free it. Unfortunately the L2ARC code base uses two names to refer to
the L2-header: 'abl2' (arc buffer level 2) and l2hdr (level 2 header -
which I personally prefer). I've been trying to make the variable naming
more consistent, but I probably missed a few bits. I can roll the name
change for occurences of 'abl2' into this fix.
Cheers,
Let me ask this a different way. Assume a case where the I/O gets an
error and the arc_buf_hdr_t is unable to get the hash_lock. Who does the
kmem_free on the b_l2hdr for that arc buf?
So the sequence of conditions here is *first* we try to acquire the hash
lock and only then we attempt to check for IO errors. If the hash lock
misses out, the buffer will never leave the ARC_L2_WRITING state,
meaning, reads to it will not be allowed. If the write failed it won't
matter, since we'll never read it back. The b_l2hdr struct will be freed
once l2arc_evict comes around to try to evict this block from L2ARC.
Cheers,
Thanks for the clarification. With that in mind, should the freeing of
b_tmp_cdata follow the same logic? It seems like it would make more
sense for all things that fail to grab the hash_lock to get cleaned up
in a similar place.
I think a better approach would be to replace the tryenter() with a
regular enter() operation. The l2arc_write_done callback is only ever
invoked in the context of the l2arc feed thread (it's invoked by the
zio_wait() at the end of l2arc_write_buffers) and I see no reason why we
should be so time-critical in this portion of the writer. My guess is
this was baked in because l2ad_writing = B_TRUE means that reads to the
L2ARC device are denied, resulting in increased L2 cache misses during
the writing phase, so the motivation was to "get out of there ASAP". But
that was based on some problematic hardware Sun was shipping inside of
their storage appliances (the cachezillas in the storage appliance
product). In fact, I think the whole l2arc_norw machinery ought to be
removed, or at least reversed (so that no-read-while-write behavior is
off by default).
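
For context, the gate that l2arc_norw controls on the read side is roughly
the following (paraphrased from memory, not the exact source):

	/*
	 * In the ARC read path, before issuing a read to the cache device:
	 * "vd" is the L2ARC vdev holding the block and "devw" reflects the
	 * device's l2ad_writing flag.
	 */
	boolean_t devw = ab->b_l2hdr->b_dev->l2ad_writing;

	if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
		/* satisfy the read from the L2ARC device */
	} else {
		/* fall back to reading the block from the pool disks */
	}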

Cheers,
--
Saso
Saso Kiselkov
2013-08-06 15:25:56 UTC
Permalink
I have a new webrev up for issue #3995 and the newly introduced issue
#3997. The two issues are:

#3995: Fixes a memory leak in l2arc_write_done. This is done by
replacing the mutex_tryenter with a regular mutex_enter, since
#3997 below changes how we handle l2arc_norw and l2ad_writing,
so a little wait here isn't an issue and should stop us
throwing away good L2ARC buffers simply because of a hash lock
miss.

#3997: Reverses the default l2arc_norw behavior to instead allow reads
to an L2ARC device that's currently writing. Users with buggy
L2ARC hardware that this problem affects can always alter this
to disallow reads while writing. When disallowing reads while
writing is set, there is a potential performance bottleneck in
#3995. Normally the l2ad_writing flag remains set during the
entire write-done callback, denying reads to the entire L2ARC
device, so it would be crucial that we exit the callback as
soon as possible. The change in #3995 can extend the duration of
this function by making it patiently wait for the hash lock, so
we instead moved the lowering of l2ad_writing to the start of
the write-done callback itself. Now as soon as the physical I/O
is complete and the write-done callback is called, we mark the
device as "not writing anymore", allowing reads to all stored
buffers (except those marked in the previous fill cycle as
ARC_L2_WRITING), allowing us to take the time to go through
all just written L2ARC buffers, acquire their hash locks and
properly mark them as written.
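
In code terms, the shape of the change is roughly this (a simplified sketch
of the new write-done entry, not the verbatim webrev diff):

	static void
	l2arc_write_done(zio_t *zio)
	{
		l2arc_write_callback_t *cb = zio->io_private;
		l2arc_dev_t *dev = cb->l2wcb_dev;

		/*
		 * #3997: clear l2ad_writing as soon as the physical I/O has
		 * completed, so reads to the device are allowed again while
		 * we walk the buffer list below.
		 */
		dev->l2ad_writing = B_FALSE;

		mutex_enter(&l2arc_buflist_mtx);
		/*
		 * #3995: walk the just-written buffers here, taking each
		 * hash lock with a blocking mutex_enter() instead of the
		 * old mutex_tryenter(), so no buffer is skipped and leaked.
		 */
		mutex_exit(&l2arc_buflist_mtx);
	}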

Webrevs:
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997

Issues:
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997

Cheers,
--
Saso
Saso Kiselkov
2013-08-06 16:12:33 UTC
Permalink
Correct webrev: http://cr.illumos.org/~webrev/skiselkov/3995_3997/

Sorry for the mixup.

Cheers,
--
Saso
Matthew Ahrens
2013-08-06 16:17:08 UTC
Permalink
Post by Saso Kiselkov
I have a new webrev up for issue #3995 and the newly introduced issue
#3995: Fixes a memory leak in l2arc_write_done. This is done by
replacing the mutex_tryenter with a regular mutex_enter, since
#3997 below changes how we handle l2arc_norw and l2ad_writing,
so a little wait here isn't an issue and should stop us
throwing away good L2ARC buffers simply because of a hash lock
miss.
You're sure that we weren't using mutex_tryenter(hash_lock) because it
would be unsafe to wait for the mutex? I.e. this change can't introduce a
lock-ordering problem? What is the lock ordering between the hash lock and
any other mutexes we might hold at this point (e.g. l2arc_buflist_mtx)?
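
For instance, the interleaving I'd be worried about is something along these
lines (hypothetical threads, not pointing at a specific code path):

	/* Thread A: l2arc_write_done() with the proposed blocking enter */
	mutex_enter(&l2arc_buflist_mtx);
	mutex_enter(hash_lock);			/* blocks if thread B holds it */

	/* Thread B: some other ARC path taking the locks in the other order */
	mutex_enter(hash_lock);
	mutex_enter(&l2arc_buflist_mtx);	/* blocks on thread A -> deadlock */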

--matt
Post by Saso Kiselkov
#3997: Reverses the default l2arc_norw behavior to instead allow reads
to an L2ARC device that's currently writing. Users with buggy
L2ARC hardware that this problem affects can always alter this
to disallow reads while writing. When disallowing reads while
writing is set, there is a potential performance bottleneck in
#3995. Normally the l2ad_writing flag remains set during the
entire write-done callback, denying reads to the entire L2ARC
device, so it would be crucial that we exit the callback as
soon as possible. The change in #3995 can extend the duration of
this function by making it patiently wait for the hash lock, so
we instead moved the lowering of l2ad_writing to the start of
the write-done callback itself. Now as soon as the physical I/O
is complete and the write-done callback is called, we mark the
device as "not writing anymore", allowing reads to all stored
buffers (except those marked in the previous fill cycle as
ARC_L2_WRITING), allowing us to take the time to go through
all just written L2ARC buffers, acquire their hash locks and
properly mark them as written.
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997
Cheers,
--
Saso
Saso Kiselkov
2013-08-06 16:19:55 UTC
Permalink
Post by Matthew Ahrens
You're sure that we weren't using mutex_tryenter(hash_lock) because it
would be unsafe to wait for the mutex? I.e. this change can't introduce
a lock-ordering problem? What is the lock ordering between the hash
lock and any other mutexes we might hold at this point (e.g.
l2arc_buflist_mtx)?
I considered this possibility, but I missed l2arc_buflist_mtx. I'll need
to rethink this.

Cheers,
--
Saso
Garrett D'Amore
2013-08-06 16:43:03 UTC
Permalink
Post by Saso Kiselkov
I have a new webrev up for issue #3995 and the newly introduced issue
#3995: Fixes a memory leak in l2arc_write_done. This is done by
replacing the mutex_tryenter with a regular mutex_enter, since
#3997 below changes how we handle l2arc_norw and l2ad_writing,
so a little wait here isn't an issue and should stop us
throwing away good L2ARC buffers simply because of a hash lock
miss.
#3997: Reverses the default l2arc_norw behavior to instead allow reads
to an L2ARC device that's currently writing. Users with buggy
L2ARC hardware that this problem affects can always alter this
to disallow reads while writing. When disallowing reads while
writing is set, there is a potential performance bottleneck in
#3995. Normally the l2ad_writing flag remains set during the
entire write-done callback, denying reads to the entire L2ARC
device, so it would be crucial that we exit the callback as
soon as possible. The change in #3995 can extend the duration of
this function by making it patiently wait for the hash lock, so
we instead moved the lowering of l2ad_writing to the start of
the write-done callback itself. Now as soon as the physical I/O
is complete and the write-done callback is called, we mark the
device as "not writing anymore", allowing reads to all stored
buffers (except those marked in the previous fill cycle as
ARC_L2_WRITING), allowing us to take the time to go through
all just written L2ARC buffers, acquire their hash locks and
properly mark them as written.
I'm uncomfortable with the latter change, unless we can make the detection and handling of known buggy devices (cachezillas) automatic.

- Garrett
Post by Saso Kiselkov
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997
https://www.illumos.org/issues/3995
https://www.illumos.org/issues/3997
Cheers,
--
Saso
Saso Kiselkov
2013-08-06 17:21:20 UTC
Permalink
Post by Garrett D'Amore
I'm uncomfortable with the latter change, unless we can make the
detection and handling of known buggy devices (cachezillas) automatic.
There was a discussion on this in the issue #3137 thread. Brendan
remarked back then:

"That only existed because our SSDs didn't do separate read and write
command queues properly (firmware or driver bug), and writes interfered
with reads. If modern SSDs don't have a problem with simultaneous reads
& writes, then this can be set to false."

I'm not aware of any SSDs that would do this, so it must have been some
odd buggy firmware in some Sun product, and so this behavior got baked
into the defaults for obvious reasons (ZFS was Sun's product). This
should never have been the default and it has real-world performance
implications (L2ARC hits during the write period fall back to the pool
disks, resulting in occasional performance dips).

Based on Brendan's description I suspect there's no easy way to reliably
programmatically detect this, but it appears to me to be highly odd to
force a performance degradation onto everybody simply because Sun
happened to ship some broken pieces of kit many years ago (presumably >5
years by now).

Cheers,
--
Saso
Matthew Ahrens
2013-08-06 21:28:05 UTC
Permalink
Post by Saso Kiselkov
Post by Garrett D'Amore
I'm uncomfortable with the latter change, unless we can make the
detection and handling of known buggy devices (cachezillas) automatic.
There was a discussion on this in the issue #3137 thread on this.
"That only existed because our SSDs didn't do separate read and write
command queues properly (firmware or driver bug), and writes interfered
with reads. If modern SSDs don't have a problem with simultaneous reads
& writes, then this can be set to false."
I'm not aware of any SSDs that would do this, so it must have been some
odd buggy firmware in some Sun product, and so this behavior got baked
into the defaults for obvious reasons (ZFS was Sun's product). This
should never have been the default and it has real-world performance
implications (ARC hits during the write period will fail-back to pool
disks, resulting in occasional performance dips).
Based on Brendan's description I suspect there's no easy way to reliably
programatically detect this, but it appears to me to be highly odd to
force a performance degradation onto everybody simply because Sun
happened to ship some broken pieces of kit many years ago (presumably >5
years by now).
I agree with Saso. Modern SSDs and controllers do not have the extreme
performance issues that this code was implemented to work around. This
behavior of falling back on the main pool is hurting performance more than
it is helping.

--matt



Garrett D'Amore
2013-08-07 03:53:25 UTC
Permalink
Post by Saso Kiselkov
Post by Garrett D'Amore
I'm uncomfortable with the latter change, unless we can make the
detection and handling of known buggy devices (cachezillas) automatic.
There was a discussion on this in the issue #3137 thread on this.
"That only existed because our SSDs didn't do separate read and write
command queues properly (firmware or driver bug), and writes interfered
with reads. If modern SSDs don't have a problem with simultaneous reads
& writes, then this can be set to false."
I'm not aware of any SSDs that would do this, so it must have been some
odd buggy firmware in some Sun product, and so this behavior got baked
into the defaults for obvious reasons (ZFS was Sun's product). This
should never have been the default and it has real-world performance
implications (ARC hits during the write period will fail-back to pool
disks, resulting in occasional performance dips).
My point is, can we detect and change the default for those broken devices? Are they in use with any illumos systems?
Post by Saso Kiselkov
Based on Brendan's description I suspect there's no easy way to reliably
programatically detect this, but it appears to me to be highly odd to
force a performance degradation onto everybody simply because Sun
happened to ship some broken pieces of kit many years ago (presumably >5
years by now).
The concern is: suboptimal performance but correct function for everyone, or optimum performance but total breakage for others. We've tried to stay away from breaking people who have working configurations.

Some work to detect the presence of broken cachezillas (perhaps just by inquiry data) would probably mitigate this greatly.

- Garrett
Matthew Ahrens
2013-08-07 04:13:42 UTC
Permalink
Post by Garrett D'Amore
Post by Saso Kiselkov
Post by Garrett D'Amore
I'm uncomfortable with the latter change, unless we can make the
detection and handling of known buggy devices (cachezillas) automatic.
There was a discussion on this in the issue #3137 thread on this.
"That only existed because our SSDs didn't do separate read and write
command queues properly (firmware or driver bug), and writes interfered
with reads. If modern SSDs don't have a problem with simultaneous reads
& writes, then this can be set to false."
I'm not aware of any SSDs that would do this, so it must have been some
odd buggy firmware in some Sun product, and so this behavior got baked
into the defaults for obvious reasons (ZFS was Sun's product). This
should never have been the default and it has real-world performance
implications (ARC hits during the write period will fail-back to pool
disks, resulting in occasional performance dips).
My point is, can we detect and change the default for those broken
devices? Are they in use with any illumos systems?
Post by Saso Kiselkov
Based on Brendan's description I suspect there's no easy way to reliably
programatically detect this, but it appears to me to be highly odd to
force a performance degradation onto everybody simply because Sun
happened to ship some broken pieces of kit many years ago (presumably >5
years by now).
The concern is, suboptimal performance, but correct function, for
everyone. Or optimum performance, but total breakage for others. We've
tried to stay away from breaking people who have working configurations.
Sorry, what's going to cause total breakage? Are you saying that there are
devices that do not function when sent both read and write requests at the
same time?

--matt



Garrett D'Amore
2013-08-07 05:47:27 UTC
Permalink
Post by Garrett D'Amore
The concern is, suboptimal performance, but correct function, for everyone. Or optimum performance, but total breakage for others. We've tried to stay away from breaking people who have working configurations.
Sorry, what's going to cause total breakage? Are you saying that there are devices that do not function when sent both read and write requests at the same time?
I thought that was the situation with the cachezillas … that the change addressed not just performance but basic functionality.

If that's not the case, and this is strictly a performance consideration for buggy devices, then I retract my objection.

- Garrett





Saso Kiselkov
2013-08-07 08:12:12 UTC
Permalink
Post by Garrett D'Amore
The concern is, suboptimal performance, but correct function, for
everyone. Or optimum performance, but total breakage for others.
We've tried to stay away from breaking people who have working
configurations.
Sorry, what's going to cause total breakage? Are you saying that
there are devices that do not function when sent both read and write
requests at the same time?
I thought that was what the situation was with cachezillas … that the
change made was not just performance but basic functionality.
If that's not the case, and this is strictly a performance consideration
for buggy devices, then I retract my objection.
Unfortunately the only info I have is what Brendan gave. Perhaps some of
the former Fishworks guys know more, but unless and until they decide to
share this with the world, we can but guess at which devices *might* be
affected and how it affects their behavior.

Cheers,
--
Saso
Keith Wesolowski
2013-08-07 16:05:45 UTC
Permalink
I thought that was what the situation was with cachezillas … that the change made was not just performance but basic functionality.
If that's not the case, and this is strictly a performance consideration for buggy devices, then I retract my objection.
The original Readzillas were quite old and I am very sure they are no
longer being sold by anyone, except possibly by Oracle as spares
for first-generation Fishworks systems (and even then, I doubt it very
much as they have had no source of them for years). I would be very
surprised if any system running illumos today contains them. I know of
no modern SSD with similar issues. This is clearly an instance where we
should be optimising for the common, non-broken case and handling any
rare exceptions as they may arise. The hardware in question is 6 years
old, was not widely sold, and went EOL in 2009. Let's move on.
Garrett D'Amore
2013-08-07 17:04:19 UTC
Permalink
Post by Keith Wesolowski
I thought that was what the situation was with cachezillas … that the change made was not just performance but basic functionality.
If that's not the case, and this is strictly a performance consideration for buggy devices, then I retract my objection.
The original Readzillas were quite old and I am very sure they are no
longer being sold by anyone except maybe possibly by Oracle as spares
for first-generation Fishworks systems (and even then, I doubt it very
much as they have had no source of them for years). I would be very
surprised if any system running illumos today contains them. I know of
no modern SSD with similar issues. This is clearly an instance where we
should be optimising for the common, non-broken case and handling any
rare exceptions as they may arise. The hardware in question is 6 years
old, was not widely sold, and went EOL in 2009. Let's move on.
2009 isn't *that* long ago. However, if we're reasonably sure that nobody is using illumos on those fishworks appliances, and that the devices in question weren't in use anywhere else, then I tend to agree. If the problem with those devices was only a *performance* and not a *correctness* problem, then I *definitely* agree.

That said, I know some fishworks customers had inquired with other illumos-based distro providers about running illumos on their kit. (I know Nexenta had contemplated this at one point, at least.)

(I just hate making changes that break existing deployments.)

It might be a good idea to have something like a release note/FAQ section in the code that distro maintainers could pick up and integrate into their own notes.

- Garrett
Tim Cook
2013-08-07 20:10:18 UTC
Permalink
Post by Garrett D'Amore
Post by Keith Wesolowski
Post by Garrett D'Amore
I thought that was what the situation was with cachezillas … that the
change made was not just performance but basic functionality.
Post by Keith Wesolowski
Post by Garrett D'Amore
If that's not the case, and this is strictly a performance
consideration for buggy devices, then I retract my objection.
Post by Keith Wesolowski
The original Readzillas were quite old and I am very sure they are no
longer being sold by anyone except maybe possibly by Oracle as spares
for first-generation Fishworks systems (and even then, I doubt it very
much as they have had no source of them for years). I would be very
surprised if any system running illumos today contains them. I know of
no modern SSD with similar issues. This is clearly an instance where we
should be optimising for the common, non-broken case and handling any
rare exceptions as they may arise. The hardware in question is 6 years
old, was not widely sold, and went EOL in 2009. Let's move on.
2009 isn't *that* long ago. However, if we're reasonably sure that nobody
is using illumos on those fishworks appliances, and that the devices in
question weren't in use anywhere else, then I tend to agree. If the
problem with those devices was only a *performance* and not a *correctness*
problem, then I *definitely* agree.
That said, I know some fishworks customers had inquired with other illumos
based distro providers about running illumos on their kit. (I know Nexenta
had contemplated this at one point, at least.)
(I just hate making changes that break existing deployments.)
It might be a good idea to have something like a release note/FAQ section
in the code that distro maintainers could pick up and integrate into their
own notes.
- Garrett
Even if they do wish to use illumos on those systems, I don't think it's
unreasonable to ask them to replace those drives. I would suspect that at
4-5 years old, the drives are at the point where they should be up for
replacement anyway.

--Tim



Jim Klimov
2013-08-07 20:19:41 UTC
Permalink
Post by Tim Cook
Even if they do wish to use Illumos on those systems, I don't think it's
unreasonable to ask them to replace those drives? I would suspect at
4-5 years old, the drives are at the point they should be up for
replacement anyways.
Makes sense, for SSDs especially. Still, they could just as well be
informed to set toggle-flag X in /etc/system while installing their
illumos-based OS (given that we will have a flag for legacy behavior).

My2c,
Jim
Saso Kiselkov
2013-08-19 23:57:41 UTC
Permalink
Post by Garrett D'Amore
2009 isn't *that* long ago. However, if we're reasonably sure that nobody is using illumos on those fishworks appliances, and that the devices in question weren't in use anywhere else, then I tend to agree. If the problem with those devices was only a *performance* and not a *correctness* problem, then I *definitely* agree.
That said, I know some fishworks customers had inquired with other illumos based distro providers about running illumos on their kit. (I know Nexenta had contemplated this at one point, at least.)
(I just hate making changes that break existing deployments.)
It might be a good idea to have something like a release note/FAQ section in the code that distro maintainers could pick up and integrate into their own notes.
Just as a further note, and so other people know about this, I asked
Brendan and here's his response on why l2arc_norw = B_TRUE was put in
place:

-------- Original Message --------
Subject: Re: l2arc_norw default change - what devices might be affected?
Date: Mon, 12 Aug 2013 21:05:46 -0700
From: Brendan Gregg <***@joyent.com>
To: Saso Kiselkov <***@gmail.com>

Right, the problem was with the drives we were using in 2008, which from
memory were either the STEC ZeusIOPS or the Intel X25s. They were
supposed to have separate queues for reads and writes, but that
functionality wasn't working. I /think/ that was something we could fix
in the kernel, in our lower level driver stack, and there was a ticket
filed. But the ticket wasn't a high enough priority to be fixed. The SSD
vendor engineers were surprised that we weren't using the functionality.

The problem it caused was reads queueing behind writes. So, as you know,
the L2ARC writes in batches, which can take 100s of milliseconds to
complete. Reads could queue behind those, and so with both reads and
writes at the same time there were some reads taking 10s and 100s of
milliseconds. Not great for a device supposed to be super fast for
random read workloads!

What's the status today with drives? I don't know. Other than testing it
with the L2ARC, you may be able to test it using dd. E.g., do a random
read workload to /dev/rdsk and dtrace the I/O latency. Then, issue some
dd writes to the same device for 100s of Mbytes. See how much it
interferes with the reads.

I'd lean towards Matt's opinion though - this was set a long time ago,
and now probably only affects older devices.

Brendan
Matthew Ahrens
2013-10-18 18:13:07 UTC
Permalink
So what's the status on setting l2arc_norw to B_FALSE by default? It looks
like the only concerns are coming from Garrett. Could you work with him to
sort it out? Perhaps some measurements of the performance improvement that
this change will bring would be convincing.

--matt
Post by Garrett D'Amore
Post by Garrett D'Amore
2009 isn't *that* long ago. However, if we're reasonably sure that
nobody is using illumos on those fishworks appliances, and that the devices
in question weren't in use anywhere else, then I tend to agree. If the
problem with those devices was only a *performance* and not a *correctness*
problem, then I *definitely* agree.
Post by Garrett D'Amore
That said, I know some fishworks customers had inquired with other
illumos based distro providers about running illumos on their kit. (I know
Nexenta had contemplated this at one point, at least.)
Post by Garrett D'Amore
(I just hate making changes that break existing deployments.)
It might be a good idea to have something like a release note/FAQ
section in the code that distro maintainers could pick up and integrate
into their own notes.
Just as a further note and so other people know about this, I asked
Brendan about this and here's his response on why l2arc_norw = B_TRUE
-------- Original Message --------
Subject: Re: l2arc_norw default change - what devices might be
affected?
Date: Mon, 12 Aug 2013 21:05:46 -0700
Right, the problem was with the drives we were using in 2008, which from
memory were either the STEC ZeusIOPS or the Intel X25s. They were
supposed to have separate queues for reads and writes, but that
functionality wasn't working. I /think/ that was something we could fix
in the kernel, in our lower level driver stack, and there was a ticket
filed. But the ticket wasn't a high enough priority to be fixed. The SSD
vendor engineers were surprised that we weren't using the functionality.
The problem it caused was reads queueing behind writes. So, as you know,
the L2ARC writes in batches, which can take 100s of milliseconds to
complete. Reads could queue behind those, and so with both reads and
writes at the same time there were some reads taking 10s and 100s of
milliseconds. Not great for a device supposed to be super fast for
random read workloads!
What's the status today with drives? I don't know. Other than testing it
with the L2ARC, you may be able to test it using dd. Eg, do a random
read workload to /dev/rdsk and dtrace the I/O latency. Then, issues some
dd writes to the same device for 100s of Mbytes. See how much it
interferes with the reads.
I'd lean towards Matt's opinion though - this was set a long time ago,
and now probably only affects older devices.
Brendan
Richard Yao
2013-10-18 19:11:27 UTC
Permalink
Post by Matthew Ahrens
So what's the status on setting l2arc_norw to B_FLASE by default? It looks
like the only concerns are coming from Garrett. Could you work with him to
sort it out? Perhaps some measurements of the performance improvement that
this change will bring would be convincing.
--matt
That change was already made in ZFSOnLinux and it is one of the things
we are discussing in the other thread. How about we just drop that
change from the webrev, merge the #3995 fix (which looks fine to me) and
move discussion of changing l2arc_norw to the ZFSOnLinux change porting
discussion?
Post by Matthew Ahrens
Post by Garrett D'Amore
Post by Garrett D'Amore
2009 isn't *that* long ago. However, if we're reasonably sure that
nobody is using illumos on those fishworks appliances, and that the devices
in question weren't in use anywhere else, then I tend to agree. If the
problem with those devices was only a *performance* and not a *correctness*
problem, then I *definitely* agree.
Post by Garrett D'Amore
That said, I know some fishworks customers had inquired with other
illumos based distro providers about running illumos on their kit. (I know
Nexenta had contemplated this at one point, at least.)
Post by Garrett D'Amore
(I just hate making changes that break existing deployments.)
It might be a good idea to have something like a release note/FAQ
section in the code that distro maintainers could pick up and integrate
into their own notes.
Just as a further note and so other people know about this, I asked
Brendan about this and here's his response on why l2arc_norw = B_TRUE
-------- Original Message --------
Subject: Re: l2arc_norw default change - what devices might be
affected?
Date: Mon, 12 Aug 2013 21:05:46 -0700
Right, the problem was with the drives we were using in 2008, which from
memory were either the STEC ZeusIOPS or the Intel X25s. They were
supposed to have separate queues for reads and writes, but that
functionality wasn't working. I /think/ that was something we could fix
in the kernel, in our lower level driver stack, and there was a ticket
filed. But the ticket wasn't a high enough priority to be fixed. The SSD
vendor engineers were surprised that we weren't using the functionality.
The problem it caused was reads queueing behind writes. So, as you know,
the L2ARC writes in batches, which can take 100s of milliseconds to
complete. Reads could queue behind those, and so with both reads and
writes at the same time there were some reads taking 10s and 100s of
milliseconds. Not great for a device supposed to be super fast for
random read workloads!
What's the status today with drives? I don't know. Other than testing it
with the L2ARC, you may be able to test it using dd. Eg, do a random
read workload to /dev/rdsk and dtrace the I/O latency. Then, issues some
dd writes to the same device for 100s of Mbytes. See how much it
interferes with the reads.
I'd lean towards Matt's opinion though - this was set a long time ago,
and now probably only affects older devices.
Brendan
Richard Yao
2013-10-18 19:15:36 UTC
Permalink
Post by Richard Yao
Post by Matthew Ahrens
So what's the status on setting l2arc_norw to B_FLASE by default? It looks
like the only concerns are coming from Garrett. Could you work with him to
sort it out? Perhaps some measurements of the performance improvement that
this change will bring would be convincing.
--matt
That change was already made in ZFSOnLinux and it is one of the things
we are discussing in the other thread. How about we just drop that
change from the webrev, merge the #3995 fix (which looks fine to me) and
move discussion of changing l2arc_norw to the ZFSOnLinux chnage porting
discussion?
On second thought, I do have one problem with this webrev: why does
arcstat_l2_writes_hdr_miss go away?

I am not particularly attached to it, but I would like the rationale for
removing it to be stated.




Matthew Ahrens
2013-10-18 20:36:03 UTC
Permalink
Post by Richard Yao
Post by Matthew Ahrens
So what's the status on setting l2arc_norw to B_FLASE by default? It
looks
Post by Matthew Ahrens
like the only concerns are coming from Garrett. Could you work with him
to
Post by Matthew Ahrens
sort it out? Perhaps some measurements of the performance improvement
that
Post by Matthew Ahrens
this change will bring would be convincing.
--matt
That change was already made in ZFSOnLinux and it is one of the things
we are discussing in the other thread. How about we just drop that
change from the webrev, merge the #3995 fix (which looks fine to me) and
move discussion of changing l2arc_norw to the ZFSOnLinux chnage porting
discussion?
As I mentioned in my email to the ZFSonLinux mailing list, my advice is to
deal with the l2arc_norw change here as an illumos issue, rather than as a
port from Linux.

--matt
Post by Richard Yao
Post by Matthew Ahrens
Post by Garrett D'Amore
Post by Garrett D'Amore
2009 isn't *that* long ago. However, if we're reasonably sure that
nobody is using illumos on those fishworks appliances, and that the
devices
Post by Matthew Ahrens
Post by Garrett D'Amore
in question weren't in use anywhere else, then I tend to agree. If the
problem with those devices was only a *performance* and not a
*correctness*
Post by Matthew Ahrens
Post by Garrett D'Amore
problem, then I *definitely* agree.
Post by Garrett D'Amore
That said, I know some fishworks customers had inquired with other
illumos based distro providers about running illumos on their kit. (I
know
Post by Matthew Ahrens
Post by Garrett D'Amore
Nexenta had contemplated this at one point, at least.)
Post by Garrett D'Amore
(I just hate making changes that break existing deployments.)
It might be a good idea to have something like a release note/FAQ
section in the code that distro maintainers could pick up and integrate
into their own notes.
Just as a further note and so other people know about this, I asked
Brendan about this and here's his response on why l2arc_norw = B_TRUE
-------- Original Message --------
Subject: Re: l2arc_norw default change - what devices might be
affected?
Date: Mon, 12 Aug 2013 21:05:46 -0700
Right, the problem was with the drives we were using in 2008, which from
memory were either the STEC ZeusIOPS or the Intel X25s. They were
supposed to have separate queues for reads and writes, but that
functionality wasn't working. I /think/ that was something we could fix
in the kernel, in our lower level driver stack, and there was a ticket
filed. But the ticket wasn't a high enough priority to be fixed. The SSD
vendor engineers were surprised that we weren't using the functionality.
The problem it caused was reads queueing behind writes. So, as you know,
the L2ARC writes in batches, which can take 100s of milliseconds to
complete. Reads could queue behind those, and so with both reads and
writes at the same time there were some reads taking 10s and 100s of
milliseconds. Not great for a device supposed to be super fast for
random read workloads!
What's the status today with drives? I don't know. Other than testing it
with the L2ARC, you may be able to test it using dd. Eg, do a random
read workload to /dev/rdsk and dtrace the I/O latency. Then, issues some
dd writes to the same device for 100s of Mbytes. See how much it
interferes with the reads.
I'd lean towards Matt's opinion though - this was set a long time ago,
and now probably only affects older devices.
Brendan
Richard Yao
2013-10-18 20:40:16 UTC
Permalink
Post by Matthew Ahrens
Post by Richard Yao
Post by Matthew Ahrens
So what's the status on setting l2arc_norw to B_FLASE by default? It
looks
Post by Matthew Ahrens
like the only concerns are coming from Garrett. Could you work with him
to
Post by Matthew Ahrens
sort it out? Perhaps some measurements of the performance improvement
that
Post by Matthew Ahrens
this change will bring would be convincing.
--matt
That change was already made in ZFSOnLinux and it is one of the things
we are discussing in the other thread. How about we just drop that
change from the webrev, merge the #3995 fix (which looks fine to me) and
move discussion of changing l2arc_norw to the ZFSOnLinux chnage porting
discussion?
As I mentioned in my email to the ZFSonLinux mailing list, my advice is to
deal with the l2arc_norw here as an illumos issue, rather than as a port
from linux.
--matt
I had not seen that email yet, and I am completely fine with that idea.




Matthew Ahrens
2013-11-22 20:27:42 UTC
Permalink
It seems like everyone agreed that the fix for 3995 is good. Can you
please push that fix to illumos?

--matt
Post by Richard Yao
Post by Matthew Ahrens
Post by Richard Yao
Post by Matthew Ahrens
So what's the status on setting l2arc_norw to B_FLASE by default? It
looks
Post by Matthew Ahrens
like the only concerns are coming from Garrett. Could you work with
him
Post by Matthew Ahrens
Post by Richard Yao
to
Post by Matthew Ahrens
sort it out? Perhaps some measurements of the performance improvement
that
Post by Matthew Ahrens
this change will bring would be convincing.
--matt
That change was already made in ZFSOnLinux and it is one of the things
we are discussing in the other thread. How about we just drop that
change from the webrev, merge the #3995 fix (which looks fine to me) and
move discussion of changing l2arc_norw to the ZFSOnLinux chnage porting
discussion?
As I mentioned in my email to the ZFSonLinux mailing list, my advice is
to
Post by Matthew Ahrens
deal with the l2arc_norw here as an illumos issue, rather than as a port
from linux.
--matt
I had not seen that email yet.and I am completely fine with that idea.