pool hang during zpool remove cache device

Discussion:

Joerg Goltermann

2014-04-04 13:07:00 UTC

Hi,

Today I had to remove a cache device from a pool, which caused the
complete pool to hang for about 3 minutes.

I was too slow to debug the situation, but after looking at the code, I
think the function l2arc_evict() in arc.c is the problem.

The global mutex l2arc_buflist_mtx is entered at the beginning and, if
there is no hash lock miss, it is held until the complete list is evicted.

I did no further debugging or testing, but do you think something like
below can fix this?

The default for zfs_free_max_blocks is UINT64_MAX, but I use a smaller
number like 100000.

- Joerg

--- a/usr/src/uts/common/fs/zfs/arc.c
+++ b/usr/src/uts/common/fs/zfs/arc.c
@@ -4416,7 +4416,7 @@ l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
 	l2arc_buf_hdr_t *abl2;
 	arc_buf_hdr_t *ab, *ab_prev;
 	kmutex_t *hash_lock;
-	uint64_t taddr;
+	uint64_t taddr, freed;

 	buflist = dev->l2ad_buflist;

@@ -4444,10 +4444,15 @@ l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
 	    uint64_t, taddr, boolean_t, all);

 top:
+	freed = 0;
 	mutex_enter(&l2arc_buflist_mtx);
 	for (ab = list_tail(buflist); ab; ab = ab_prev) {
-		ab_prev = list_prev(buflist, ab);
+		if (++freed >= zfs_free_max_blocks) {
+			mutex_exit(&l2arc_buflist_mtx);
+			goto top;
+		}

+		ab_prev = list_prev(buflist, ab);
 		hash_lock = HDR_LOCK(ab);
 		if (!mutex_tryenter(hash_lock)) {
 			/*

Richard Elling

2014-04-05 20:21:11 UTC

Yes, we've used this technique elsewhere with success.

Post by Joerg Goltermann
default for zfs_free_max_blocks is UINT64_MAX, but I use a smaller
number like 100000.

I would argue 100000 is still too big. It doesn't really cost us to
drop the lock for the l2arc_evict() more often. How does 1000 sound?
-- richard
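
Editorial sketch: the bounded-hold idea discussed above can be illustrated in user space. This is not the arc.c change. It assumes a hypothetical doubly linked buflist_t guarded by a pthread mutex standing in for l2arc_buflist_mtx, and max_per_hold plays the role of the tunable (100000 or 1000 in the messages above). sched_yield() is only a user-space stand-in for letting other threads run.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer header on the device's buflist. */
typedef struct node {
	struct node *prev;
	struct node *next;
} node_t;

typedef struct {
	node_t *head;
	node_t *tail;
	pthread_mutex_t lock;	/* plays the role of l2arc_buflist_mtx */
} buflist_t;

static void
buflist_init(buflist_t *bl)
{
	bl->head = bl->tail = NULL;
	pthread_mutex_init(&bl->lock, NULL);
}

/* Append one freshly allocated node at the tail. */
static void
buflist_push(buflist_t *bl)
{
	node_t *n = calloc(1, sizeof (*n));

	assert(n != NULL);
	pthread_mutex_lock(&bl->lock);
	n->prev = bl->tail;
	if (bl->tail != NULL)
		bl->tail->next = n;
	else
		bl->head = n;
	bl->tail = n;
	pthread_mutex_unlock(&bl->lock);
}

/*
 * Evict every node from the tail, but free at most max_per_hold nodes
 * per lock hold. Returns how many times the lock was taken.
 */
static unsigned
evict_all_batched(buflist_t *bl, unsigned max_per_hold)
{
	unsigned holds = 0;
	int more = 1;

	assert(max_per_hold > 0);
	while (more) {
		unsigned freed = 0;

		pthread_mutex_lock(&bl->lock);
		holds++;
		while (bl->tail != NULL && freed < max_per_hold) {
			node_t *n = bl->tail;

			bl->tail = n->prev;
			if (bl->tail != NULL)
				bl->tail->next = NULL;
			else
				bl->head = NULL;
			free(n);
			freed++;
		}
		more = (bl->tail != NULL);
		pthread_mutex_unlock(&bl->lock);
		if (more)
			sched_yield();	/* give waiters a chance at the lock */
	}
	return (holds);
}
```

With 2500 entries and a batch size of 1000, the lock is taken three times instead of once, so a thread waiting on it is delayed by at most one batch.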

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/22820713-4fad4b89
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Kirill Davydychev

2014-04-05 20:50:30 UTC

Hi Joerg

Take a look at Nexenta’s async evict code, which covers both ARC and L2ARC. I think it should not be that hard to upstream this code to illumos. In the default ZFS implementation, ARC/L2ARC eviction is synchronous, and may block I/O.

Commits in question:

https://github.com/Nexenta/illumos-nexenta/commit/1d33b61bc4b5e2a83a9297ad588b96a117ef335e - initial feature implementation.
https://github.com/Nexenta/illumos-nexenta/commit/3fe92f463590e8ea8ea8b0b7fd39327facd154b5 - minor fix.

Best regards,
Kirill Davydychev
Enterprise Architect
Nexenta Systems, Inc.

Gordon Ross

2014-04-16 03:32:49 UTC

That locking change does seem like a bit of a "band aid", no?

The solution Nexenta folks came up with for this goes much further,
eliminating that long-held lock situation. It also allowed us to
significantly reduce export time.

Anyone interested in trying to get that upstream?

Gordon

--
Gordon Ross <***@nexenta.com>
Nexenta Systems, Inc. www.nexenta.com
Enterprise class storage for everyone

Joerg Goltermann

2014-04-16 08:50:53 UTC

Post by Gordon Ross
That locking change does seem like a bit of a "band aid", no?
The solution Nexenta folks came up with for this goes much further,
eliminating that long-held lock situation. It also allowed us to
significantly reduce export time.
Anyone interested in trying to get that upstream?

Gordon Ross

2014-04-16 16:47:11 UTC

Oh. I saw this fix internally, but now I'm not sure it ended up in
the current release.
Sorry, I'll have to look into that, which will take some time, so
don't wait for me.

Gordon

Post by Joerg Goltermann

Maybe I missed something, but Nexenta's version of _l2arc_evict, which
is called from the background task, will hold the l2arc_buflist_mtx too.
In one of the commit messages you can see "Real fix should rework l2arc
evict according to OS-53, but for now just longer queue should suffice."
What does OS-53 describe?
- Joerg

Matthew Ahrens

2014-04-05 21:52:32 UTC

Something like this makes sense. Please implement it similar to what I did
with arc_evict_iterations in arc_evict() and arc_evict_ghost(). E.g. use
the existing variable (arc_evict_iterations), and call kpreempt() after
dropping the lock. Ideally, keep track of how far we got through the
iteration with a "marker" object.

--matt
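
Editorial sketch: the marker technique described above can be shown with a small user-space list. Everything here is an assumption made for illustration: entry_t, mlist_t, the mlist_* helpers and walk_with_marker() are hypothetical names, a pthread mutex stands in for the list lock, and sched_yield() stands in for kpreempt(). It is not the arc.c implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical list entry; is_marker flags a placeholder, not real data. */
typedef struct entry {
	struct entry *prev;
	struct entry *next;
	int is_marker;
	int visits;
} entry_t;

typedef struct {
	entry_t *head;
	entry_t *tail;
	pthread_mutex_t lock;
} mlist_t;

static void
mlist_init(mlist_t *l)
{
	l->head = l->tail = NULL;
	pthread_mutex_init(&l->lock, NULL);
}

static void
mlist_append(mlist_t *l, entry_t *e)
{
	e->prev = l->tail;
	e->next = NULL;
	if (l->tail != NULL)
		l->tail->next = e;
	else
		l->head = e;
	l->tail = e;
}

static void
mlist_insert_after(mlist_t *l, entry_t *pos, entry_t *e)
{
	assert(pos != NULL);
	e->prev = pos;
	e->next = pos->next;
	if (pos->next != NULL)
		pos->next->prev = e;
	else
		l->tail = e;
	pos->next = e;
}

static void
mlist_remove(mlist_t *l, entry_t *e)
{
	if (e->prev != NULL)
		e->prev->next = e->next;
	else
		l->head = e->next;
	if (e->next != NULL)
		e->next->prev = e->prev;
	else
		l->tail = e->prev;
	e->prev = e->next = NULL;
}

/*
 * Walk from tail to head. After more than `iterations` entries, park a
 * marker after the current entry, drop the lock, yield, retake the lock
 * and resume from the marker. Returns the number of pauses taken.
 */
static int
walk_with_marker(mlist_t *l, int iterations)
{
	entry_t marker = { .is_marker = 1 };
	entry_t *e, *e_prev;
	int count = 0, pauses = 0;

	pthread_mutex_lock(&l->lock);
	for (e = l->tail; e != NULL; e = e_prev) {
		e_prev = e->prev;
		if (e->is_marker)
			continue;	/* some other walker's marker */
		if (count++ > iterations) {
			mlist_insert_after(l, e, &marker);
			pthread_mutex_unlock(&l->lock);
			sched_yield();
			pthread_mutex_lock(&l->lock);
			e_prev = marker.prev;	/* list may have changed */
			mlist_remove(l, &marker);
			count = 0;
			pauses++;
			continue;
		}
		e->visits++;
	}
	pthread_mutex_unlock(&l->lock);
	return (pauses);
}
```

Because the position is re-read from the marker after the lock is retaken, the walk resumes where it stopped rather than restarting from the tail, even if other threads changed the list in between.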

Joerg Goltermann

2014-04-09 06:37:34 UTC

Hi Matthew,

Post by Matthew Ahrens
Something like this makes sense. Please implement it similar to what I
did with arc_evict_iterations in arc_evict() and arc_evict_ghost().
E.g. use the existing variable (arc_evict_iterations), and call
kpreempt() after dropping the lock. Ideally, keep track of how far we
got through the iteration with a "marker" object.

Thank you for the input. I think this should do it:

diff --git a/usr/src/uts/common/fs/zfs/arc.c b/usr/src/uts/common/fs/zfs/arc.c
index 8bee031..2d510a3 100644
--- a/usr/src/uts/common/fs/zfs/arc.c
+++ b/usr/src/uts/common/fs/zfs/arc.c
@@ -4446,6 +4446,8 @@ l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t
 	arc_buf_hdr_t *ab, *ab_prev;
 	kmutex_t *hash_lock;
 	uint64_t taddr;
+	arc_buf_hdr_t marker = { 0 };
+	int count = 0;

 	buflist = dev->l2ad_buflist;

@@ -4477,6 +4479,27 @@ top:
 	for (ab = list_tail(buflist); ab; ab = ab_prev) {
 		ab_prev = list_prev(buflist, ab);

+		/* ignore markers */
+		if (ab->b_spa == 0)
+			continue;
+
+		/*
+		 * It may take a long time to evict all the bufs requested.
+		 * To avoid blocking all l2arc activity, periodically drop
+		 * the l2arc_buflist_mtx and give other threads a chance to
+		 * run before reacquiring the lock.
+		 */
+		if (count++ > arc_evict_iterations) {
+			list_insert_after(buflist, ab, &marker);
+			mutex_exit(&l2arc_buflist_mtx);
+			kpreempt(KPREEMPT_SYNC);
+			mutex_enter(&l2arc_buflist_mtx);
+			ab_prev = list_prev(buflist, &marker);
+			list_remove(buflist, &marker);
+			count = 0;
+			continue;
+		}
+
 		hash_lock = HDR_LOCK(ab);
 		if (!mutex_tryenter(hash_lock)) {
 			/*

Matthew Ahrens

2014-04-09 17:06:06 UTC

That looks like a good start. I think we would also need to ignore the
markers anywhere else that we iterate over l2ad_buflist. e.g.
l2arc_write_done, l2arc_write_buffers.

--matt
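
Editorial sketch: the point about other walkers can be illustrated with a minimal example. It assumes a hypothetical hdr_t whose spa field is zero only for markers, mirroring the b_spa == 0 test in the patch. buflist_bytes() is an invented walker for illustration, not a function in arc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; spa == 0 identifies a marker placeholder. */
typedef struct hdr {
	struct hdr *prev;
	uint64_t spa;
	uint64_t size;
} hdr_t;

static int
hdr_is_marker(const hdr_t *h)
{
	return (h->spa == 0);
}

/*
 * Sum the sizes of all real headers, walking from tail to head.
 * Markers carry no meaningful payload, so they must be skipped
 * before any of their other fields are interpreted.
 */
static uint64_t
buflist_bytes(const hdr_t *tail)
{
	uint64_t total = 0;

	for (const hdr_t *h = tail; h != NULL; h = h->prev) {
		if (hdr_is_marker(h))
			continue;
		assert(total + h->size >= total);	/* no wraparound */
		total += h->size;
	}
	return (total);
}
```

Any loop over the same list that lacks this check would treat a parked marker as a real buffer, which is why every walker needs the test and not only the one that inserts markers.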

Joerg Goltermann

2014-04-15 06:42:51 UTC

Post by Matthew Ahrens
That looks like a good start. I think we would also need to ignore the
markers anywhere else that we iterate over l2ad_buflist. e.g.
l2arc_write_done, l2arc_write_buffers.

Good catch. I only found the two usages you already mentioned.

--- a/usr/src/uts/common/fs/zfs/arc.c
+++ b/usr/src/uts/common/fs/zfs/arc.c
@@ -4267,6 +4267,10 @@ l2arc_write_done(zio_t *zio)
 		ab_prev = list_prev(buflist, ab);
 		abl2 = ab->b_l2hdr;

+		/* ignore markers */
+		if (ab->b_spa == 0)
+			continue;
+
 		/*
 		 * Release the temporary compressed buffer as soon as possible.
 		 */
@@ -4446,6 +4450,8 @@ l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
 	arc_buf_hdr_t *ab, *ab_prev;
 	kmutex_t *hash_lock;
 	uint64_t taddr;
+	arc_buf_hdr_t marker = { 0 };
+	int count = 0;

 	buflist = dev->l2ad_buflist;

@@ -4477,6 +4483,27 @@ top:
 	for (ab = list_tail(buflist); ab; ab = ab_prev) {
 		ab_prev = list_prev(buflist, ab);

+		/* ignore markers */
+		if (ab->b_spa == 0)
+			continue;
+
+		/*
+		 * It may take a long time to evict all the bufs requested.
+		 * To avoid blocking all l2arc activity, periodically drop
+		 * the l2arc_buflist_mtx and give other threads a chance to
+		 * run before reacquiring the lock.
+		 */
+		if (count++ > arc_evict_iterations) {
+			list_insert_after(buflist, ab, &marker);
+			mutex_exit(&l2arc_buflist_mtx);
+			kpreempt(KPREEMPT_SYNC);
+			mutex_enter(&l2arc_buflist_mtx);
+			ab_prev = list_prev(buflist, &marker);
+			list_remove(buflist, &marker);
+			count = 0;
+			continue;
+		}
+
 		hash_lock = HDR_LOCK(ab);
 		if (!mutex_tryenter(hash_lock)) {
 			/*
@@ -4751,6 +4778,10 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
 		l2arc_buf_hdr_t *l2hdr;
 		uint64_t buf_sz;

+		/* ignore markers */
+		if (ab->b_spa == 0)
+			continue;
+
 		/*
 		 * We shouldn't need to lock the buffer here, since we flagged
 		 * it as ARC_L2_WRITING in the previous step, but we must take

Matthew Ahrens

2014-04-15 15:15:47 UTC

Looks good to me. Please test thoroughly and submit an RTI.

--matt
