Discussion:
[OpenZFS Developer] Pool running out of space by just being open
Steven Hartland via illumos-zfs
2014-10-15 04:45:13 UTC
Permalink
----- Original Message -----
From: "Steven Hartland"
I've been investigating an issue for a user who was seeing
his pool import hang after upgrading on FreeBSD. After
digging around it turned out the issue was due to lack
of free space on the pool.
As the pool imports it writes hence requiring space but the
pool has so little space this was failing. The IO being
a required IO it retries, but obviously fails again
resulting the the pool being suspened hence the hang.
With the pool suspended during import it still holds the
pool lock so all attempts to query the status also hang,
which is one problem as the user can't tell why the hang
has occured.
During the debugging I mounted the pool read only and
sent a copy to another empty pool, which resulted in ~1/2
capacity being recovered. This seemed odd but I dismissed
it at the time.
The machine was then left, with the pool not being accessed,
however I just recieved an alert from our monitoring for
a pool failure. On looking I now see the new pool I created
with 2 write errors and no free space. So just having the
pool mounted, with no access happening, has managed to use
the remain 2GB on the 4GB pool.
Has anyone seen this before or has any ideas what might
be going on?
zdb -m -m -m -m <pool> shows allocation to transactions e.g.
metaslab 100 offset c8000000 spacemap 1453 free 0
segments 0 maxsize 0 freepct 0%
On-disk histogram: fragmentation 0
[ 0] ALLOC: txg 417, pass 2
[ 1] A range: 00c8000000-00c8001600 size: 001600
[ 2] ALLOC: txg 417, pass 3
[ 3] A range: 00c8001600-00c8003a00 size: 002400
[ 4] ALLOC: txg 418, pass 2
[ 5] A range: 00c8003a00-00c8005000 size: 001600
[ 6] ALLOC: txg 418, pass 3
[ 7] A range: 00c8005000-00c8006600 size: 001600
[ 8] ALLOC: txg 419, pass 2
[ 9] A range: 00c8006600-00c8007c00 size: 001600
[ 10] ALLOC: txg 419, pass 3
I tried destroying the pool and that hung, presumably due to
IO being suspended after the out of space errors.
After bisecting the kernel changes the commit which seems
to be causing this is:
https://svnweb.freebsd.org/base?view=revision&revision=268650
https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b7781941f0d5d1
which correlates to:
https://github.com/illumos/illumos-gate/commit/7fd05ac4dec0c343d2f68f310d3718b715ecfbaf

I've checked the two make the same changes so there doesn't
seem to have been a downstream merge issue, at least not on
this specific commit.

My test now consists of:
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
4. zpool list -p -o free zfs 5

With this commit present, free reduces every 5 seconds until
the pool is out of space. Without it after at most 3 reductions
the pool settles and no further free space reduction is seen.

I've also found that creating the pool without async_destroy
enabled also prevents the issue.

An image that shows the final result of the leak can be found
here:
http://www.ijs.si/usr/mark/bsd/

On FreeBSD this image stalls on import unless imported readonly.
Once imported I used the following to create the test image
used above:
zfs send -R zfs/***@auto-2014-09-19_22.30 >test.zfs

Copying in the zfs illumos list to get more eyeballs given it
seems to be a quite serious issue.

Regards
Steve
Rich via illumos-zfs
2014-10-15 05:10:39 UTC
Permalink
So, the commit you referenced is the commit that introduced the
behavior of "in these error cases, we can't figure out what's in use
reliably, so 'leak the space', log that we leaked it, and that way we
don't corrupt things at the cost of space".

So I would posit that the problem may be one of a leaking/corruption
condition being triggered that simply wasn't noted prior to said
commit, rather than being a logical flaw in that commit?

You've helpfully included all the resources needed for someone to
replicate this on their own, which should help a lot, presuming that
people don't immediately find some counter-examples to your train of
logic (which doesn't seem flawed to my eyes, at the moment).

Have you actually tried this on any non-FBSD platforms yet to see if
this is as reliably replicable there yet?

I'm honestly kind of terrified by the prospect of a nice poisonous
dataset that can cause your pool to leak space until it's gone, so
I'll probably test this on some testbed as soon as I set one up that
has nothing else being tested on it I'd mind losing...

- Rich

On Wed, Oct 15, 2014 at 12:45 AM, Steven Hartland via illumos-zfs
----- Original Message ----- From: "Steven Hartland"
I've been investigating an issue for a user who was seeing
his pool import hang after upgrading on FreeBSD. After
digging around it turned out the issue was due to lack
of free space on the pool.
As the pool imports it writes hence requiring space but the
pool has so little space this was failing. The IO being
a required IO it retries, but obviously fails again
resulting the the pool being suspened hence the hang.
With the pool suspended during import it still holds the
pool lock so all attempts to query the status also hang,
which is one problem as the user can't tell why the hang
has occured.
During the debugging I mounted the pool read only and
sent a copy to another empty pool, which resulted in ~1/2
capacity being recovered. This seemed odd but I dismissed
it at the time.
The machine was then left, with the pool not being accessed,
however I just recieved an alert from our monitoring for
a pool failure. On looking I now see the new pool I created
with 2 write errors and no free space. So just having the
pool mounted, with no access happening, has managed to use
the remain 2GB on the 4GB pool.
Has anyone seen this before or has any ideas what might
be going on?
zdb -m -m -m -m <pool> shows allocation to transactions e.g.
metaslab 100 offset c8000000 spacemap 1453 free 0
segments 0 maxsize 0 freepct 0%
On-disk histogram: fragmentation 0
[ 0] ALLOC: txg 417, pass 2
[ 1] A range: 00c8000000-00c8001600 size: 001600
[ 2] ALLOC: txg 417, pass 3
[ 3] A range: 00c8001600-00c8003a00 size: 002400
[ 4] ALLOC: txg 418, pass 2
[ 5] A range: 00c8003a00-00c8005000 size: 001600
[ 6] ALLOC: txg 418, pass 3
[ 7] A range: 00c8005000-00c8006600 size: 001600
[ 8] ALLOC: txg 419, pass 2
[ 9] A range: 00c8006600-00c8007c00 size: 001600
[ 10] ALLOC: txg 419, pass 3
I tried destroying the pool and that hung, presumably due to
IO being suspended after the out of space errors.
After bisecting the kernel changes the commit which seems
https://svnweb.freebsd.org/base?view=revision&revision=268650
https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b7781941f0d5d1
https://github.com/illumos/illumos-gate/commit/7fd05ac4dec0c343d2f68f310d3718b715ecfbaf
I've checked the two make the same changes so there doesn't
seem to have been a downstream merge issue, at least not on
this specific commit.
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
4. zpool list -p -o free zfs 5
With this commit present, free reduces every 5 seconds until
the pool is out of space. Without it after at most 3 reductions
the pool settles and no further free space reduction is seen.
I've also found that creating the pool without async_destroy
enabled also prevents the issue.
An image that shows the final result of the leak can be found
http://www.ijs.si/usr/mark/bsd/
On FreeBSD this image stalls on import unless imported readonly.
Once imported I used the following to create the test image
Copying in the zfs illumos list to get more eyeballs given it
seems to be a quite serious issue.
Regards
Steve
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/24536931-5d25148d
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-10-15 13:43:01 UTC
Permalink
----- Original Message -----
Post by Rich via illumos-zfs
So, the commit you referenced is the commit that introduced the
behavior of "in these error cases, we can't figure out what's in use
reliably, so 'leak the space', log that we leaked it, and that way we
don't corrupt things at the cost of space".
So I would posit that the problem may be one of a leaking/corruption
condition being triggered that simply wasn't noted prior to said
commit, rather than being a logical flaw in that commit?
I need to investigate more to confirm that.
Post by Rich via illumos-zfs
You've helpfully included all the resources needed for someone to
replicate this on their own, which should help a lot, presuming that
people don't immediately find some counter-examples to your train of
logic (which doesn't seem flawed to my eyes, at the moment).
Have you actually tried this on any non-FBSD platforms yet to see if
this is as reliably replicable there yet?
Unfortunately I only have access to FreeBSD installs here, my old illumos
box is very out of date and it took me days to get working last time ;-)
Post by Rich via illumos-zfs
I'm honestly kind of terrified by the prospect of a nice poisonous
dataset that can cause your pool to leak space until it's gone, so
I'll probably test this on some testbed as soon as I set one up that
has nothing else being tested on it I'd mind losing...
Indeed!

FreeBSD is currently in the final stages of 10.1 release process so
I've already informed the Release Engineering team of this issue as it
appears to result in unrecoverable pool. For the original reporterit
resulted in a unbootable machine, as it had leaked that much the pool
didnt have enough space to perform the import on boot.

For reference in my test it leaked over 2GB in under 12hours.

Regards
Steve
Matthew Ahrens via illumos-zfs
2014-10-15 06:04:47 UTC
Permalink
On Tue, Oct 14, 2014 at 9:45 PM, Steven Hartland via illumos-zfs <
----- Original Message ----- From: "Steven Hartland"
I've been investigating an issue for a user who was seeing
his pool import hang after upgrading on FreeBSD. After
digging around it turned out the issue was due to lack
of free space on the pool.
As the pool imports it writes hence requiring space but the
pool has so little space this was failing. The IO being
a required IO it retries, but obviously fails again
resulting the the pool being suspened hence the hang.
With the pool suspended during import it still holds the
pool lock so all attempts to query the status also hang,
which is one problem as the user can't tell why the hang
has occured.
During the debugging I mounted the pool read only and
sent a copy to another empty pool, which resulted in ~1/2
capacity being recovered. This seemed odd but I dismissed
it at the time.
The machine was then left, with the pool not being accessed,
however I just recieved an alert from our monitoring for
a pool failure. On looking I now see the new pool I created
with 2 write errors and no free space. So just having the
pool mounted, with no access happening, has managed to use
the remain 2GB on the 4GB pool.
Has anyone seen this before or has any ideas what might
be going on?
zdb -m -m -m -m <pool> shows allocation to transactions e.g.
metaslab 100 offset c8000000 spacemap 1453 free 0
segments 0 maxsize 0 freepct 0%
On-disk histogram: fragmentation 0
[ 0] ALLOC: txg 417, pass 2
[ 1] A range: 00c8000000-00c8001600 size: 001600
[ 2] ALLOC: txg 417, pass 3
[ 3] A range: 00c8001600-00c8003a00 size: 002400
[ 4] ALLOC: txg 418, pass 2
[ 5] A range: 00c8003a00-00c8005000 size: 001600
[ 6] ALLOC: txg 418, pass 3
[ 7] A range: 00c8005000-00c8006600 size: 001600
[ 8] ALLOC: txg 419, pass 2
[ 9] A range: 00c8006600-00c8007c00 size: 001600
[ 10] ALLOC: txg 419, pass 3
I tried destroying the pool and that hung, presumably due to
IO being suspended after the out of space errors.
After bisecting the kernel changes the commit which seems
https://svnweb.freebsd.org/base?view=revision&revision=268650
https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b778
1941f0d5d1
https://github.com/illumos/illumos-gate/commit/
7fd05ac4dec0c343d2f68f310d3718b715ecfbaf
I've checked the two make the same changes so there doesn't
seem to have been a downstream merge issue, at least not on
this specific commit.
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
4. zpool list -p -o free zfs 5
With this commit present, free reduces every 5 seconds until
the pool is out of space. Without it after at most 3 reductions
the pool settles and no further free space reduction is seen.
How did you generate the send stream? Please include the exact command you
used. How big is the send stream? Can you provide the send stream?

What is the output of "zfs recv"? Does it fail?

What is the "freeing" property?

What is the "leaked" property?

If you write to the pool (before it runs completely out of space), does the
free space jump back up to where it should be?

I tried a simple test on my system, sending one pool to another and
receiving with the flags you mentioned (on DelphixOS, which is not too
different from illumos trunk) and the problem did not reproduce. After the
receive, the pool is idle (no i/o observed with iostat, no change to "free"
property).

I took a look at the pool image you linked. The "leaked" space is consumed
by deferred frees (zdb -bb will tell you this). My first guess would be
that something is causing us to not process the deferred free list when we
should. i.e. spa_sync() is not calling spa_sync_deferred_frees(), but it
is subsequently syncing real changes to disk and adding more deferred
frees. This code looks somewhat fragile -- we are essentially checking
"will there be changes this txg", but the answer is not enforced. There
were some changes to dsl_scan_active() as part of the commit you mentioned,
which could be involved.

--matt
I've also found that creating the pool without async_destroy
enabled also prevents the issue.
An image that shows the final result of the leak can be found
http://www.ijs.si/usr/mark/bsd/
On FreeBSD this image stalls on import unless imported readonly.
Once imported I used the following to create the test image
Copying in the zfs illumos list to get more eyeballs given it
seems to be a quite serious issue.
Regards
Steve
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/
21635000-ebd1d460
Modify Your Subscription: https://www.listbox.com/
member/?&
Powered by Listbox: http://www.listbox.com
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-10-15 13:34:30 UTC
Permalink
----- Original Message -----
From: "Matthew Ahrens" <***@delphix.com>

snip.
Post by Matthew Ahrens via illumos-zfs
How did you generate the send stream? Please include the exact command you
used.
How big is the send stream?
1.7G
Post by Matthew Ahrens via illumos-zfs
Can you provide the send stream?
http://blog.multiplay.co.uk/dropzone/freebsd/test.zfs.gz
Post by Matthew Ahrens via illumos-zfs
What is the output of "zfs recv"? Does it fail?
No output, succeeds without issue.
Post by Matthew Ahrens via illumos-zfs
What is the "freeing" property?
0
Post by Matthew Ahrens via illumos-zfs
What is the "leaked" property?
0
Post by Matthew Ahrens via illumos-zfs
If you write to the pool (before it runs completely out of space), does the
free space jump back up to where it should be?
Nope
Post by Matthew Ahrens via illumos-zfs
I tried a simple test on my system, sending one pool to another and
receiving with the flags you mentioned (on DelphixOS, which is not too
different from illumos trunk) and the problem did not reproduce. After the
receive, the pool is idle (no i/o observed with iostat, no change to "free"
property).
Thanks Matt be interested to see of the steam file above changes that.
Post by Matthew Ahrens via illumos-zfs
I took a look at the pool image you linked. The "leaked" space is consumed
by deferred frees (zdb -bb will tell you this). My first guess would be
that something is causing us to not process the deferred free list when we
should. i.e. spa_sync() is not calling spa_sync_deferred_frees(), but it
is subsequently syncing real changes to disk and adding more deferred
frees. This code looks somewhat fragile -- we are essentially checking
"will there be changes this txg", but the answer is not enforced. There
were some changes to dsl_scan_active() as part of the commit you mentioned,
which could be involved.
The limited tests I did last night after finding the problem revision seem to agree
with that as dtrace was showing that without this revision dsl_scan_sync was
always processing the same txg but with it txg was continually incrementing e.g.

dtrace -n 'dsl_scan_sync:entry {printf("%lu < %lu", ((dsl_pool_t *)arg0)->dp_scan->scn_restart_txg, ((dmu_tx_t *)arg1)->tx_txg);}'
dtrace: description 'dsl_scan_sync:entry ' matched 1 probe
CPU ID FUNCTION:NAME
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 485
20 36713 dsl_scan_sync:entry 0 < 486
20 36713 dsl_scan_sync:entry 0 < 486
20 36713 dsl_scan_sync:entry 0 < 486
20 36713 dsl_scan_sync:entry 0 < 486
20 36713 dsl_scan_sync:entry 0 < 487
20 36713 dsl_scan_sync:entry 0 < 487
20 36713 dsl_scan_sync:entry 0 < 487
20 36713 dsl_scan_sync:entry 0 < 487
20 36713 dsl_scan_sync:entry 0 < 487
20 36713 dsl_scan_sync:entry 0 < 487
15 36713 dsl_scan_sync:entry 0 < 488
15 36713 dsl_scan_sync:entry 0 < 488
15 36713 dsl_scan_sync:entry 0 < 488
15 36713 dsl_scan_sync:entry 0 < 488
Steven Hartland via illumos-zfs
2014-10-15 18:28:44 UTC
Permalink
----- Original Message -----
From: "Steven Hartland via illumos-zfs" <***@lists.illumos.org>
snip
Post by Steven Hartland via illumos-zfs
Thanks Matt be interested to see of the steam file above changes that.
Post by Matthew Ahrens via illumos-zfs
I took a look at the pool image you linked. The "leaked" space is consumed
by deferred frees (zdb -bb will tell you this). My first guess would be
that something is causing us to not process the deferred free list when we
should. i.e. spa_sync() is not calling spa_sync_deferred_frees(), but it
is subsequently syncing real changes to disk and adding more deferred
frees. This code looks somewhat fragile -- we are essentially checking
"will there be changes this txg", but the answer is not enforced. There
were some changes to dsl_scan_active() as part of the commit you mentioned,
which could be involved.
The limited tests I did last night after finding the problem revision seem to agree
with that as dtrace was showing that without this revision dsl_scan_sync was
always processing the same txg but with it txg was continually incrementing e.g.
snip

Ok so looking through this code the one thing that doesn't make sense to me is
the assignment of cn->scn_async_stalled in dsl_scan_sync.

I believe scn_async_stalled should only be set if we made no progress, however
in its also current form its being set even when bptree_is_empty indicates everything
actually completed.

The reason that scn_visited_this_txg is zero in this case is that in traverse_visitbp
we're triggering the early return case: bp->blk_birth <= td->td_min_txg which means
td->td_func (dsl_scan_free_block_cb) isn't called hence scn_visited_this_txg is never
incremented.

This early return was added by:
https://github.com/freebsd/freebsd/commit/ede49d4106ad9e6d34a2592322e48604f72605bd

In this case we appear to be destorying the temporary clones created by the recieve as
shown by the pool internal history:

2014-10-15.15:09:17 [txg:9] receive zfs/test/ROOT (48)
2014-10-15.15:09:17 [txg:10] set zfs/test/ROOT (48) $hasrecvd=
2014-10-15.15:09:18 [txg:31] finish receiving zfs/test/ROOT (48) snap=auto-2014-08-16_04.00
2014-10-15.15:09:18 [txg:31] snapshot zfs/test/***@auto-2014-08-16_04.00 (94)
2014-10-15.15:09:18 [txg:32] receive zfs/test/ROOT/%recv (100)
2014-10-15.15:09:18 [txg:33] finish receiving zfs/test/ROOT/%recv (100) snap=auto-2014-08-17_04.00
2014-10-15.15:09:18 [txg:33] clone swap zfs/test/ROOT/%recv (100) parent=ROOT
2014-10-15.15:09:18 [txg:33] snapshot zfs/test/***@auto-2014-08-17_04.00 (106)
2014-10-15.15:09:18 [txg:33] destroy zfs/test/ROOT/%recv (100)
2014-10-15.15:09:18 [txg:34] receive zfs/test/ROOT/%recv (112)
2014-10-15.15:09:18 [txg:35] finish receiving zfs/test/ROOT/%recv (112) snap=auto-2014-08-19_04.00
2014-10-15.15:09:18 [txg:35] clone swap zfs/test/ROOT/%recv (112) parent=ROOT
2014-10-15.15:09:18 [txg:35] snapshot zfs/test/***@auto-2014-08-19_04.00 (117)
2014-10-15.15:09:18 [txg:35] destroy zfs/test/ROOT/%recv (112)


It seems seems like the actual intention of this code was the following, which
also seems to prevent the leak:

svn diff -x -p
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c (revision 273115)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c (working copy)
@@ -1476,12 +1476,6 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
"traverse_dataset_destroyed()", err);
}

- /*
- * If we didn't make progress, mark the async destroy as
- * stalled, so that we will not initiate a spa_sync() on
- * its behalf.
- */
- scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);

if (bptree_is_empty(dp->dp_meta_objset, dp->dp_bptree_obj)) {
/* finished; deactivate async destroy feature */
@@ -1495,6 +1489,13 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
dp->dp_bptree_obj, tx));
dp->dp_bptree_obj = 0;
scn->scn_async_destroying = B_FALSE;
+ } else {
+ /*
+ * If we didn't make progress, mark the async destroy as
+ * stalled, so that we will not initiate a spa_sync() on
+ * its behalf.
+ */
+ scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);
}
}
if (scn->scn_visited_this_txg) {

So my questions are:
1. Is this actual intention of the code?
2. Is there a way to cleanup the leaked deferred frees on an existing pool?
3. Is there a way to import a pool "writable" which has so many frees its now full
and failing to complete the import with ENOSPC?
4. Should these deferred free's have ever multiplied like this or should we be preventing
the multiple storage of deferred frees?

Regards
Steve
Matthew Ahrens via illumos-zfs
2014-10-15 19:20:19 UTC
Permalink
----- Original Message ----- From: "Steven Hartland via illumos-zfs" <
snip
Post by Steven Hartland via illumos-zfs
Thanks Matt be interested to see of the steam file above changes that.
I took a look at the pool image you linked. The "leaked" space is
Post by Matthew Ahrens via illumos-zfs
consumed
by deferred frees (zdb -bb will tell you this). My first guess would be
that something is causing us to not process the deferred free list when we
should. i.e. spa_sync() is not calling spa_sync_deferred_frees(), but it
is subsequently syncing real changes to disk and adding more deferred
frees. This code looks somewhat fragile -- we are essentially checking
"will there be changes this txg", but the answer is not enforced. There
were some changes to dsl_scan_active() as part of the commit you mentioned,
which could be involved.
The limited tests I did last night after finding the problem revision seem to agree
with that as dtrace was showing that without this revision dsl_scan_sync was
always processing the same txg but with it txg was continually incrementing e.g.
snip
Ok so looking through this code the one thing that doesn't make sense to me is
the assignment of cn->scn_async_stalled in dsl_scan_sync.
I believe scn_async_stalled should only be set if we made no progress, however
in its also current form its being set even when bptree_is_empty indicates everything
actually completed.
The reason that scn_visited_this_txg is zero in this case is that in traverse_visitbp
we're triggering the early return case: bp->blk_birth <= td->td_min_txg which means
td->td_func (dsl_scan_free_block_cb) isn't called hence
scn_visited_this_txg is never
incremented.
https://github.com/freebsd/freebsd/commit/ede49d4106ad9e6d34a2592322e486
04f72605bd
In this case we appear to be destorying the temporary clones created by the recieve as
2014-10-15.15:09:17 [txg:9] receive zfs/test/ROOT (48) 2014-10-15.15:09:17
[txg:10] set zfs/test/ROOT (48) $hasrecvd=
2014-10-15.15:09:18 [txg:31] finish receiving zfs/test/ROOT (48) snap=auto-2014-08-16_04.00
(94) 2014-10-15.15:09:18 [txg:32] receive zfs/test/ROOT/%recv (100)
2014-10-15.15:09:18 [txg:33] finish receiving zfs/test/ROOT/%recv (100)
snap=auto-2014-08-17_04.00
2014-10-15.15:09:18 [txg:33] clone swap zfs/test/ROOT/%recv (100) parent=ROOT
(106) 2014-10-15.15:09:18 [txg:33] destroy zfs/test/ROOT/%recv (100)
2014-10-15.15:09:18 [txg:34] receive zfs/test/ROOT/%recv (112)
2014-10-15.15:09:18 [txg:35] finish receiving zfs/test/ROOT/%recv (112)
snap=auto-2014-08-19_04.00
2014-10-15.15:09:18 [txg:35] clone swap zfs/test/ROOT/%recv (112) parent=ROOT
(117) 2014-10-15.15:09:18 [txg:35] destroy zfs/test/ROOT/%recv (112)
It seems seems like the actual intention of this code was the following, which
svn diff -x -p
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c (revision 273115)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c (working copy)
@@ -1476,12 +1476,6 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
"traverse_dataset_destroyed()", err);
}
- /*
- * If we didn't make progress, mark the async destroy as
- * stalled, so that we will not initiate a spa_sync() on
- * its behalf.
- */
- scn->scn_async_stalled = (scn->scn_visited_this_txg == 0);
if (bptree_is_empty(dp->dp_meta_objset,
dp->dp_bptree_obj)) {
/* finished; deactivate async destroy feature */
@@ -1495,6 +1489,13 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
dp->dp_bptree_obj, tx));
dp->dp_bptree_obj = 0;
scn->scn_async_destroying = B_FALSE;
+ } else {
+ /*
+ * If we didn't make progress, mark the async destroy as
+ * stalled, so that we will not initiate a spa_sync() on
+ * its behalf.
+ */
+ scn->scn_async_stalled =
(scn->scn_visited_this_txg == 0);
}
}
if (scn->scn_visited_this_txg) {
1. Is this actual intention of the code?
Yes. If there there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.

spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude that
no changes will be synced this txg, so we don't spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again. Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys. This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed. The dirty buffer then needs to be written out.

I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (but it
the free space continued reducing again after that).

In addition to the patch you have above, I think we need to do some work to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.

2. Is there a way to cleanup the leaked deferred frees on an existing pool?
If it hasn't totally run out of space, making any changes in the pool will
clean up the deferred frees (e.g. create a new file), but the free space
will start decreasing again. Rebooting (or export/import the pool) will
clear scn_async_stalled so the problem should go away entirely.
3. Is there a way to import a pool "writable" which has so many frees its now full
and failing to complete the import with ENOSPC?
Not that I know of, unfortunately.

4. Should these deferred free's have ever multiplied like this or should we
be preventing
the multiple storage of deferred frees?
I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-10-16 00:11:15 UTC
Permalink
----- Original Message -----
From: "Matthew Ahrens" <***@delphix.com>
snip..
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
1. Is this actual intention of the code?
Yes. If there there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.
In this case there are entries in it but they are all for items with result in early
return.

Do you think for the time being the patch I listed is good enough?
Post by Matthew Ahrens via illumos-zfs
spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude that
no changes will be synced this txg, so we don't spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again. Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys. This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed. The dirty buffer then needs to be written out.
I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (but it
the free space continued reducing again after that).
Ah when you asked about that I was only looking at the incremental and as
you say it continues to reduce, so yes this does reset.

If spa_sync_deferred_frees() is responsible for cleaning up that free space
I'm curious why, as thats called early on in the import process, said import
stalls later on for that test pool image?
Post by Matthew Ahrens via illumos-zfs
In addition to the patch you have above, I think we need to do some work to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.
2. Is there a way to cleanup the leaked deferred frees on an existing pool?
If it hasn't totally run out of space, making any changes in the pool will
clean up the deferred frees (e.g. create a new file), but the free space
will start decreasing again. Rebooting (or export/import the pool) will
clear scn_async_stalled so the problem should go away entirely.
Confirmed.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
3. Is there a way to import a pool "writable" which has so many frees its
now full and failing to complete the import with ENOSPC?
Not that I know of, unfortunately.
Really need to have some hard reserved pool space so we can always
recover from things like this.
Post by Matthew Ahrens via illumos-zfs
4. Should these deferred free's have ever multiplied like this or should we
Post by Steven Hartland via illumos-zfs
be preventing the multiple storage of deferred frees?
I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?
Well as no new async destroys are being done yet deferred frees are
increasing in size do they not contain the same information, in this
case nothing?

Regards
Steve
Matthew Ahrens via illumos-zfs
2014-10-16 01:09:11 UTC
Permalink
Post by Steven Hartland via illumos-zfs
snip..
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
1. Is this actual intention of the code?
Yes. If there there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.
In this case there are entries in it but they are all for items with result in early
return.
Yes -- there is a tree in the bptree_obj, but the tree has nothing to
traverse.
Post by Steven Hartland via illumos-zfs
Do you think for the time being the patch I listed is good enough?
spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude
Post by Matthew Ahrens via illumos-zfs
that
no changes will be synced this txg, so we don't spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again. Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys. This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed. The dirty buffer then needs to be written out.
I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (but it
the free space continued reducing again after that).
Ah when you asked about that I was only looking at the incremental and as
you say it continues to reduce, so yes this does reset.
If spa_sync_deferred_frees() is responsible for cleaning up that free space
I'm curious why, as thats called early on in the import process, said import
stalls later on for that test pool image?
In addition to the patch you have above, I think we need to do some work
Post by Matthew Ahrens via illumos-zfs
to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.
2. Is there a way to cleanup the leaked deferred frees on an existing pool?
If it hasn't totally run out of space, making any changes in the pool will
clean up the deferred frees (e.g. create a new file), but the free space
will start decreasing again. Rebooting (or export/import the pool) will
clear scn_async_stalled so the problem should go away entirely.
Confirmed.
3. Is there a way to import a pool "writable" which has so many frees its
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
now full and failing to complete the import with ENOSPC?
Not that I know of, unfortunately.
Really need to have some hard reserved pool space so we can always
recover from things like this.
4. Should these deferred free's have ever multiplied like this or should
Post by Matthew Ahrens via illumos-zfs
we
Post by Steven Hartland via illumos-zfs
be preventing the multiple storage of deferred frees?
I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?
Well as no new async destroys are being done yet deferred frees are
increasing in size do they not contain the same information, in this
case nothing?
They do not contain the same information. Async destroys and deferred
frees are two totally different things (though both involve remembering
things to free later). The deferred free list is a SPA concept; it's
appended to basically every TXG.

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Matthew Ahrens via illumos-zfs
2014-10-16 01:09:54 UTC
Permalink
Post by Steven Hartland via illumos-zfs
snip..
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
1. Is this actual intention of the code?
Yes. If there there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.
In this case there are entries in it but they are all for items with result in early
return.
Do you think for the time being the patch I listed is good enough?
Yes. I'll also work on making the spa_sync() code less fragile.

--matt
Post by Steven Hartland via illumos-zfs
spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude
Post by Matthew Ahrens via illumos-zfs
that
no changes will be synced this txg, so we don't spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again. Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys. This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed. The dirty buffer then needs to be written out.
I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (but it
the free space continued reducing again after that).
Ah when you asked about that I was only looking at the incremental and as
you say it continues to reduce, so yes this does reset.
If spa_sync_deferred_frees() is responsible for cleaning up that free space
I'm curious why, as thats called early on in the import process, said import
stalls later on for that test pool image?
In addition to the patch you have above, I think we need to do some work
Post by Matthew Ahrens via illumos-zfs
to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.
2. Is there a way to cleanup the leaked deferred frees on an existing pool?
If it hasn't totally run out of space, making any changes in the pool will
clean up the deferred frees (e.g. create a new file), but the free space
will start decreasing again. Rebooting (or export/import the pool) will
clear scn_async_stalled so the problem should go away entirely.
Confirmed.
3. Is there a way to import a pool "writable" which has so many frees its
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
now full and failing to complete the import with ENOSPC?
Not that I know of, unfortunately.
Really need to have some hard reserved pool space so we can always
recover from things like this.
4. Should these deferred free's have ever multiplied like this or should
Post by Matthew Ahrens via illumos-zfs
we
Post by Steven Hartland via illumos-zfs
be preventing the multiple storage of deferred frees?
I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?
Well as no new async destroys are being done yet deferred frees are
increasing in size do they not contain the same information, in this
case nothing?
Regards
Steve
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Richard Kojedzinszky via illumos-zfs
2014-10-16 14:45:26 UTC
Permalink
Dear Matt, Steve,

I've ran zdb -bb on my pool, and have some lekaed space reported by it.

What should be done with that space? Could that be reclaimed somehow, or
should the whole pool be re-created?

Regards,
----- Original Message ----- From: "Matthew Ahrens"
snip..
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
1. Is this actual intention of the code?
Yes. If there there is an entry in the dp_bptree_obj (and therefore
SPA_FEATURE_ASYNC_DESTROY is active), but the tree is effectively empty
(there's nothing to traverse in it), then we could incorrectly set
scn_async_stalled, leading to this behavior.
In this case there are entries in it but they are all for items with result in early
return.
Do you think for the time being the patch I listed is good enough?
Post by Matthew Ahrens via illumos-zfs
spa_sync() will see that dsl_scan_active() is FALSE, and thus conclude that
no changes will be synced this txg, so we don't
spa_sync_deferred_frees().
However, dsl_scan_sync() will see that scn_async_stalled is set, and
therefore try again. Though we don't actually try to process the bptree
again, because SPA_FEATURE_ASYNC_DESTROY is no longer active, we do
bpobj_iterate() to process potential background snapshot destroys.
This
always dirties the bpobj's bonus buffer, even if nothing is actually
changed. The dirty buffer then needs to be written out.
I was able to reproduce the bug on illumos using your send stream.
However, touching a file in the pool caused spa_sync_deferred_frees() to be
called, thus resetting the free space back to where it should be (but it
the free space continued reducing again after that).
Ah when you asked about that I was only looking at the incremental and as
you say it continues to reduce, so yes this does reset.
If spa_sync_deferred_frees() is responsible for cleaning up that free space
I'm curious why, as thats called early on in the import process, said import
stalls later on for that test pool image?
Post by Matthew Ahrens via illumos-zfs
In addition to the patch you have above, I think we need to do some work to
make the logic around deferred frees more robust, to prevent similar bugs
from creeping in in the future.
2. Is there a way to cleanup the leaked deferred frees on an existing pool?
If it hasn't totally run out of space, making any changes in the pool will
clean up the deferred frees (e.g. create a new file), but the free space
will start decreasing again. Rebooting (or export/import the pool) will
clear scn_async_stalled so the problem should go away entirely.
Confirmed.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
3. Is there a way to import a pool "writable" which has so many frees its
now full and failing to complete the import with ENOSPC?
Not that I know of, unfortunately.
Really need to have some hard reserved pool space so we can always
recover from things like this.
Post by Matthew Ahrens via illumos-zfs
4. Should these deferred free's have ever multiplied like this or should we
Post by Steven Hartland via illumos-zfs
be preventing the multiple storage of deferred frees?
I'm not sure what you mean. There can normally be more than one block
pointer in the deferred free list. Are you suggesting that the same
blkptr_t has been added to the deferred free list multiple times?
Well as no new async destroys are being done yet deferred frees are
increasing in size do they not contain the same information, in this
case nothing?
Regards
Steve
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/25402478-0858cafa
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
--
Richard Kojedzinszky
Matthew Ahrens via illumos-zfs
2014-10-15 06:19:10 UTC
Permalink
On Tue, Oct 14, 2014 at 9:45 PM, Steven Hartland via illumos-zfs <
----- Original Message ----- From: "Steven Hartland"
I've been investigating an issue for a user who was seeing
his pool import hang after upgrading on FreeBSD. After
digging around it turned out the issue was due to lack
of free space on the pool.
As the pool imports it writes hence requiring space but the
pool has so little space this was failing. The IO being
a required IO it retries, but obviously fails again
resulting the the pool being suspened hence the hang.
With the pool suspended during import it still holds the
pool lock so all attempts to query the status also hang,
which is one problem as the user can't tell why the hang
has occured.
During the debugging I mounted the pool read only and
sent a copy to another empty pool, which resulted in ~1/2
capacity being recovered. This seemed odd but I dismissed
it at the time.
The machine was then left, with the pool not being accessed,
however I just recieved an alert from our monitoring for
a pool failure. On looking I now see the new pool I created
with 2 write errors and no free space. So just having the
pool mounted, with no access happening, has managed to use
the remain 2GB on the 4GB pool.
Has anyone seen this before or has any ideas what might
be going on?
zdb -m -m -m -m <pool> shows allocation to transactions e.g.
metaslab 100 offset c8000000 spacemap 1453 free 0
segments 0 maxsize 0 freepct 0%
On-disk histogram: fragmentation 0
[ 0] ALLOC: txg 417, pass 2
[ 1] A range: 00c8000000-00c8001600 size: 001600
[ 2] ALLOC: txg 417, pass 3
[ 3] A range: 00c8001600-00c8003a00 size: 002400
[ 4] ALLOC: txg 418, pass 2
[ 5] A range: 00c8003a00-00c8005000 size: 001600
[ 6] ALLOC: txg 418, pass 3
[ 7] A range: 00c8005000-00c8006600 size: 001600
[ 8] ALLOC: txg 419, pass 2
[ 9] A range: 00c8006600-00c8007c00 size: 001600
[ 10] ALLOC: txg 419, pass 3
I tried destroying the pool and that hung, presumably due to
IO being suspended after the out of space errors.
After bisecting the kernel changes the commit which seems
https://svnweb.freebsd.org/base?view=revision&revision=268650
https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b778
1941f0d5d1
https://github.com/illumos/illumos-gate/commit/
7fd05ac4dec0c343d2f68f310d3718b715ecfbaf
I've checked the two make the same changes so there doesn't
seem to have been a downstream merge issue, at least not on
this specific commit.
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
Does the problem reproduce if you are not doing a "blow away" receive, e.g.
"zfs recv -du tpool/test"?
4. zpool list -p -o free zfs 5
With this commit present, free reduces every 5 seconds until
the pool is out of space. Without it after at most 3 reductions
the pool settles and no further free space reduction is seen.
I've also found that creating the pool without async_destroy
enabled also prevents the issue.
An image that shows the final result of the leak can be found
http://www.ijs.si/usr/mark/bsd/
On FreeBSD this image stalls on import unless imported readonly.
Once imported I used the following to create the test image
Copying in the zfs illumos list to get more eyeballs given it
seems to be a quite serious issue.
Regards
Steve
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/
21635000-ebd1d460
Modify Your Subscription: https://www.listbox.com/
member/?&
Powered by Listbox: http://www.listbox.com
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Steven Hartland via illumos-zfs
2014-10-15 13:33:57 UTC
Permalink
----- Original Message -----
From: "Matthew Ahrens" <***@delphix.com>
snip.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
Does the problem reproduce if you are not doing a "blow away" receive, e.g.
"zfs recv -du tpool/test"?
Yes.
Continue reading on narkive:
Loading...