Discussion:
Large file deletion wedges whole pool, just dbuf_range_free()?
Dan McDonald via illumos-zfs
2014-04-29 14:32:39 UTC
Permalink
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB pool. That delete effectively wedged the pool for 20 minutes. No writes happened (e.g. "touch foo" locked), etc. etc.

Am I safe in assuming that's just dbuf_range_free() doing its thing and that additionally, the ZIL is full, blocking on the txg that frees all the blocks? Or am I missing something more subtle, or is this an actual bug beyond the performance of dbuf_range_free()?
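One thing I can ask them to run next time, assuming the fbt entry/return probes for dbuf_free_range() are there (i.e. it isn't inlined), is a quick DTrace sketch to see how much wall-clock time is actually spent inside dbuf_free_range():

# assumes fbt probes for dbuf_free_range() exist (function not inlined)
dtrace -n 'fbt::dbuf_free_range:entry { self->t = timestamp; }
    fbt::dbuf_free_range:return /self->t/ { @["dbuf_free_range time (ns)"] = quantize(timestamp - self->t); self->t = 0; }'

That would at least separate time burned inside dbuf_free_range() itself from time spent waiting on indirect-block reads.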

Thanks,
Dan
George Wilson via illumos-zfs
2014-04-29 14:53:37 UTC
Permalink
Dan,

During this time did you see reads to the pool taking place? It's
possible that all the time is being spent in dbuf_free_range() but some
zfs kernel stacks from the time of the hang would be very helpful.

Thanks,
George
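For reference, something like this should grab them, assuming mdb -k is usable on the box:

# show kernel thread stacks that include frames from the zfs module
echo "::stacks -m zfs" | mdb -k

::stacks coalesces threads with identical stacks, so a couple of snapshots taken during the hang should make it clear what everything is blocked on.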
Post by Dan McDonald via illumos-zfs
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB pool. That delete effectively wedged the pool for 20 minutes. No writes happened (e.g. "touch foo" locked), etc. etc.
Am I safe in assuming that's just dbuf_range_free() doing its thing and that additionally, the ZIL is full, blocking on the txg that frees all the blocks? Or am I missing something more subtle, or is this an actual bug beyond the performance of dbuf_range_free()?
Thanks,
Dan
George Wilson via illumos-zfs
2014-04-29 14:56:15 UTC
Permalink
Post by Dan McDonald via illumos-zfs
Dan,
During this time did you see reads to the pool taking place? It's possible that all the time is being spent in dbuf_free_range() but some zfs kernel stacks from the time of the hang would be very helpful.
Knut encountered the problem (and it's dinnertime in Norway right now), so he can speak to it more. I asked him about ability to read as well. He said he could "ls" a directory, but wouldn't that just be ARC cached anyway?
Dan
Either iostat or zpool iostat would be useful.

- George
Dan McDonald via illumos-zfs
2014-04-29 14:58:35 UTC
Permalink
Post by George Wilson via illumos-zfs
Dan,
During this time did you see reads to the pool taking place? It's possible that all the time is being spent in dbuf_free_range() but some zfs kernel stacks from the time of the hang would be very helpful.
Knut encountered the problem (and it's dinnertime in Norway right now), so he can speak to it more. I asked him about ability to read as well. He said he could "ls" a directory, but wouldn't that just be ARC cached anyway?
Dan
Either iostat or zpool iostat would be useful.
He did mail me zpool iostat...


***@ruge:/# zpool iostat storage0 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
storage0 146T 180T 439 6.12K 3.26M 82.4M
storage0 146T 180T 88 0 534K 0
storage0 146T 180T 63 0 387K 0
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
storage0 146T 180T 5 18.3K 37.3K 146M
storage0 146T 180T 0 58.2K 0 463M
storage0 146T 180T 0 58.1K 0 463M
storage0 146T 180T 0 57.8K 0 458M
storage0 146T 180T 0 57.9K 0 462M
storage0 146T 180T 0 58.0K 0 459M
storage0 146T 180T 0 60.4K 0 481M
storage0 146T 180T 55 8.09K 187K 86.5M
storage0 146T 180T 1.16K 34.6K 3.41M 262M
storage0 146T 180T 0 59.5K 0 473M
storage0 146T 180T 35 41.3K 67.7K 326M
storage0 146T 180T 443 1.50K 796K 10.2M
storage0 146T 180T 3.89K 96 2.57M 5.01M
storage0 145T 182T 19.3K 3.21K 9.72M 23.0M
storage0 144T 182T 24.0K 5 12.3M 167K
storage0 144T 182T 25.7K 60 14.6M 4.31M
storage0 144T 182T 28.5K 58 14.8M 4.16M
storage0 144T 182T 23.1K 65 15.8M 4.26M
storage0 144T 182T 13.4K 3.23K 7.79M 21.1M
storage0 144T 182T 15.3K 445 8.36M 33.4M
storage0 144T 182T 20.4K 9.11K 10.3M 98.0M
storage0 144T 182T 16.0K 1.06K 8.04M 23.4M
storage0 144T 182T 8.95K 180 4.52M 13.4M
storage0 144T 182T 5.66K 9.11K 2.86M 98.2M
storage0 144T 182T 5.69K 1.22K 2.88M 36.0M
storage0 144T 182T 8.80K 167 4.44M 12.2M
storage0 144T 182T 11.6K 8.83K 5.82M 86.1M
storage0 144T 182T 2.33K 54 1.18M 3.30M
storage0 144T 182T 0 11 0 95.5K
storage0 144T 182T 0 0 0 0
storage0 144T 182T 545 0 4.18M 0
storage0 144T 182T 228 1.92K 1.68M 10.7M
storage0 144T 182T 0 1 0 43.8K
storage0 144T 182T 0 57 0 1.05M
storage0 144T 182T 0 33 0 633K
storage0 144T 182T 0 21 0 434K
storage0 144T 182T 0 1.25K 2.49K 3.69M
storage0 144T 182T 0 0 0 7.95K
storage0 144T 182T 0 10 0 167K
storage0 144T 182T 0 0 0 0
storage0 144T 182T 0 0 0 0
storage0 144T 182T 0 814 2.49K 2.42M
storage0 144T 182T 0 299 2.49K 22.3M
storage0 144T 182T 0 431 0 33.0M
storage0 144T 182T 0 1.34K 0 19.3M
storage0 144T 182T 0 7.59K 0 78.6M
storage0 144T 182T 0 433 0 33.0M


Dan
George Wilson via illumos-zfs
2014-04-29 14:59:37 UTC
Permalink
Post by Dan McDonald via illumos-zfs
Post by George Wilson via illumos-zfs
Dan,
During this time did you see reads to the pool taking place? It's possible that all the time is being spent in dbuf_free_range() but some zfs kernel stacks from the time of the hang would be very helpful.
Knut encountered the problem (and it's dinnertime in Norway right now), so he can speak to it more. I asked him about ability to read as well. He said he could "ls" a directory, but wouldn't that just be ARC cached anyway?
Dan
Either iostat or zpool iostat would be useful.
He did mail me zpool iostat...
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
storage0 146T 180T 439 6.12K 3.26M 82.4M
storage0 146T 180T 88 0 534K 0
storage0 146T 180T 63 0 387K 0
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
storage0 146T 180T 5 18.3K 37.3K 146M
storage0 146T 180T 0 58.2K 0 463M
storage0 146T 180T 0 58.1K 0 463M
storage0 146T 180T 0 57.8K 0 458M
storage0 146T 180T 0 57.9K 0 462M
storage0 146T 180T 0 58.0K 0 459M
storage0 146T 180T 0 60.4K 0 481M
storage0 146T 180T 55 8.09K 187K 86.5M
storage0 146T 180T 1.16K 34.6K 3.41M 262M
storage0 146T 180T 0 59.5K 0 473M
storage0 146T 180T 35 41.3K 67.7K 326M
storage0 146T 180T 443 1.50K 796K 10.2M
storage0 146T 180T 3.89K 96 2.57M 5.01M
storage0 145T 182T 19.3K 3.21K 9.72M 23.0M
storage0 144T 182T 24.0K 5 12.3M 167K
storage0 144T 182T 25.7K 60 14.6M 4.31M
storage0 144T 182T 28.5K 58 14.8M 4.16M
storage0 144T 182T 23.1K 65 15.8M 4.26M
storage0 144T 182T 13.4K 3.23K 7.79M 21.1M
storage0 144T 182T 15.3K 445 8.36M 33.4M
storage0 144T 182T 20.4K 9.11K 10.3M 98.0M
storage0 144T 182T 16.0K 1.06K 8.04M 23.4M
storage0 144T 182T 8.95K 180 4.52M 13.4M
storage0 144T 182T 5.66K 9.11K 2.86M 98.2M
storage0 144T 182T 5.69K 1.22K 2.88M 36.0M
storage0 144T 182T 8.80K 167 4.44M 12.2M
storage0 144T 182T 11.6K 8.83K 5.82M 86.1M
storage0 144T 182T 2.33K 54 1.18M 3.30M
storage0 144T 182T 0 11 0 95.5K
storage0 144T 182T 0 0 0 0
storage0 144T 182T 545 0 4.18M 0
storage0 144T 182T 228 1.92K 1.68M 10.7M
storage0 144T 182T 0 1 0 43.8K
storage0 144T 182T 0 57 0 1.05M
storage0 144T 182T 0 33 0 633K
storage0 144T 182T 0 21 0 434K
storage0 144T 182T 0 1.25K 2.49K 3.69M
storage0 144T 182T 0 0 0 7.95K
storage0 144T 182T 0 10 0 167K
storage0 144T 182T 0 0 0 0
storage0 144T 182T 0 0 0 0
storage0 144T 182T 0 814 2.49K 2.42M
storage0 144T 182T 0 299 2.49K 22.3M
storage0 144T 182T 0 431 0 33.0M
storage0 144T 182T 0 1.34K 0 19.3M
storage0 144T 182T 0 7.59K 0 78.6M
storage0 144T 182T 0 433 0 33.0M
Dan
So this shows both reads and writes taking place.

- George
Dan McDonald via illumos-zfs
2014-04-29 15:04:59 UTC
Permalink
Post by George Wilson via illumos-zfs
Post by Dan McDonald via illumos-zfs
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
So this shows both reads and writes taking place.
The "like this for approx. 20 mins" section, then, is a lot of small reads, correct?

Knut --> The "read" column was nonzero during those 20 minutes, then?

Dan
Knut Erik Sørvik
2014-04-29 15:29:49 UTC
Permalink
Hi.

Yes, for the entire 20-30 minutes, there were only ~80 read IOPS (~500K/s), but NO writes.

An interesting part was when it started writing again:

storage0 146T 180T 55 8.09K 187K 86.5M
storage0 146T 180T 1.16K 34.6K 3.41M 262M
storage0 146T 180T 0 59.5K 0 473M
storage0 146T 180T 35 41.3K 67.7K 326M
storage0 146T 180T 443 1.50K 796K 10.2M
storage0 146T 180T 3.89K 96 2.57M 5.01M
storage0 145T 182T 19.3K 3.21K 9.72M 23.0M
storage0 144T 182T 24.0K 5 12.3M 167K

Note the alloc going from 146T to 144T; that is where the 2TB was freed.

Knut Erik
Post by Dan McDonald via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Dan McDonald via illumos-zfs
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
So this shows both reads and writes taking place.
The "like this for approx. 20 mins" section, then, is a lot of small reads, correct?
Knut --> The "read" column was nonzero during those 20 minutes, then?
Dan
Surya prakki via illumos-zfs
2014-04-30 03:34:02 UTC
Permalink
Could you please ask him to collect the kernel stacks during the zero-write periods; 3 to 4 rounds of it would be good. Generally, no writes means no new txg is getting opened, which could indicate that the sync thread is stuck syncing a single txg, which in turn suggests it ended up doing a lot of disk I/O to sync that txg.
-surya
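Something along these lines (again assuming mdb -k works there) would capture a few rounds during the zero-write window:

# grab 4 rounds of zfs kernel stacks, 30 seconds apart
for i in 1 2 3 4; do
    echo "::stacks -m zfs" | mdb -k > /var/tmp/zfs-stacks.$i
    sleep 30
done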


On Tue, Apr 29, 2014 at 8:34 PM, Dan McDonald via illumos-zfs <
Post by Dan McDonald via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Dan McDonald via illumos-zfs
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
So this shows both reads and writes taking place.
The "like this for approx. 20 mins" section, then, is a lot of small reads, correct?
Knut --> The "read" column was nonzero during those 20 minutes, then?
Dan
Dan McDonald via illumos-zfs
2014-04-29 14:55:40 UTC
Permalink
Post by George Wilson via illumos-zfs
Dan,
During this time did you see reads to the pool taking place? It's possible that all the time is being spent in dbuf_free_range() but some zfs kernel stacks from the time of the hang would be very helpful.
Knut encountered the problem (and it's dinnertime in Norway right now), so he can speak to it more. I asked him about ability to read as well. He said he could "ls" a directory, but wouldn't that just be ARC cached anyway?

Dan
Matthew Ahrens via illumos-zfs
2014-04-29 16:31:39 UTC
Permalink
There are two potential performance problems. For a 1TB file w/8K
recordsize, we have to read about 1 million indirect blocks. If you can
only do 100 random iops, that would take about 3 hours.
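Spelling out the arithmetic behind that estimate (assuming 16K indirect blocks holding 128 block pointers each, which is what the ~1 million figure implies):

1TB / 8K recordsize ≈ 134 million data blocks
134 million / 128 pointers per indirect block ≈ 1 million level-1 indirect blocks
1 million random reads / 100 iops ≈ 10,000 seconds, or a bit under 3 hours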

There is also a CPU usage issue which is related to dbuf_free_range().
This can be worked around with a patch like the following, but you will
probably need to add a flag to dnode_evict_dbufs() to make it only evict
level-0 dbufs.

diff --git a/usr/src/uts/common/fs/zfs/dmu.c b/usr/src/uts/common/fs/zfs/dmu.c
index 09b09cf..c64758d 100644
--- a/usr/src/uts/common/fs/zfs/dmu.c
+++ b/usr/src/uts/common/fs/zfs/dmu.c
@@ -638,6 +638,9 @@ dmu_free_long_range_impl(objset_t *os, dnode_t *dn, uint64_t
if (offset >= object_size)
return (0);

+ if (length == DMU_OBJECT_END && offset == 0)
+ dnode_evict_dbufs(dn);
+
if (length == DMU_OBJECT_END || offset + length > object_size)
length = object_size - offset;

On Tue, Apr 29, 2014 at 7:32 AM, Dan McDonald via illumos-zfs <
Post by Dan McDonald via illumos-zfs
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB
pool. That delete effectively wedged the pool for 20 minutes. No writes
happened (e.g. "touch foo" locked), etc. etc.
Am I safe in assuming that's just dbuf_range_free() doing its thing and
that additionally, the ZIL is full, blocking on the txg that frees all the
blocks? Or am I missing something more subtle, or is this an actual bug
beyond the performance of dbuf_range_free()?
Thanks,
Dan
Knut Erik Sørvik via illumos-zfs
2014-04-29 18:44:49 UTC
Permalink
Hi.

During the time of «no writes» the CPU utilization was very low.

Knut Erik

From: Matthew Ahrens <***@delphix.com>
Date: Tuesday, 29 April 2014 18:31
To: illumos-zfs <***@lists.illumos.org>, Dan McDonald <***@omniti.com>
Cc: Knut Erik Sørvik <***@teknograd.no>
Subject: Re: [zfs] Large file deletion wedges whole pool, just dbuf_range_free()?

There are two potential performance problems. For a 1TB file w/8K recordsize, we have to read about 1 million indirect blocks. If you can only do 100 random iops, that would take about 3 hours.

There is also a CPU usage issue which is related to dbuf_free_range(). This can be worked around with a patch like the following, but you will probably need to add a flag to dnode_evict_dbufs() to make it only evict level-0 dbufs.

diff --git a/usr/src/uts/common/fs/zfs/dmu.c b/usr/src/uts/common/fs/zfs/dmu.c
index 09b09cf..c64758d 100644
--- a/usr/src/uts/common/fs/zfs/dmu.c
+++ b/usr/src/uts/common/fs/zfs/dmu.c
@@ -638,6 +638,9 @@ dmu_free_long_range_impl(objset_t *os, dnode_t *dn, uint64_t
if (offset >= object_size)
return (0);

+ if (length == DMU_OBJECT_END && offset == 0)
+ dnode_evict_dbufs(dn);
+
if (length == DMU_OBJECT_END || offset + length > object_size)
length = object_size - offset;

On Tue, Apr 29, 2014 at 7:32 AM, Dan McDonald via illumos-zfs <***@lists.illumos.org> wrote:
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB pool. That delete effectively wedged the pool for 20 minutes. No writes happened (e.g. "touch foo" locked), etc. etc.

Am I safe in assuming that's just dbuf_range_free() doing its thing and that additionally, the ZIL is full, blocking on the txg that frees all the blocks? Or am I missing something more subtle, or is this an actual bug beyond the performance of dbuf_range_free()?

Thanks,
Dan



Stefan Ring via illumos-zfs
2014-04-29 18:59:07 UTC
Permalink
On Tue, Apr 29, 2014 at 8:44 PM, Knut Erik Sørvik via illumos-zfs
Post by Knut Erik Sørvik via illumos-zfs
Hi.
During the time of «no writes» the CPU utilization was very low.
Knut Erik
Risking pointing out the obvious here, but doesn’t this sound exactly
like the number one dedup issue, enormous RAM requirements?

Approximating the block size rather conservatively at 100k, for 320 TB
and 320 bytes per block entry, we get almost exactly 1 TB. Do you have
1 TB of RAM?

Of course, this only matters if you have (or had) dedup enabled.
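To spell that out: 320TB / 100K ≈ 3.2 billion block entries, and 3.2 billion x 320 bytes ≈ 1TB of dedup table.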
Knut Erik Sørvik via illumos-zfs
2014-04-29 19:05:52 UTC
Permalink
Hi.

Never used dedupe.

kes
Post by Stefan Ring via illumos-zfs
On Tue, Apr 29, 2014 at 8:44 PM, Knut Erik Sørvik via illumos-zfs
Post by Knut Erik Sørvik via illumos-zfs
Hi.
During the time of «no writes» the CPU utilization was very low.
Knut Erik
Risking pointing out the obvious here, but doesn’t this sound exactly
like the number one dedup issue, enormous RAM requirements?
Approximating the block size rather conservatively at 100k, for 320 TB
and 320 bytes per block entry, we get almost exactly 1 TB. Do you have
1 TB of RAM?
Of course, this only matters if you have (or had) dedup enabled.
Andriy Gapon via illumos-zfs
2014-04-30 06:07:12 UTC
Permalink
Post by Matthew Ahrens via illumos-zfs
There are two potential performance problems. For a 1TB file w/8K recordsize,
we have to read about 1 million indirect blocks. If you can only do 100 random
iops, that would take about 3 hours.
Matt,

just to clarify, do you speak of the code in dmu_tx_hold_free() that calls
dmu_tx_check_ioerr()?
We noticed that that code produced lots of read I/O when removing a large file.
Post by Matthew Ahrens via illumos-zfs
There is also a CPU usage issue which is related to dbuf_free_range(). This can
be worked around with a patch like the following, but you will probably need to
add a flag to dnode_evict_dbufs() to make it only evict level-0 dbufs.
diff --git a/usr/src/uts/common/fs/zfs/dmu.c b/usr/src/uts/common/fs/zfs/dmu.c
index 09b09cf..c64758d 100644
--- a/usr/src/uts/common/fs/zfs/dmu.c
+++ b/usr/src/uts/common/fs/zfs/dmu.c
@@ -638,6 +638,9 @@ dmu_free_long_range_impl(objset_t *os, dnode_t *dn, uint64_t
if (offset >= object_size)
return (0);
+ if (length == DMU_OBJECT_END && offset == 0)
+ dnode_evict_dbufs(dn);
+
if (length == DMU_OBJECT_END || offset + length > object_size)
length = object_size - offset;
On Tue, Apr 29, 2014 at 7:32 AM, Dan McDonald via illumos-zfs
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB pool.
That delete effectively wedged the pool for 20 minutes. No writes happened
(e.g. "touch foo" locked), etc. etc.
Am I safe in assuming that's just dbuf_range_free() doing its thing and that
additionally, the ZIL is full, blocking on the txg that frees all the
blocks? Or am I missing something more subtle, or is this an actual bug
beyond the performance of dbuf_range_free()?
Thanks,
Dan
--
Andriy Gapon
Matthew Ahrens via illumos-zfs
2014-04-30 17:24:35 UTC
Permalink
Post by Andriy Gapon via illumos-zfs
Post by Matthew Ahrens via illumos-zfs
There are two potential performance problems. For a 1TB file w/8K recordsize, we have to read about 1 million indirect blocks. If you can only do 100 random iops, that would take about 3 hours.
Matt,
just to clarify, do you speak of the code in dmu_tx_hold_free() that calls dmu_tx_check_ioerr()?
We noticed that that code produced lots of read I/O when removing a large file.
Yes, that is the code that reads the indirect blocks. The indirect blocks
are also needed from syncing context where we actually free the blocks
(dnode_sync_free_range()), but they should have already been cached by the
reads from dmu_tx_hold_free().

--matt
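If you want to see that read traffic on a live system, counting entries to dmu_tx_check_ioerr() while the unlink is pending should show it; something like this, assuming the fbt probe for it is available (i.e. the function isn't inlined):

# count calls to dmu_tx_check_ioerr(), which issues those indirect-block reads
dtrace -n 'fbt::dmu_tx_check_ioerr:entry { @calls = count(); }
    tick-10s { printa("dmu_tx_check_ioerr calls: %@d\n", @calls); clear(@calls); }'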
Post by Andriy Gapon via illumos-zfs
Post by Matthew Ahrens via illumos-zfs
There is also a CPU usage issue which is related to dbuf_free_range(). This can be worked around with a patch like the following, but you will probably need to add a flag to dnode_evict_dbufs() to make it only evict level-0 dbufs.
diff --git a/usr/src/uts/common/fs/zfs/dmu.c b/usr/src/uts/common/fs/zfs/dmu.c
index 09b09cf..c64758d 100644
--- a/usr/src/uts/common/fs/zfs/dmu.c
+++ b/usr/src/uts/common/fs/zfs/dmu.c
@@ -638,6 +638,9 @@ dmu_free_long_range_impl(objset_t *os, dnode_t *dn, uint64_t
if (offset >= object_size)
return (0);
+ if (length == DMU_OBJECT_END && offset == 0)
+ dnode_evict_dbufs(dn);
+
if (length == DMU_OBJECT_END || offset + length > object_size)
length = object_size - offset;
On Tue, Apr 29, 2014 at 7:32 AM, Dan McDonald via illumos-zfs
A customer (who is Cc:ed here) deleted a single 2TB file from a 320TB pool. That delete effectively wedged the pool for 20 minutes. No writes happened (e.g. "touch foo" locked), etc. etc.
Am I safe in assuming that's just dbuf_range_free() doing its thing and that additionally, the ZIL is full, blocking on the txg that frees all the blocks? Or am I missing something more subtle, or is this an actual bug beyond the performance of dbuf_range_free()?
Thanks,
Dan
--
Andriy Gapon

Knut Erik Sørvik via illumos-zfs
2014-04-29 18:40:07 UTC
Permalink
Hi.

Yes, for the entire 20-30 minutes, there were only ~80 read IOPS (~500K/s), but NO writes.

An interesting part was when it started writing again:

storage0 146T 180T 55 8.09K 187K 86.5M
storage0 146T 180T 1.16K 34.6K 3.41M 262M
storage0 146T 180T 0 59.5K 0 473M
storage0 146T 180T 35 41.3K 67.7K 326M
storage0 146T 180T 443 1.50K 796K 10.2M
storage0 146T 180T 3.89K 96 2.57M 5.01M
storage0 145T 182T 19.3K 3.21K 9.72M 23.0M
storage0 144T 182T 24.0K 5 12.3M 167K

Note the alloc going from 146T to 144T; that is where the 2TB was freed.

Knut Erik
Post by Dan McDonald via illumos-zfs
Post by George Wilson via illumos-zfs
Post by Dan McDonald via illumos-zfs
storage0 146T 180T 49 0 301K 0
storage0 146T 180T 81 0 492K 0
storage0 146T 180T 84 0 510K 0
[… Like this for approx 20 mins…]
storage0 146T 180T 87 0 526K 0
storage0 146T 180T 42 0 262K 0
So this shows both reads and writes taking place.
The "like this for approx. 20 mins" section, then, is a lot of small reads, correct?
Knut --> The "read" column was nonzero during those 20 minutes, then?
Dan