Toll of dedup on WORM datasets

Discussion:

Jim Klimov

2014-03-26 22:44:46 UTC

Hello all,

It has been discussed that dedup has a relatively high price in
requirements to RAM and performance (on HDDs at least) due to its
needs to traverse the DDT when writing data onto the pool.

However, I wonder if there are any such losses when a deduped
dataset is read and mostly/never written? One usecase would be the
OE images, such as the global zone and local zone roots, which may
be upgraded and contain new revisions of identical files (trickery
with ZFS cloning to save space is pretty much defeated when you
want to upgrade an horde of zones, especially to do that regularly).
Assume that it is not a problem to either seperate the individual
data files (logs, userdata, etc.) into different datasets, or just
to disable dedup after the upgrades are completed; however, amount
of possible RAM is an issue on a particular (legacy) server.

Would reads from such a deduped zoneroot incur DDT traversals,
or is the dedup "price" paid only once during the writing of image
updates, and reads are as quick and use as many IOs and much RAM
as ordinary non-deduped dataset reads?

Also, what is the current situation on L1ARC and L2ARC caching
of deduped blocks - are they cached once per their DVA (i.e. might
deduplication of zoneroots actually save some RAM so precious on
the legacy server)?

And finally, may dedup be used on an rpool and/or on the rootfs
dataset? Are there any objections from kernel or grub, or some
good or bad experiences about this?

Thanks,
//Jim Klimov

Matthew Ahrens

2014-03-27 01:30:50 UTC

Permalink

The DDT isn't traversed (i.e. we don't look through all entries in the DDT.
However, each block written or freed requires that a block of the DDT be
modified, to adjust the entry's refcount. This is the cause of the RAM
usage and performance impact.

Post by Jim Klimov
However, I wonder if there are any such losses when a deduped
dataset is read and mostly/never written?

Reads do not require accessing the DDT. Thus reading a dedup'ed block is
just as fast as reading a non-deduped block.

Post by Jim Klimov
Also, what is the current situation on L1ARC and L2ARC caching
of deduped blocks - are they cached once per their DVA (i.e. might
deduplication of zoneroots actually save some RAM so precious on
the legacy server)?

Deduped blocks should only be cached once, assuming you have the fix for
3145 "single-copy arc", which was integrated to illumos in September 2012.

Post by Jim Klimov
And finally, may dedup be used on an rpool and/or on the rootfs
dataset? Are there any objections from kernel or grub, or some
good or bad experiences about this?

I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

--matt

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Richard Yao

2014-03-27 03:22:37 UTC

Permalink

The DDT isn't traversed (i.e. we don't look through all entries in the DDT. However, each block written or freed requires that a block of the DDT be modified, to adjust the entry's refcount. This is the cause of the RAM usage and performance impact.

I suspect could use the ZIL to accelerate this. a cursory look suggested that we deduplicate before ZIL. Looking into it in more detail is on my todo list, but I am not sure when I will get to it.

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

surya

2014-03-27 07:01:26 UTC

Permalink

Post by Jim Klimov
Hello all,
It has been discussed that dedup has a relatively high price in
requirements to RAM and performance (on HDDs at least) due to its
needs to traverse the DDT when writing data onto the pool.
The DDT isn't traversed (i.e. we don't look through all entries in the
DDT. However, each block written or freed requires that a block of
the DDT be modified, to adjust the entry's refcount. This is the
cause of the RAM usage and performance impact.
However, I wonder if there are any such losses when a deduped
dataset is read and mostly/never written?
Reads do not require accessing the DDT. Thus reading a dedup'ed block
is just as fast as reading a non-deduped block.
Also, what is the current situation on L1ARC and L2ARC caching
of deduped blocks - are they cached once per their DVA (i.e. might
deduplication of zoneroots actually save some RAM so precious on
the legacy server)?
Deduped blocks should only be cached once, assuming you have the fix
for 3145 "single-copy arc", which was integrated to illumos in
September 2012.
And finally, may dedup be used on an rpool and/or on the rootfs
dataset? Are there any objections from kernel or grub, or some
good or bad experiences about this?
I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

I am not sure what you would gain from having dedup enabled on rootfs -
where I expect only unique blocks to be present mostly [and you would
still have to pay for the DDT]. 'zpool status -D' gives a feel for how
unique or otherwise the data is.
-surya

Post by Jim Klimov
--matt
*illumos-zfs* | Archives
<https://www.listbox.com/member/archive/182191/=now>
<https://www.listbox.com/member/archive/rss/182191/25372515-edbd5004>
| Modify
<https://www.listbox.com/member/?&>
Your Subscription [Powered by Listbox] <http://www.listbox.com>

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

Jim Klimov

2014-03-27 08:51:57 UTC

Permalink

Post by surya

the

Post by Jim Klimov
DDT. However, each block written or freed requires that a block of
the DDT be modified, to adjust the entry's refcount. This is the
cause of the RAM usage and performance impact.
However, I wonder if there are any such losses when a deduped
dataset is read and mostly/never written?
Reads do not require accessing the DDT. Thus reading a dedup'ed

block

Post by Jim Klimov
is just as fast as reading a non-deduped block.
Also, what is the current situation on L1ARC and L2ARC caching
of deduped blocks - are they cached once per their DVA (i.e.

might

Post by Jim Klimov
deduplication of zoneroots actually save some RAM so precious on
the legacy server)?
Deduped blocks should only be cached once, assuming you have the fix
for 3145 "single-copy arc", which was integrated to illumos in
September 2012.
And finally, may dedup be used on an rpool and/or on the rootfs
dataset? Are there any objections from kernel or grub, or some
good or bad experiences about this?
I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/22497542-d75cd9d9
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Well, sometimes i am not convinced that all of the file data is unique i.e. when files are replaced by packaged upgrades. Some blocks may remain the same...

However i guess the better rationale might be to keep zoneroots along with the os root, since they would likely be updated with the same data at once, and thus dedup well. And caching the filedata once would be a benefit on constrained systems.

Though yes, i am in favor of keeping rpools simple, and zoneroots/zonedata on a separate pool with appropriate write-strategies per data set.

Also, wouldn't 'zpool status -D' estimate the whole pool instead of the few chosen datasets where i expect good ratios and plan to point-enable the dedup?

Thanks,
//Jim
--
Typos courtesy of K-9 Mail on my Samsung Android

surya

2014-03-28 05:06:56 UTC

Permalink

Post by Jim Klimov

Post by surya

the

block

Post by Jim Klimov
is just as fast as reading a non-deduped block.
Also, what is the current situation on L1ARC and L2ARC caching
of deduped blocks - are they cached once per their DVA (i.e.

might

Well, sometimes i am not convinced that all of the file data is unique i.e. when files are replaced by packaged upgrades. Some blocks may remain the same...
However i guess the better rationale might be to keep zoneroots along with the os root, since they would likely be updated with the same data at once, and thus dedup well. And caching the filedata once would be a benefit on constrained systems.
Though yes, i am in favor of keeping rpools simple, and zoneroots/zonedata on a separate pool with appropriate write-strategies per data set.
Also, wouldn't 'zpool status -D' estimate the whole pool instead of the few chosen datasets where i expect good ratios and plan to point-enable the dedup?

Hm...yes; Time to dtrace the individual datasets and retrieve the sizes
of DDT_CLASS_DUPLICATE/UNIQUE objects in the short term; Long term, I
see an rfe to extend this cmd.
thanks,
surya

Post by Jim Klimov
Thanks,
//Jim
--
Typos courtesy of K-9 Mail on my Samsung Android

Richard Elling

2014-03-28 05:24:37 UTC

Permalink

Post by Jim Klimov
Also, wouldn't 'zpool status -D' estimate the whole pool instead of the few chosen datasets where i expect good ratios and plan to point-enable the dedup?

"zpool status -D" shows the current dedup table. This looks very similar to the simulated dedup table you
can get from "zdb -S". The difference is that zpool status is real, zdb is a darn good estimate.
-- richard

--

***@RichardElling.com
+1-760-896-4422

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com

surya

2014-03-28 05:32:33 UTC

Permalink

Post by Richard Elling

Post by Jim Klimov
Also, wouldn't 'zpool status -D' estimate the whole pool instead of
the few chosen datasets where i expect good ratios and plan to
point-enable the dedup?

Oh..thats right; Even though dedupe can be enabled on per dataset level,
the dedupe table is per pool;
So the gain per dataset can't be retrieved imo.

Post by Richard Elling
--
+1-760-896-4422
*illumos-zfs* | Archives
<https://www.listbox.com/member/archive/182191/=now>
<https://www.listbox.com/member/archive/rss/182191/25372515-edbd5004>
| Modify
<https://www.listbox.com/member/?&>
Your Subscription [Powered by Listbox] <http://www.listbox.com>

Bill Sommerfeld

2014-03-27 19:09:57 UTC

Permalink

Post by surya

Post by Matthew Ahrens
I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

If you keep multiple BE's around there is opportunity for dedup.

I have a couple systems with root on ssd and dedup (and copies=2) enabled. I
see dedup ratios of 1.24x (with 6 BE's - I should do some housecleaning) and
1.14x (with two) :

dedup: DDT entries 372725, size 481 on disk, 329 in core

bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 314K 15.6G 8.19G 16.3G 314K 15.6G 8.19G 16.3G
2 39.9K 2.04G 1.03G 2.06G 89.3K 4.68G 2.31G 4.61G
4 6.31K 352M 214M 429M 28.5K 1.49G 925M 1.81G
8 3.62K 60.8M 18.6M 37.1M 31.9K 714M 208M 416M
16 189 2.86M 1.19M 2.39M 3.76K 55.0M 22.6M 45.2M
32 36 609K 185K 370K 1.47K 27.3M 7.52M 15.0M
64 41 1.80M 1.16M 2.33M 3.70K 194M 126M 252M
128 8 154K 99K 198K 1.32K 24.8M 15.9M 31.7M
256 3 2K 2K 4K 997 652K 652K 1.27M
512 2 1K 1K 2K 1.36K 696K 696K 1.36M
1K 2 1K 1K 2K 2.14K 1.07M 1.07M 2.14M
Total 364K 18.1G 9.45G 18.8G 478K 22.8G 11.8G 23.5G

dedup: DDT entries 250332, size 463 on disk, 316 in core

bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 215K 9.26G 4.09G 8.18G 215K 9.26G 4.09G 8.18G
2 24.2K 886M 446M 892M 51.5K 1.81G 922M 1.80G
4 4.42K 69.9M 24.4M 48.9M 21.2K 307M 109M 218M
8 684 44.8M 11.6M 23.1M 6.78K 452M 117M 234M
16 107 894K 296K 591K 2.13K 17.8M 6.06M 12.1M
32 32 328K 164K 328K 1.33K 13.6M 6.66M 13.3M
64 14 35K 18.5K 37K 1.21K 2.56M 1.43M 2.86M
128 2 30K 18K 36K 399 4.84M 2.92M 5.83M
256 3 2K 2K 4K 1.11K 700K 700K 1.37M
512 2 1K 1K 2K 1.18K 602K 602K 1.18M
Total 244K 10.2G 4.56G 9.12G 302K 11.9G 5.23G 10.5G

surya

2014-03-28 05:02:38 UTC

Permalink

Thanks Bill for confirming as well as for the data point.

Post by Bill Sommerfeld

Post by surya

Post by Matthew Ahrens
I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

If you keep multiple BE's around there is opportunity for dedup.
I have a couple systems with root on ssd and dedup (and copies=2) enabled. I
see dedup ratios of 1.24x (with 6 BE's - I should do some housecleaning) and
dedup: DDT entries 372725, size 481 on disk, 329 in core
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 314K 15.6G 8.19G 16.3G 314K 15.6G 8.19G 16.3G
2 39.9K 2.04G 1.03G 2.06G 89.3K 4.68G 2.31G 4.61G
4 6.31K 352M 214M 429M 28.5K 1.49G 925M 1.81G
8 3.62K 60.8M 18.6M 37.1M 31.9K 714M 208M 416M
16 189 2.86M 1.19M 2.39M 3.76K 55.0M 22.6M 45.2M
32 36 609K 185K 370K 1.47K 27.3M 7.52M 15.0M
64 41 1.80M 1.16M 2.33M 3.70K 194M 126M 252M
128 8 154K 99K 198K 1.32K 24.8M 15.9M 31.7M
256 3 2K 2K 4K 997 652K 652K 1.27M
512 2 1K 1K 2K 1.36K 696K 696K 1.36M
1K 2 1K 1K 2K 2.14K 1.07M 1.07M 2.14M
Total 364K 18.1G 9.45G 18.8G 478K 22.8G 11.8G 23.5G
dedup: DDT entries 250332, size 463 on disk, 316 in core
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 215K 9.26G 4.09G 8.18G 215K 9.26G 4.09G 8.18G
2 24.2K 886M 446M 892M 51.5K 1.81G 922M 1.80G
4 4.42K 69.9M 24.4M 48.9M 21.2K 307M 109M 218M
8 684 44.8M 11.6M 23.1M 6.78K 452M 117M 234M
16 107 894K 296K 591K 2.13K 17.8M 6.06M 12.1M
32 32 328K 164K 328K 1.33K 13.6M 6.66M 13.3M
64 14 35K 18.5K 37K 1.21K 2.56M 1.43M 2.86M
128 2 30K 18K 36K 399 4.84M 2.92M 5.83M
256 3 2K 2K 4K 1.11K 700K 700K 1.37M
512 2 1K 1K 2K 1.18K 602K 602K 1.18M
Total 244K 10.2G 4.56G 9.12G 302K 11.9G 5.23G 10.5G
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/25372515-edbd5004
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Ian Collins

2014-03-28 06:43:06 UTC

Permalink

Post by Bill Sommerfeld

Post by surya

Post by Matthew Ahrens
I haven't tried this myself, but I believe it should work fine, since
reading dedup'ed blocks is not a special case.

If you keep multiple BE's around there is opportunity for dedup.

Or lots of zone roots.

--
Ian.

Jim Klimov

2014-03-28 16:53:46 UTC

Permalink

Post by surya
'zpool status -D' gives a feel for how
unique or otherwise the data is.

Interesting... I ran this on an older (pre-illumos) pool imported
into oi_151a8, and got this:

# zpool status -D pond
pool: pond
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool
can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not
support feature flags.
...
errors: No known data errors

dedup: no DDT entries

So yes, the pool is too old for dedup altogether, but does this forbid
ZFS from trying to count identical checksum hits?

Now waiting for "the pretty good estimate" from a "zdb -DD pond" call...

//Jim

Jim Klimov

2014-03-28 17:22:02 UTC

Permalink

Post by Jim Klimov
Now waiting for "the pretty good estimate" from a "zdb -DD pond" call...

Sorry, I meant "zdb -SS pond" for (S)imulation of dedup :)
And indeed, "zpool status -D" returns the same result on native
oi_151a8 pools as well, so I guess it regards dedup statistics
on deduplication that is already in place.

By the way, the simulated stats include information about the
savings from both dedup and compression (with LSIZE vs. PSIZE
basically, for single-copy blocks at least). Are the compression
stats taken from the already compressed blocks only, or does
(can?) the simulator actually compress the data to estimate
savings if (uncompressed?) data is recompressed (with some
default algo?)

//Jim

Jim Klimov

2014-03-30 10:58:46 UTC

Permalink

Post by Jim Klimov

Post by Jim Klimov
Now waiting for "the pretty good estimate" from a "zdb -DD pond" call...

Sorry, I meant "zdb -SS pond" for (S)imulation of dedup :)

That did not go to well... the server worked, and lagged, and spent
more and more time in swapping, and ultimately had itself rebooted.
I am not sure why (the IPMI sercon last-screen-of-life got cleared
by the BIOS), but there was still lots of available swap space.

Probably the lags caused the hardware watchdog to call a reset at
the 15-minute timeout mark.

And I still don't have an answer as to how the dedup would or won't
help on this box ;)

So... there is just one way to try this - an actual copy of the live
zoneroots onto a deduped pool to see how that goes.

As for the original question, and replies in the thread, it seems
that we should expect no extra run-time price paid (except for DDT
storage) for the deduped zoneroots which are written once and remain
rarely modified until the whole bunch is updated and then remains
static for a while again... right? That is, during normal operation
there are no extra processing or caching costs due to dedup (since
there should be no special processing for reads), and in fact there
may be savings due to caching of identical->deduped blocks only once?

Thanks again for the answers and opinions,
//Jim