Discussion: Overhead of ashift=12 vs 9
Aneurin Price
2013-11-05 11:24:30 UTC
Hi Folks,

I'm currently in the process of migrating all of my data from one pool
to another, and I'm seeing some large discrepancies in the space usage
for some of my datasets, as reported by zfs list.

For example, one dataset goes from 6.48GB to 13.9GB, another from
208GB to 237GB, and another from 83GB to 147GB. All of these have
copies=1 and report the same compression ratio. Other datasets either
have the same reported usage, or differ by only a tiny percentage.
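
For reference, something like this shows physical vs logical usage
side by side (a sketch - the logicalused property may not exist on
older ZFS releases, and the pool/dataset names are placeholders):

### Compare reported vs logical space accounting on both pools
# zfs list -o name,used,logicalused,compressratio oldpool/data newpool/data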

The only difference I can think of that might be relevant is that the
new pool has ashift 12, whereas the old has ashift 9. Given that the
datasets with the highest percentage difference are ones that hold
mostly small files, this seems the likely explanation. E.g. the
dataset that went from 6.5 to 14GB probably contains around a million
files, putting them at a little over 4k each; I'm interested in what
du reports, but it's been running for about 24 hours so far and
hasn't completed.
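
In the meantime, GNU du can report apparent (logical) size alongside
on-disk usage, which makes the per-file overhead visible directly
(a sketch; the path is a placeholder):

### On-disk usage vs apparent size for one directory (GNU coreutils)
# du -sh /tank/smallfiles
# du -sh --apparent-size /tank/smallfiles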

I've read quite a bit about overhead from having ashift=12, but all in
the context of RAIDZ. This pool isn't using RAIDZ, just three basic
vdevs. Given that, does this sound like an expected level of overhead
coming from the higher ashift, or should I be looking for something
else?
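
In case it matters, the ashift on each pool can be double-checked
from the cached config like this (a sketch; pool names are
placeholders):

### Confirm the ashift actually in use on each pool
# zdb -C oldpool | grep ashift
# zdb -C newpool | grep ashift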

Thanks,
Nye
Jim Klimov
2013-11-05 14:43:54 UTC
Post by Aneurin Price
I've read quite a bit about overhead from having ashift=12, but all in
the context of RAIDZ. This pool isn't using RAIDZ, just three basic
vdevs. Given that, does this sound like an expected level of overhead
coming from the higher ashift, or should I be looking for something
else?
That would mean a stripe of 3 disks (LUNs, slices, whatever) or a
raid10 mirror+stripe of those, right?

I think that if, as you say, there are many small files, then you
likely end up with two 4KB allocations for each (if they remain
"a bit over 4KB" after compression) instead of, say, 9-10 512-byte
sector allocations.
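
A quick worked example of that internal fragmentation (illustrative
numbers only):

### A file that is ~4.5KB after compression:
###   ashift=9:  9 x 512B sectors = 4.5KB allocated (almost no waste)
###   ashift=12: 2 x 4KB blocks   = 8KB allocated   (~78% overhead)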

Likewise, each file also has metadata - at least one sector which
contains the block pointer entry (or entries) that make up the file.
While larger files have many blkptr_t's which can fit into one
dnode block (up to 16KB each, IIRC), small files have only one or
two entries consuming a whole sector, with lots of empty slack
space. I'd "bet" that this is where your overhead comes from,
especially the near-doubling of disk usage.

You can research individual files' allocations with ZDB, example:

### Get inode number
# ls -lai /usr/bin/bash
12010 -r-xr-xr-x 1 root bin 799044 Nov 26 2009 /usr/bin/bash

### Determine dataset
# df -k /usr/bin/bash
Filesystem            kbytes    used   avail capacity  Mounted on
rpool/ROOT/snv_129/usr
                    30965760  528424 10809487     5%   /usr

### Request allocation info on object in dataset by number
# zdb -dddddd -bbbbbb rpool/ROOT/snv_129/usr 12010
Dataset rpool/ROOT/snv_129/usr [ZPL], ID 364, cr_txg 314680, 516M, 49105
objects, rootbp DVA[0]=<0:5bb096c00:200> DVA[1]=<0:2a0c2d800:200> [L0
DMU objset] fletcher4 lzjb LE contiguous unique double size=400L/200P
birth=28256443L/28256443P fill=49105
cksum=7d6bd5cd9:364e73218ce:bf2a70e933e2:1c7bc10cc715c1

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     12010    2    16K   128K   357K   896K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        264   bonus  ZFS znode
        dnode flags: USED_BYTES
        dnode maxblkid: 6
        path    /bin/bash
        uid     0
        gid     2
        atime   Wed Nov  2 14:17:10 2011
        mtime   Thu Nov 26 00:48:02 2009
        ctime   Tue Dec 29 03:51:22 2009
        crtime  Tue Dec 29 03:51:11 2009
        gen     315244
        mode    100555
        size    799044
        parent  22
        links   1
        xattr   0
        rdev    0x0000000000000000
Indirect blocks:
0 L1 DVA[0]=<0:34c809a00:400> DVA[1]=<0:111b8a000:400>
[L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double
size=4000L/400P birth=315244L/315244P fill=7
cksum=7171242382:3c5381f40231:1301ded64cddb5:47d2537c473531a

### So here we have one L1 indirect block whose L0 entries reference
### pieces of the larger file, 7 segments overall (below), plus the
### file's dnode described above

0 L0 DVA[0]=<0:34c72e400:d800> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/d800P
birth=315244L/315244P fill=1
cksum=1bee14e9e80d:2f65b3dfbc1a1fb:8bd7b384ac9843a8:78ceb6870b887e4e
20000 L0 DVA[0]=<0:34c705200:f800> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/f800P
birth=315244L/315244P fill=1
cksum=1f3a6a883e7b:3cd63a1eae15ddd:ad790f0c7886865a:2465df27ed8f14c5
40000 L0 DVA[0]=<0:34c6f6200:f000> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/f000P
birth=315244L/315244P fill=1
cksum=1e7f42676223:39389e38a5b2a77:64651f5883c4106b:51d00e300f2c5e5f
60000 L0 DVA[0]=<0:34c714a00:fa00> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/fa00P
birth=315244L/315244P fill=1
cksum=1f6371d117a9:3d3f7debbd934c1:c634eb7023ab1abd:1654fde6f1229bd9
80000 L0 DVA[0]=<0:34c76fc00:fa00> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/fa00P
birth=315244L/315244P fill=1
cksum=1f3a2ed4c291:3d35862981da9b5:a24f7fef76f5dcc6:23e9b5cba877fae
a0000 L0 DVA[0]=<0:34c7fd800:c200> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/c200P
birth=315244L/315244P fill=1
cksum=18c419fdb031:2572d46a20bce87:c82add0a3084fc30:ae7b1bb5bc4e79be
c0000 L0 DVA[0]=<0:34c724400:1400> [L0 ZFS plain file]
fletcher4 gzip-9 LE contiguous unique single size=20000L/1400P
birth=315244L/315244P fill=1
cksum=2898c46d28f:6cfcb3a659052:b65b392a3073f70:47831f32459e8397

segment [0000000000000000, 00000000000e0000) size 896K
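
### Note on reading the sizes: size=20000L/d800P means 128KB logical
### compressed to 0xd800 (54KB) physical; the Object line's dsize of
### 357K vs lsize of 896K shows the same effect for the whole file.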
Post by Aneurin Price
Hi Folks,
I'm currently in the process of migrating all of my data from one pool
to another, and I'm seeing some large discrepancies in the space usage
for some of my datasets, as reported by zfs list.
For example, one dataset goes from 6.48GB to 13.9GB, another from
208GB to 237GB, and another from 83GB to 147GB. All of these have
copies=1 and report the same compression ratio. Other datasets either
have the same reported usage, or differ by only a tiny percentage.
The only difference I can think of that might be relevant is that the
new pool has ashift 12, whereas the old has ashift 9. Given that the
datasets with the highest percentage difference are ones that hold
mostly small files, this seems the likely explanation. E.g. the
dataset that went from 6.5 to 14GB probably contains around a million
files, putting them at a little over 4k each; I'm interested in what
du reports, but it's been running for about 24 hours so far and
hasn't completed.
Hope my guess helps,

//Jim
Aneurin Price
2013-11-05 16:33:36 UTC
Post by Jim Klimov
Post by Aneurin Price
I've read quite a bit about overhead from having ashift=12, but all in
the context of RAIDZ. This pool isn't using RAIDZ, just three basic
vdevs. Given that, does this sound like an expected level of overhead
coming from the higher ashift, or should I be looking for something
else?
That would mean a stripe of 3 disks (LUNs, slices, whatever) or a
raid10 mirror+stripe of those, right?
I think that if, as you say, there are many small files, then you
likely end up with two 4KB allocations for each (if they remain
"a bit over 4KB" after compression) instead of, say, 9-10 512-byte
sector allocations.
Likewise, each file also has metadata - at least one sector which
contains the block pointer entry (or entries) that make up the file.
While larger files have many blkptr_t's which can fit into one
dnode block (up to 16KB each, IIRC), small files have only one or
two entries consuming a whole sector, with lots of empty slack
space. I'd "bet" that this is where your overhead comes from,
especially the near-doubling of disk usage.
Thanks for these pointers. I think I am indeed looking at something
like a classic internal fragmentation case.

Further investigation of the file layout on this filesystem shows the
situation is worse than I thought: I'm using gmvault to back up my
gmail database, and it saves each e-mail as an individual file, with
an extra metadata file for each one, meaning the metadata files make
up half the total set of files. These metadata files are on the order
of a couple of hundred bytes, so this looks like pretty much the
worst-case scenario: moving from ashift 9 to 12 means they take up 8
times as much space.
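
For anyone wanting to confirm a similar pattern, counting the files
small enough to fit in a single 512-byte sector is one quick check
(a sketch; the path is a placeholder):

### Count files under 512 bytes (GNU find)
# find /tank/gmvault-db -type f -size -512c | wc -l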

We're only talking about a few GB here, so that's not the end of the
world. The larger filesystems, although the percentage is smaller, do
have a lot more wasted space in absolute terms, but I guess there's
not a great deal I can do short of storing all their data in an
archive or something. Probably not worth the hassle for the sake of
100GB or so.

Anyway, at least I'm now reasonably satisfied that I know where all
this space is going, so thanks.
Nye
Jim Klimov
2013-11-05 16:52:38 UTC
Post by Aneurin Price
Further investigation of the file layout on this filesystem shows the
situation is worse than I thought: I'm using gmvault to back up my
gmail database, and it saves each e-mail as an individual file, with
an extra metadata file for each one, meaning the metadata files make
up half the total set of files. These metadata files are on the order
of a couple of hundred bytes, so this looks like pretty much the
worst-case scenario: moving from ashift 9 to 12 means they take up 8
times as much space.
You can also investigate creating a zvol, mounting it from localhost
over iSCSI, and spawning a ZFS pool with ashift=9 inside.
This may be a somewhat fragile solution; however, I (and I think
"Ned") have gone through such an experiment. In my case, though, the
zvol was created with a 4KB blocksize "for efficiency" (hardware
IOs would match logical IOs) and dedup enabled inside... and on my
8GB machine the idea was overall a failure. But it is not impossible.
I believe in your backup case IO performance would not be the
bottleneck, and saving some tens or hundreds of GBs (and learning
something new) might be worth it...
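
Roughly along these lines (a sketch of the simplest variant without
the iSCSI loop - device paths and the ashift option syntax vary by
OS, the names are placeholders, and a pool on a local zvol is exactly
the fragile part mentioned above):

### Backing zvol with 512B volblocksize so the inner pool can be ashift=9
# zfs create -V 200G -o volblocksize=512 tank/smallfs-backing
### Inner pool directly on the zvol device (Linux path shown)
# zpool create -o ashift=9 smallfs /dev/zvol/tank/smallfs-backing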

There is some trickery to add SMF services and dependencies so that
the zvol, iscsi target, client and inner pool come up and go down in
the proper order; I think this can be found in the archives.

HTH,
//Jim
Aneurin Price
2013-11-06 10:41:56 UTC
Post by Jim Klimov
Post by Aneurin Price
Further investigation of the file layout on this filesystem shows the
situation is worse than I thought: I'm using gmvault to back up my
gmail database, and it saves each e-mail as an individual file, with
an extra metadata file for each one, meaning the metadata files make
up half the total set of files. These metadata files are on the order
of a couple of hundred bytes, so this looks like pretty much the
worst-case scenario: moving from ashift 9 to 12 means they take up 8
times as much space.
You can also investigate creating a zvol, mounting it from localhost
over iSCSI, and spawning a ZFS pool with ashift=9 inside.
Or even some other FS designed for efficient block suballocation, I guess.
It might bear investigating once I have the rest of this long and
painful migration process out of the way...

Thanks for your input.
Richard Elling
2013-11-06 19:38:35 UTC
comment below...
Post by Aneurin Price
Post by Jim Klimov
Post by Aneurin Price
I've read quite a bit about overhead from having ashift=12, but all in
the context of RAIDZ. This pool isn't using RAIDZ, just three basic
vdevs. Given that, does this sound like an expected level of overhead
coming from the higher ashift, or should I be looking for something
else?
That would mean a stripe of 3 disks (LUNs, slices, whatever) or a
raid10 mirror+stripe of those, right?
I think that if, as you say, there are many small files, then you
likely end up with two 4KB allocations for each (if they remain
"a bit over 4KB" after compression) instead of, say, 9-10 512-byte
sector allocations.
Likewise, each file also has metadata - at least one sector which
contains the block pointer entry (or entries) that make up the file.
While larger files have many blkptr_t's which can fit into one
dnode block (up to 16KB each, IIRC), small files have only one or
two entries consuming a whole sector, with lots of empty slack
space. I'd "bet" that this is where your overhead comes from,
especially the near-doubling of disk usage.
Thanks for these pointers. I think I am indeed looking at something
like a classic internal fragmentation case.
It is better to show data than guess. zfs_blkstats will show exactly
where space is being consumed. Alas, you did not state what OS you are
running, and zfs_blkstats is available as an mdb command on
Solaris-derived OSes.
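
If memory serves, the invocation is along these lines (a sketch; the
spa address comes from the ::spa output, and the stats are only
populated once a scrub or resilver has walked the pool):

### Dump per-type block statistics on a Solaris-derived OS
# mdb -k
> ::spa
> <spa_addr>::zfs_blkstats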
-- richard
Post by Aneurin Price
Further investigation of the file layout on this filesystem shows the
situation is worse than I thought: I'm using gmvault to back up my
gmail database, and it saves each e-mail as an individual file, with
an extra metadata file for each one, meaning the metadata files make
up half the total set of files. These metadata files are on the order
of a couple of hundred bytes, so this looks like pretty much the
worst-case scenario: moving from ashift 9 to 12 means they take up 8
times as much space.
We're only talking about a few GB here, so that's not the end of the
world. The larger filesystems, although the percentage is smaller, do
have a lot more wasted space in absolute terms, but I guess there's
not a great deal I can do short of storing all their data in an
archive or something. Probably not worth the hassle for the sake of
100GB or so.
Anyway, at least I'm now reasonably satisfied that I know where all
this space is going, so thanks.
Nye
--
***@RichardElling.com
+1-760-896-4422
Aneurin Price
2013-11-07 11:28:13 UTC
Post by Richard Elling
comment below...
I've read quite a bit about overhead from having ashift=12, but all in
the context of RAIDZ. This pool isn't using RAIDZ, just three basic
vdevs. Given that, does this sound like an expected level of overhead
coming from the higher ashift, or should I be looking for something
else?
That would mean a stripe of 3 disks (LUNs, slices, whatever) or a
raid10 mirror+stripe of those, right?
I think that if, as you say, there are many small files, then you
likely end up with two 4KB allocations for each (if they remain
"a bit over 4KB" after compression) instead of, say, 9-10 512-byte
sector allocations.
Likewise, each file also has metadata - at least one sector which
contains the block pointer entry (or entries) that make up the file.
While larger files have many blkptr_t's which can fit into one
dnode block (up to 16KB each, IIRC), small files have only one or
two entries consuming a whole sector, with lots of empty slack
space. I'd "bet" that this is where your overhead comes from,
especially the near-doubling of disk usage.
Thanks for these pointers. I think I am indeed looking at something
like a classic internal fragmentation case.
It is better to show data than guess. zfs_blkstats will show exactly
where space is being consumed. Alas, you did not state what OS you are
running, and zfs_blkstats is available as an mdb command on
Solaris-derived OSes.
Thanks for the suggestion. Unfortunately I'm running zfsonlinux, which
doesn't currently export that data (last time I checked, anyway).
You've got me curious now, though, so I'm contemplating the easiest way
of booting temporarily into SmartOS or something to find out (this
machine is headless, so it's not as simple as 'plug in bootable usb
stick, restart, done').


