Discussion:
RAIDz and 4k Sectors
Greg Zartman
2013-11-18 17:59:22 UTC
I'm working with SmartOS and have set up a 6x1TB raidz pool with an ashift
of 12 (note: I chose the ashift of 12 so I could later replace the 1TB
drives with 2/3TB drives).

I created a 100GB volume on my raidz zpool as a container for a Linux file
system (ext4). I then filled up the 100GB volume with data. After
filling the volume with data, I was surprised that the volume grew to 153GB
(zfs list). After much research online, it appears this is because of the
4k sectors and parity -- 1 4k block for data and 2 for parity. This jibes
with the 50% increase in the volume size I'm seeing.

I raised this issue on the SmartOS mailing list and have been advised to
just ditch raidz on larger hard drives (2TB+ with 4k sectors) and go with
mirrored drives.

My question: Is there any workaround or configuration to use raidz on
drives with 4k sectors that doesn't make you give up 50% storage space,
beyond the normal parity requirements? Is the only real solution to just
not use raidz and go with mirror devices?

Thanks.
--
Greg J. Zartman
Board Member

Koozali Foundation, Inc.
2755 19th Street SE
Salem, Oregon 97302
Cell: 541-5218449

SME Server user and community member since 2000



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Keith Wesolowski
2013-11-18 18:07:58 UTC
Post by Greg Zartman
I'm working with SmartOS and have set up a 6x1TB raidz pool with an ashift
of 12 (note: I chose the ashift of 12 so I could later replace the 1TB
drives with 2/3TB drives).
That's not really necessary. We're currently sourcing 512n 3TB and even
4TB drives and plan to continue doing so for some time. They are
actively being made and sold by HGST and I have not been informed of any
imminent EOL.

See for example this datasheet:
http://www.hgst.com/tech/techlib.nsf/techdocs/FD3F376DC2ECCE68882579D40082C393/$file/US7K4000_ds.pdf.

That doesn't mean 4k doesn't suck, but the bottom line is that if you
store small files on 4k disks, you're going to waste space, and it will
likely be worse on parity-based layouts.
Post by Greg Zartman
I raised this issue on the SmartOS mailing list and have been advised to
just ditch raidz on larger hard drives (2TB+ with 4k sectors) and go with
mirrored drives.
Again, there is no need to use 4k disks if they're not suited to your
application.
Greg Zartman
2013-11-18 19:33:46 UTC
On Mon, Nov 18, 2013 at 10:07 AM, Keith Wesolowski
Post by Keith Wesolowski
Post by Greg Zartman
I'm working with SmartOS and have set up a 6x1TB raidz pool with an ashift
of 12 (note: I chose the ashift of 12 so I could later replace the 1TB
drives with 2/3TB drives).
That's not really necessary. We're currently sourcing 512n 3TB and even
4TB drives and plan to continue doing so for some time. They are
actively being made and sold by HGST and I have not been informed of any
imminent EOL.
I am using Western Digital Red drives, 2TB. Looks like they have 4k native
sectors. I'm not sure if they support 512e, but my guess is you'd lose
some I/O performance in emulation mode.
Post by Keith Wesolowski
That doesn't mean 4k doesn't suck, but the bottom line is that if you
store small files on 4k disks, you're going to waste space, and it will
likely be worse on parity-based layouts.
I'm losing about 53% with raidz. Perhaps it's the way Linux is formatting
the ext4 partition in the VM container. I'll look into this.

Greg



Ian Collins
2013-11-18 20:28:12 UTC
Post by Greg Zartman
On Mon, Nov 18, 2013 at 10:07 AM, Keith Wesolowski
That doesn't mean 4k doesn't suck, but the bottom line is that if you
store small files on 4k disks, you're going to waste space, and it will
likely be worse on parity-based layouts.
I'm losing about 53% with raidz. Perhaps it's the way Linux is
formatting the ext4 partition in the VM container. I'll look into this.
Did you follow the volume creation suggestions offered on the SmartOS list?
--
Ian.
Greg Zartman
2013-11-18 20:44:10 UTC
Post by Ian Collins
Did you follow the volume creation suggestions offered on the SmartOS list?
Working on it. I was playing around on a mirrored array, so I'll need to
revert to my raidz config to see if the suggestions work.

Sounds like there are multiple things going on here. I was curious what
others were doing, and it sounds like the SmartOS people are simply buying
512-byte-sector drives.

My drives have 512e, but I'm not sure how much I'd lose in I/O for this
emulation mode.



Karl Wagner
2013-11-18 19:53:27 UTC
The data sheet you've linked to is not a 512 native drive, it is AF (4k)
with 512 emulation.
Keith Wesolowski
2013-11-18 20:10:00 UTC
Post by Karl Wagner
The data sheet you've linked to is not a 512 native drive, it is AF (4k)
with 512 emulation.
If you read carefully, you will see that they make both 512n and 512e
models. The model number encodes the sector size.
Karl Wagner
2013-11-18 20:13:02 UTC
So they do. Sorry, my mistake.
Richard Laager
2013-11-18 18:22:38 UTC
Post by Greg Zartman
I'm working with SmartOS and have set up a 6x1TB raidz pool with an
This should say raidz2, I assume, given your later mention of two blocks
of parity. If it's really raidz1, then the math doesn't work out evenly,
which makes this problem that much worse.

Compression helps make it better, and you probably want lz4 anyway,
unless your data is incompressible.
Post by Greg Zartman
ashift of 12 (note: I chose the ashift of 12 so I could later
replace the 1TB drives with 2/3TB drives).
I created a 100GB volume on my raidz zpool as a container for a Linux
file system (ext4).
(6-2) * 4K = 16K. Set your volblocksize on the zvol to 16K.

Also, you should add `-E stride=4,stripe-width=4` to your mke2fs command
line when you create the filesystem in Linux. I'm not sure how much
difference this makes in practice, but setting it properly isn't too
hard, so we always do it here.

If you want to go larger (to reduce the ZFS metadata overhead), use
volblocksize=32K and stride=8,stripe-width=8, 64K and 16, or 128K and
32. My Nexenta reseller recommends 64K blocks on ZFS as a good
real-world default; the idea being that you really want to keep all your
ZFS metadata in RAM.
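
The stride values above all come from the same division, sketched here in Python. This assumes an ext4 block size of 4K (not stated in the thread, but implied by stride=4 at a 16K volblocksize); `ext4_raid_opts` is a made-up helper name, not part of any tool.

```python
def ext4_raid_opts(volblocksize, fs_block=4096):
    """Build the mke2fs -E option string for a given zvol volblocksize.

    fs_block is the ext4 block size in bytes; 4096 is an assumption.
    """
    if volblocksize % fs_block:
        raise ValueError("volblocksize must be a multiple of the ext4 block size")
    stride = volblocksize // fs_block
    # stripe-width is set equal to stride, matching the pairs quoted above.
    return f"stride={stride},stripe-width={stride}"


for kib in (16, 32, 64, 128):
    print(f"volblocksize={kib}K: mke2fs -E {ext4_raid_opts(kib * 1024)}")
```

The loop reproduces the 16K/4, 32K/8, 64K/16 and 128K/32 pairings.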

Also, note that volume reservations aren't calculated correctly in this
case. To use your example, the refreservation was probably calculated at
100 GB and change, but you ended up consuming 150 GB. So either always
create sparse zvols (so you're not kidding yourself), or if you really
want reserved space, create a test zvol of the same logical size and
fill it (with compression turned off); then manually re-set the
refreservation on the real zvol.

--
Jim Klimov
2013-11-18 18:26:11 UTC
Post by Greg Zartman
I'm working with SmartOS and have set up a 6x1TB raidz pool with an
ashift of 12 (note: I chose the ashift of 12 so I could later
replace the 1TB drives with 2/3TB drives).
I created a 100GB volume on my raidz zpool as a container for a Linux
file system (ext4). I then filled up the 100GB volume with data.
After filling the volume with data, I was surprised that the volume grew
to 153GB (zfs list). After much research online, it appears this is
because of the 4k sectors and parity -- 1 4k block for data and 2 for
parity. This jibes with the 50% increase in the volume size I'm seeing.
Actually, I am not so sure about this: "zfs list" does not account for the
parity/mirroring overheads (AFAIK, may be wrong). It may account for
metadata overhead associated with storage of those blocks (references
to them).
Post by Greg Zartman
My question: Is there any workaround or configuration to use raidz on
drives with 4k sectors that doesn't make you give up 50% storage space,
beyond the normal parity requirements? Is the only real solution to
just not use raidz and go with mirror devices?
You mentioned a 6-disk raidz1 - this should allow for 5 data + 1 parity.
I believe in your case the zvol uses the default blocksize of 8k, which
reduces to 2*4k data + 1 parity, as you witnessed.
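
To make the sector counting concrete, here is a rough Python sketch of how many 4k sectors one block occupies on raidz. The parity-per-row counting follows the description above. The final round-up to a multiple of (parity + 1) sectors is an assumption about how raidz pads allocations and is not stated in this thread. `raidz_sectors` is a hypothetical helper, not a ZFS interface.

```python
def raidz_sectors(psize, ndisks, nparity, ashift=12):
    """Return (data_sectors, allocated_sectors) for one block of psize bytes."""
    sector = 1 << ashift
    data = -(-psize // sector)              # ceil(psize / sector)
    rows = -(-data // (ndisks - nparity))   # stripe rows the data spans
    total = data + nparity * rows           # nparity parity sectors per row
    pad = nparity + 1                       # assumed allocation granularity
    return data, -(-total // pad) * pad


for nparity in (1, 2):
    for kib in (8, 16, 128):
        data, alloc = raidz_sectors(kib * 1024, ndisks=6, nparity=nparity)
        print(f"raidz{nparity} {kib:>3}K block: {data} data sectors, "
              f"{alloc} allocated ({alloc / data:.2f}x)")
```

With 6 disks and ashift=12 this gives 4 sectors for an 8K block on raidz1 (2 data, 1 parity, 1 padding) and 6 sectors for 8K on raidz2, while a 16K block on raidz2 fits in 6 sectors with no padding.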

Besides mirroring (which would gain in speed and be more predictable in
parity overhead ratio), you could look into larger block sizes, and
into raidz2 (4 data + 2 parity) which would be neater for 2^N sized
blocks (if you don't compress). It is often said that with modern huge
disks (3TB+) the scrub/resilver/rebuild times can be too long to run the
pool safely while it is down one disk, so 2-3 redundancy disks are
recommended (and accordingly larger disk sets, like 8+3).

//Jim
Matthew Ahrens
2013-11-21 04:07:09 UTC
Post by Jim Klimov
Post by Greg Zartman
I'm working with SmartOS and have set up a 6x1TB raidz pool with an
ashift of 12 (note: I chose the ashift of 12 so I could later
replace the 1TB drives with 2/3TB drives).
I created a 100GB volume on my raidz zpool as a container for a Linux
file system (ext4). I then filled up the 100GB volume with data.
After filling the volume with data, I was surprised that the volume grew
to 153GB (zfs list). After much research online, it appears this is
because of the 4k sectors and parity -- 1 4k block for data and 2 for
parity. This jibes with the 50% increase in the volume size I'm seeing.
Actually, I am not so sure about this: "zfs list" does not account for the
parity/mirroring overheads (AFAIK, may be wrong).
"zfs list" does take into account additional space used (e.g. for parity,
ditto blocks, gang blocks) beyond what would be expected based on a 128K
recordsize.
Post by Jim Klimov
It may account for metadata overhead associated with storage of those
blocks (references to them).
It does.
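
As a rough illustration of that 128K baseline (a sketch under assumptions, not the actual accounting code): if reported usage is raw allocation scaled by the raw-to-data ratio of a 128K block, small blocks show up as inflated. The allocation formula, including padding to a multiple of (parity + 1) sectors, is an assumption that isn't spelled out in this thread, and the function names are made up.

```python
def raidz_alloc_sectors(data_sectors, ndisks, nparity):
    """Sectors allocated on raidz for a block with data_sectors of data."""
    rows = -(-data_sectors // (ndisks - nparity))  # stripe rows spanned
    total = data_sectors + nparity * rows          # data plus parity
    pad = nparity + 1                              # assumed padding granularity
    return -(-total // pad) * pad


def reported_inflation(volblocksize, ndisks, nparity, ashift=12):
    """Approximate reported-used / logical-size ratio for a given block size."""
    sector = 1 << ashift
    data = -(-volblocksize // sector)
    base = (128 * 1024) // sector                  # 128K baseline block
    raw_per_data = raidz_alloc_sectors(data, ndisks, nparity) / data
    baseline = raidz_alloc_sectors(base, ndisks, nparity) / base
    return raw_per_data / baseline


for nparity in (1, 2):
    for kib in (8, 16, 64):
        ratio = reported_inflation(kib * 1024, ndisks=6, nparity=nparity)
        print(f"6-disk raidz{nparity}, volblocksize={kib}K: ~{ratio:.2f}x")
```

For a 6-disk raidz1 with ashift=12 this gives about 1.6 for 8K blocks, in the same ballpark as 153GB used for 100GB of data, and 1.0 for 16K blocks on a 6-disk raidz2.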

--matt