Discussion:
4KB block drives (under FreeBSD)
Paul Kraus via illumos-zfs
2014-08-31 20:33:32 UTC
I have a server running FreeBSD 10.0 with a RAIDz2 based zpool. The zpool consists of five 1TB drives. Recently one of the drives failed and I replaced it with a much more modern 1TB drive. Unfortunately, the replacement drive is 4KB block and the older drives are 512B block. So zpool status is rightly complaining:

***@FreeBSD2:/freebsd1/etc # zpool status
pool: export
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Aug 31 15:45:34 2014
113G scanned out of 2.87T at 52.1M/s, 15h25m to go
22.4G resilvered, 3.83% done
config:

        NAME                              STATE     READ WRITE CKSUM
        export                            ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            replacing-0                   ONLINE       0     0     0
              ada6p1                      ONLINE       0     0     0  block size: 512B configured, 4096B native
              diskid/DISK-WD-WMC5K0159058 ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
            diskid/DISK-JPW9K0N018164Lp1  ONLINE       0     0     0
            diskid/DISK-9QJ5252Gp1        ONLINE       0     0     0
            diskid/DISK-9QJ517HQp1        ONLINE       0     0     0
            diskid/DISK-9QJ574MKp1        ONLINE       0     0     0

errors: No known data errors

Note, I used a 2TB drive I had lying around as a temporary spare while I waited for the replacement drive to arrive. I was not worried about the 2TB drive being 4KB as I knew that I would be replacing it real soon now with a 1TB drive. Unfortunately the 1TB drive is also 4KB.

So the real question here is if I can destroy the zpool and rebuild it FORCING a 4KB block size (even on the 512B drives)? I know there will probably be performance degradation, but only until all of the older 512B drives die and get replaced.

Or am I better off having two zpools, one made up of 4KB block drives and one of 512B drives? That would be harder to manage. I could go to 2-way mirrors and just add pairs as the old drives die off and migrate the data, but since we can’t shrink a zpool, that would mean rebuilding the old zpool every time a drive fails. Not something I am looking forward to.

Reminder: This is FreeBSD 10.0

Thanks for the input.

--
Paul Kraus
***@kraus-haus.org
z***@lists.illumos.org
2014-08-31 22:43:17 UTC
Good day Paul.

In my experience, the performance penalty for a 4K drive in a 512B (ashift=9) pool is large, but a 512B drive works fine in a 4K-optimized pool. These days I only create ashift=12 pools, so that drives can later be replaced with 4K-sector models.
Post by Paul Kraus via illumos-zfs
So the real question here is if I can destroy the zpool and rebuild it
FORCING a 4KB block size (even on the 512B drives)? I know there will
probably be performance degradation, but only until all of the older 512B
drives die and get replaced.
Jakob Borg via illumos-zfs
2014-09-01 06:13:05 UTC
Post by Paul Kraus via illumos-zfs
So the real question here is if I can destroy the zpool and rebuild it FORCING 4KB block size (even on the 512B drives) ? I know there will probably be performance degradation, but only until all of the older 512B drives die and get replaced.
Yes. If you destroy and recreate the pool with those same drives,
you'll get one ashift=12 RAIDZ2 vdev (i.e. 4K block size). The ashift
is per vdev, and in your case you only have one. Creating ashift=12
vdevs when *none* of the disks report 4K size requires some trickery,
but this is the straightforward case.
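
As a rough sketch (pool and device names below are placeholders, and the data would obviously need to be copied off first), the recreate-and-verify sequence would look something like:

  zpool destroy export
  zpool create export raidz2 ada2 ada3 ada4 ada5 ada6
  zdb -C export | grep ashift    # should now report ashift: 12 for the raidz2 vdev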

//jb
Steven Hartland via illumos-zfs
2014-09-01 08:18:15 UTC
----- Original Message -----
2014-08-31 22:33 GMT+02:00 Paul Kraus via illumos-zfs
So the real question here is if I can destroy the zpool and rebuilt
it FORCING 4KB block size (even on the 512B drives) ? I know there
will probably be performance degradation, but only until all of the
older
512B drives die and get replaced.
Yes. If you destroy and recreate the pool with those same drives,
you'll get one ashift=12 RAIDZ2 vdev (i.e. 4K block size). The ashift
is per vdev, and in your case you only have one. Creating ashift=12
vdevs when *none* of the disks report 4K size requires some trickery,
but this is the straightforward case.
Actually, if you're on a late enough version you can ensure that with:
sysctl vfs.zfs.min_auto_ashift=12

Regards
Steve
Paul Kraus via illumos-zfs
2014-09-01 17:44:12 UTC
Post by Steven Hartland via illumos-zfs
Post by Jakob Borg via illumos-zfs
So the real question here is if I can destroy the zpool and rebuilt
it FORCING 4KB block size (even on the 512B drives) ? I know there
will probably be performance degradation, but only until all of the older
512B drives die and get replaced.
Yes. If you destroy and recreate the pool with those same drives,
you'll get one ashift=12 RAIDZ2 vdev (i.e. 4K block size). The ashift
is per vdev, and in your case you only have one. Creating ashift=12
vdevs when *none* of the disks report 4K size requires some trickery,
but this is the straightforward case.
sysctl vfs.zfs.min_auto_ashift=12
I am running 10.0-RELEASE-p7 and I see vfs.zfs.max_auto_ashift: 13 (which implies up to 8KB disk blocks) but I do not see a min_auto_ashift (I am using sysctl -a to look).

I would rather not build a custom kernel for this system, but stay with the RELEASE code.

If I did set min_auto_ashift would I have a problem importing ashift=9 zpools or is that tunable only for vdev creation?

Thanks.

--
Paul Kraus
***@kraus-haus.org
Steven Hartland via illumos-zfs
2014-09-01 17:54:01 UTC
Post by Steven Hartland via illumos-zfs
----- Original Message -----
Sent: Monday, September 01, 2014 6:44 PM
Subject: Re: [zfs] 4KB block drives (under FreeBSD)
Post by Steven Hartland via illumos-zfs
2014-08-31 22:33 GMT+02:00 Paul Kraus via illumos-zfs
So the real question here is if I can destroy the zpool and rebuilt
it FORCING 4KB block size (even on the 512B drives) ? I know there
will probably be performance degradation, but only until all of the older
512B drives die and get replaced.
Yes. If you destroy and recreate the pool with those same drives,
you'll get one ashift=12 RAIDZ2 vdev (i.e. 4K block size). The ashift
is per vdev, and in your case you only have one. Creating ashift=12
vdevs when *none* of the disks report 4K size requires some
trickery,
but this is the straightforward case.
sysctl vfs.zfs.min_auto_ashift=12
I am running 10.0-RELEASE-p7 and I see
vfs.zfs.max_auto_ashift: 13 (which implies up to 8KB disk blocks)
but I do not see a min_auto_ashift (I am using sysctl -a to look).
I would rather not build a custom kernel for this system, but stay
with the RELEASE code.
If I did set min_auto_ashift would I have a problem importing ashift=9
zpools or is that tunable only for vdev creation?
It was added after 10.0; it will be in 10.1 and is already available in
stable/10.

The sysctl affects only the creation of new pools, forcing them to use a
minimum ashift of the value specified, so it allows you to create ashift=12
pools even when the backing devices report a 512B sector size, for example.

It has no impact on already created pools.
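
A minimal sketch of that workflow (on a FreeBSD build that has the sysctl; pool and device names are placeholders):

  sysctl vfs.zfs.min_auto_ashift=12
  echo 'vfs.zfs.min_auto_ashift=12' >> /etc/sysctl.conf   # persist across reboots
  zpool create tank raidz2 da0 da1 da2 da3 da4            # gets ashift=12 even if all disks report 512B
  zdb -C tank | grep ashift                               # verify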

If you don't want to upgrade to stable/10 then you can use the gnop
hack to force 4k pool creation.
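
For reference, a sketch of the gnop approach (device names are placeholders; the .nop provider only needs to exist while the pool is created):

  gnop create -S 4096 /dev/ada2
  zpool create tank raidz2 ada2.nop ada3 ada4 ada5 ada6
  zpool export tank
  gnop destroy /dev/ada2.nop
  zpool import tank    # pool comes back on the plain devices, keeping ashift=12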

Regards
Steve
Paul Kraus via illumos-zfs
2014-09-01 18:04:13 UTC
Post by Steven Hartland via illumos-zfs
----- Original Message -----
I am running 10.0-RELEASE-p7 and I see
vfs.zfs.max_auto_ashift: 13 (which implies up to 8KB disk blocks)
but I do not see a min_auto_ashift (I am using sysctl -a to look).
I would rather not build a custom kernel for this system, but stay
with the RELEASE code.
If I did set min_auto_ashift would I have a problem importing ashift=9
zpools or is that tunable only for vdev creation?
It was added after 10.0, it will be in 10.1 and can be accessed in
stable/10 already.
How painless is that upgrade (to stable/10)?
Post by Steven Hartland via illumos-zfs
The sysctl effects the creation of new pools to use a min ashift of
the value specified so it allows you to create ashift=12 pools even
when the backing devices report a 512b sector size for example.
It has no impact on already created pools.
That was what I expected, thanks for confirming.
Post by Steven Hartland via illumos-zfs
If you don't want to upgrade to stable/10 then you can use the gnop
hack to force 4k pool creation.
Yeah, I have been trying to avoid that :-)

Can I also force the ashift=12 by using at least one 4KB drive in each vdev as I create them? (under 10-RELEASE)

--
Paul Kraus
***@kraus-haus.org
Paul Kraus via illumos-zfs
2014-09-01 18:01:20 UTC
Post by Jakob Borg via illumos-zfs
Post by Paul Kraus via illumos-zfs
So the real question here is if I can destroy the zpool and rebuild it FORCING 4KB block size (even on the 512B drives) ? I know there will probably be performance degradation, but only until all of the older 512B drives die and get replaced.
Yes. If you destroy and recreate the pool with those same drives,
you'll get one ashift=12 RAIDZ2 vdev (i.e. 4K block size). The ashift
is per vdev, and in your case you only have one. Creating ashift=12
vdevs when *none* of the disks report 4K size requires some trickery,
but this is the straightforward case.
At this point I am actually planning on changing my configuration. I went with the RAIDz2 so that I would have more time to replace a failed drive, during which a second failed drive would not hurt me. But I have since determined that the resilver time is throttled by the write speed of the one drive being written to. By going to a 6 disk configuration of three 2-way mirrors I get roughly the same net capacity, better day to day performance, and I may see reduced resilver times, plus I will be able to add capacity (add another mirror vdev) down the road.

So I plan on building the zpool one mirror pair at a time. I have two of the 4KB drives. I’ll build the first mirror pair using one of them, then add the second pair using the other, then replace the 4KB drive in the first vdev with an older 512B drive and add the third mirror vdev using the freed-up 4KB drive. That will give me three vdevs, all ashift=12, made up of four 512B drives and two 4KB drives. I will also have one hot spare.
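
In command form, the build-out described above would look roughly like this (device names are placeholders: da0 and da1 stand in for the two 4KB drives, ada3 through ada7 for the 512B drives):

  zpool create export1 mirror da0 ada3     # first pair; the 4KB drive should give the vdev ashift=12
  zpool add export1 mirror da1 ada4        # second pair, likewise ashift=12
  zpool replace export1 da0 ada5           # swap the 4KB drive out of the first vdev (wait for the resilver)
  zpool add export1 mirror da0 ada6        # third pair reuses the freed 4KB drive
  zpool add export1 spare ada7             # hot spare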

If I use the full drives (and let ZFS create the EFI labels), do I have to worry about alignment or does ZFS take care of that now?
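
If you do end up partitioning the drives yourself rather than giving ZFS whole disks, one way to sidestep alignment questions is to start the partition on a 1 MB boundary and check the offset afterwards; a sketch, with ada3 as a placeholder:

  gpart create -s gpt ada3
  gpart add -t freebsd-zfs -a 1m -l disk3 ada3
  gpart show ada3    # the freebsd-zfs partition should start on a 4K-aligned sector (e.g. 2048)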

Plus I now have enough drives and drive bays to make sure that I am NOT mirroring drives of the same model and production date. When I started I had 4 identical Seagate drives, now I have a good mix of Seagate, HGST, and WD so my fear of simultaneous failure of both halves of a mirror at once is greatly reduced.

Plus I will have a much more robust backup system in place, so I am now more concerned with performance on the production server; previously, reliability and redundancy were the #1 priority, now they are tied with performance :-)

--
Paul Kraus
***@kraus-haus.org
Paul Kraus via illumos-zfs
2014-09-02 01:34:55 UTC
Post by Paul Kraus via illumos-zfs
At this point I am actually planning on changing my configuration. I went with the RAIDz2 so that I would have more time to replace a failed drive, during which a second failed drive would not hurt me. But I have since determined that the resilver time is throttled by the write speed of the one drive being written to. By going to a 6 disk configuration of three 2-way mirrors I get roughly the same net capacity, better day to day performance, and I may see reduced resilver times, plus I will be able to add capacity (add another mirror vdev) down the road.
But a double drive failure might kill the whole pool. And are you sure
the benefit in decreased resilver time is real?
The resilver time for a RAIDz2 pool is limited by the random write performance of the drive being resilvered. The resilver time for a mirror pool should be no worse than that. I will test the mirror resilver time (as I have done the RAIDz2 resilver time).
Check the list archives. If I remember correctly, the pairs of mirrors
strategy is less safe than raidz2 with the same number of disks.
The MTTDL research done by Richard Elling years ago clearly shows that a 3-way mirror has better MTTDL numbers than a RAIDz2, which has better numbers than a 2-way mirror. IIRC, RAIDz3 beats the 3-way mirror. Which all makes good common sense. In the case of a 6 disk zpool, in the RAIDz2 case, when one drive fails, *any* other drive can fail and not take the zpool out; but in the case of a 3 x 2-way mirror, when one drive fails, failure of the specific drive that is its partner *will* take the zpool out.
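
For reference, that ordering falls out of the standard textbook MTTDL approximations (independent failures, mean time between failures MTBF, repair/resilver time MTTR, N disks per vdev); roughly:

  2-way mirror:        MTTDL ~ MTBF^2 / (2 * MTTR)
  raidz2 (N disks):    MTTDL ~ MTBF^3 / (N*(N-1)*(N-2) * MTTR^2)
  3-way mirror:        MTTDL ~ MTBF^3 / (6 * MTTR^2)

Since MTBF is vastly larger than MTTR, the extra MTBF/MTTR factor in the double-parity and 3-way-mirror cases dominates, and the smaller denominator then puts the 3-way mirror ahead of a wider raidz2.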

The tradeoffs are between:

1. ability to survive multiple drive failures before a resilver can complete (better with RAIDz2)

2. performance for random I/O (better with mirrors)

3. resilver time (I have measured RAIDz2 resilver time, I have yet to measure mirror resilver time)

4. Long term manageability (it is far easier to grow a set of mirrors than a RAIDz<n>)

In 2009 or so I was working with a client on a storage system for very mission critical data. The total size was about 250TB (not big by today’s standards, but it was at the time). The client could not afford the 3 weeks it would have taken to restore from backup if there had been a failure. At first (2005 or so) we tried multiple zpools to provide fault isolation, but that proved unmanageable (not technically, but from the business side).

The final design was five Sun J4400 chassis, each with 24 x 750GB drives (the 1TB drives were just starting to appear and were not cost effective yet). The configuration (after lots of consultation with Sun) was to create one zpool of 22 RAIDz2 vdevs. Each vdev consisted of one drive from each chassis (and we made it the drives in the same numbered slot, so all the drives in slot 10 were part of the same vdev); slots 0 and 1 held the hot spares.

Performance was very good (better than any of the applications that used the data required) and reliability was stellar. We tested loss of controllers, paths, and chassis, all before going into production, with no loss of data.

I very much understand the tradeoffs between mirrors and RAIDz<n> :-) In this case, loss of the entire zpool will be an inconvenience but not catastrophic as I will have a replica being synced via zfs send / zfs recv (or maybe rsync) hourly. I suspect the network load of the rsync will be heavier than the zfs send / zfs recv.

--
Paul Kraus
***@kraus-haus.org
Chris Siebenmann via illumos-zfs
2014-09-02 03:18:23 UTC
Post by Paul Kraus via illumos-zfs
The resilver time for a RAIDz2 pools limited to the random write
performance of the drive being resilvered. The resilver time for a
mirror pool should be no worse than that. I will test the mirror
resilver time (as I have done the RAIDz2 resilver time).
My view is that ZFS resilver times are generally very hard to estimate
and that it is going to be misleading to measure them on fresh pools.
This is because a ZFS resilver is not a linear process; it involves
walking the pool metadata and data in some order and rewriting some or
all of it on the new/replacement drive. In various circumstances this
can result in resilvers being streaming read or write limited, random
write limited, or random read limited (and any particular resilver may
transition between limits at different points in time).

A fresh pool with freshly loaded data is likely to be a best case as
far as linear data and metadata and low seeks are concerned. Depending
on the sort of writes that happen in your pool, your pool may stay more
or less this way or it may fragment over time.

If all else is equal I would somewhat expect a mirrored pool to do
better on resilvers than a raidz pool given the same number of disks
simply because a mirror pool can manage more IOPs a second (as a raidz
vdev effectively gets only one disk's IOPs while mirrors get all IOPs,
plus you'll have more vdevs with a mirror-based pool).

I sort of half-wish that ZFS resilvers (and scrubs) reported how much
random IO and how much linear IO they did, or at least reported how many
IOPs the resilver took as well as how much data was involved. I suppose
the real answer is to instrument this stuff with DTrace.

(Among other uses, reporting this for scrubs would give you a crude
measure of how fragmented your pool was.)
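
As a rough sketch of that kind of instrumentation (using the generic DTrace io provider, so it counts all disk I/O on the box, not just the resilver), one could aggregate I/O size distributions per direction while a resilver runs:

  dtrace -n 'io:::start {
      @sizes[args[0]->b_flags & B_READ ? "read" : "write"] =
          quantize(args[0]->b_bcount);
  }'

A distribution skewed toward small I/Os suggests mostly random work; mostly large I/Os suggests streaming behaviour.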

- cks
Paul Kraus via illumos-zfs
2014-09-02 12:44:47 UTC
Post by Chris Siebenmann via illumos-zfs
If all else is equal I would somewhat expect a mirrored pool to do
better on resilvers than a raidz pool given the same number of disks
simply because a mirror pool can manage more IOPs a second (as a raidz
vdev effectively gets only one disk's IOPs while mirrors get all IOPs,
plus you'll have more vdevs with a mirror-based pool).
This is my experience with scrubs, having more vdevs means more I/OPS available to get the job done.
Post by Chris Siebenmann via illumos-zfs
I sort of half-wish that ZFS resilvers (and scrubs) reported how much
random IO and how much linear IO they did, or at least reported how many
IOPs the resilver took as well as how much data was involved. I suppose
the real answer is to instrument this stuff with DTrace.
When I resilver (or scrub) I watch the stats from iostat -x and zpool iostat -v, and they give me a good idea of what is going on between ZFS and the drives. They make it obvious when you are limited by a single drive writing or, in the case of a scrub, by one drive out of the group that is just slower. I have a scrub running right now and one of the four drives is clearly slower than the others:

                        extended device statistics
device      r/s    w/s      kr/s    kw/s  qlen  svc_t  %b
ada2     1029.4    3.5  127423.2    26.4     0    4.8  69
ada3     1031.6    3.7  127544.8    11.7     0    2.3  51
ada8     1032.8    3.5  127293.2    26.4    10    4.7  78
ada9     1029.0    3.7  127640.8    11.7     9    8.6  96

Drives not in this zpool have been removed for clarity. The odd thing is that ada2 is a WD and ada3, ada8, and ada9 are all HGST (and while they are all 2TB Ultrastars, each one is a slightly different model).

ada2: <WDC WD2000FYYZ-01UL1B1 01.01K02> ATA-8 SATA 3.x device
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

ada3: <HGST HUS724020ALA640 MF6OAA70> ATA-8 SATA 3.x device
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

ada8: <HGST HUS724020ALE640 MJ6OA580> ATA-8 SATA 3.x device
ada8: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada8: Command Queueing enabled
ada8: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

ada9: <Hitachi HUA723020ALA640 MK7OAA10> ATA-8 SATA 3.x device
ada9: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada9: Command Queueing enabled
ada9: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

ada2 and ada3 are on one controller (SATA3) and ada8 and ada9 are on another (SATA2). All of the other drives in the system are essentially idle. It is clear to me that the HGST HUA723020ALA640 is a slightly slower drive. This is not surprising in that it is a slightly older drive (7230 vs the other two, which are 7240).

While it is not good for performance, having different types / models of drives in one pool is good for reliability. I especially try to get drives with different build dates (to avoid the dreaded “bad week” of production).

This is a freshly loaded zpool, so I expect that most of the I/O is about as linear as ZFS ever gets (unless the data were all very large files, which it is not; the data is a mix of large and small files).

To tie back to the subject, these are all 4KB drives and the vdevs are ashift=12.

[***@FreeBSD2 ~/bin]$ zpool status export1
pool: export1
state: ONLINE
scan: scrub in progress since Tue Sep 2 08:21:13 2014
307G scanned out of 1.60T at 234M/s, 1h37m to go
0 repaired, 18.73% done
config:

        NAME                             STATE     READ WRITE CKSUM
        export1                          ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            diskid/DISK-PK1134P6HSXXXX   ONLINE       0     0     0
            diskid/DISK-WD-WMC1P007XXXX  ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            diskid/DISK-MK0251YGKLXXXX   ONLINE       0     0     0
            diskid/DISK-PN2134P6HXXXXX   ONLINE       0     0     0

errors: No known data errors
[***@FreeBSD2 ~/bin]$


--
Paul Kraus
***@kraus-haus.org
Richard Elling via illumos-zfs
2014-09-02 16:17:30 UTC
Post by Chris Siebenmann via illumos-zfs
Post by Paul Kraus via illumos-zfs
The resilver time for a RAIDz2 pools limited to the random write
performance of the drive being resilvered. The resilver time for a
mirror pool should be no worse than that. I will test the mirror
resilver time (as I have done the RAIDz2 resilver time).
My view is that ZFS resilver times are generally very hard to estimate
and that it is going to be misleading to measure them on fresh pools.
This is because a ZFS resilver is not a linear process; it involves
walking the pool metadata and data in some order and rewriting some or
all of it on the new/replacement drive. In various circumstances this
can result in resilvers being streaming read or write limited, random
write limited, or random read limited (and any particular resilver may
transition between limits at different points in time).
This was particularly difficult with the old write throttle, where read traffic
could cause writes to be delayed. The new write throttle has much better
control and more clearly separates the normal workload from scrub/resilver.
Post by Chris Siebenmann via illumos-zfs
A fresh pool with freshly loaded data is likely to be a best case as
far as linear data and metadata and low seeks are concerned. Depending
on the sort of writes that happen in your pool, your pool may stay more
or less this way or it may fragment over time.
If all else is equal I would somewhat expect a mirrored pool to do
better on resilvers than a raidz pool given the same number of disks
simply because a mirror pool can manage more IOPs a second (as a raidz
vdev effectively gets only one disk's IOPs while mirrors get all IOPs,
plus you'll have more vdevs with a mirror-based pool).
In practice, this makes no difference. The important aspect is that, for HDDs,
too many concurrent I/Os lead to poor average response time. The cure is to
improve how I/Os are scheduled and keep the number of concurrent I/Os low.
HDDs are also asymmetrical in that writes are buffered, so it is generally ok to
issue writes while reads (scrub/resilver) are ongoing.
Post by Chris Siebenmann via illumos-zfs
I sort of half-wish that ZFS resilvers (and scrubs) reported how much
random IO and how much linear IO they did, or at least reported how many
IOPs the resilver took as well as how much data was involved. I suppose
the real answer is to instrument this stuff with DTrace.
This can only be useful to your workload if your workload operates temporally.
There are very few workloads that do so, streaming media being the most often
cited. For more typical workloads, your app doesn't keep things in order, so the
on-disk placement has little chance of improving the skew.
Post by Chris Siebenmann via illumos-zfs
(Among other uses, reporting this for scrubs would give you a crude
measure of how fragmented your pool was.)
You'll need to define fragmentation before I can comment further. For a definition
consisting of fragments, you can now get very detailed information about the
distribution of physical block sizes and their level via zdb -bbb.
-- richard
Paul Kraus via illumos-zfs
2014-09-02 16:22:47 UTC
Post by Richard Elling via illumos-zfs
Post by Chris Siebenmann via illumos-zfs
If all else is equal I would somewhat expect a mirrored pool to do
better on resilvers than a raidz pool given the same number of disks
simply because a mirror pool can manage more IOPs a second (as a raidz
vdev effectively gets only one disk's IOPs while mirrors get all IOPs,
plus you'll have more vdevs with a mirror-based pool).
In practice, this makes no difference. The important aspect is that, for HDDs,
too many concurrent I/Os leads to poor average response time. The cure is to
improve how I/Os are scheduled and keep the number of concurrent I/Os low.
HDDs are also asymmetrical in that writes are buffered, so it is generally ok to
issue a writes while reads (scrub/resilver) are ongoing.
So does this come back to the vfs.zfs.vdev.max_pending parameter? It defaults to 10 on FreeBSD and I *thought* the default had changed to a lower value on Illumos.

I had to tune this down to 4 (and the min down to 2) on a system with 4 SATA drives behind a port multiplier (I have finally gotten rid of that system) due to SATA timeout issues (FreeBSD 9.x).
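
For reference, on a FreeBSD 9.x system that tuning would have looked roughly like the following (tunable names as they existed before the newer I/O scheduler replaced them; set the same lines in /boot/loader.conf if they are not writable at runtime on your build):

  sysctl vfs.zfs.vdev.max_pending=4
  sysctl vfs.zfs.vdev.min_pending=2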

--
Paul Kraus
***@kraus-haus.org
