Discussion:
zpool should not accept devices with a larger (incompatible) ashift
Steven Hartland
2013-08-02 21:10:14 UTC
Permalink
The fix for https://www.illumos.org/issues/2671 added the ability
to use a device which has a larger ashift than the pool was created
with.

This is an invalid change as it allows the use of a device with a
larger sector size than the pool can access.

I believe the change was actually designed to allow "faked" ashift
increases, as used to force devices which advertise their logical
and physical sector sizes as 512b when they are in fact 4k native devices.

In that specific case this change will work, but it will also allow
a fully native 4k device to be used, at which point it will fail as
ZFS has no way to align reads and writes correctly.
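
To put illustrative numbers on it (made up for the example, not taken
from a real pool): an ashift=9 pool considers a single 512b write at
byte offset 1536 perfectly legal, but on a 4k native device neither that
offset nor that length is a multiple of the 4096b sector, so the request
cannot be issued as-is. In effect the device layer applies a test along
these lines:

#include <stdio.h>

int
main(void)
{
        unsigned long long offset = 1536;       /* 3 x 512b sectors in */
        unsigned long long length = 512;        /* a single 512b sector */
        unsigned long long secsize = 4096;      /* 4k native device */

        /* Both values must be whole multiples of the device sector. */
        if ((offset % secsize) != 0 || (length % secsize) != 0)
                printf("rejected: %llu/%llu not aligned to %llu\n",
                    offset, length, secsize);
        return (0);
}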

Initially this change should be backed out and then ashift
calculations updated to take into account two values, the minimum
sector sizes of devices and the desired sector sizes of devices.

I have just such a change for FreeBSD but I'm looking to enhance
it further with the ability to configure the "desired" value from
the zpool command line.

Regards
Steve

George Wilson
2013-08-03 14:06:13 UTC
Permalink
Steve,

I'm not sure I understand the problem. You mention that this allows a
device with a larger sector size than the pool can access. Your
particular example states that a fully native 4k device cannot be
used by ZFS because it has no way to align reads and writes. A fully 4k
device and a faked 4k device should be treated exactly the same way so I'm
not sure I follow. Can you provide more details?

BTW, there is a current proposal that was made by Justin to provide both
a logical and physical sector size. Are you proposing something similar?
If you have diffs of a patch please send them out. I'm finishing the
review for Justin's proposal and would like to see what you had in mind.

Thanks,
George

Steven Hartland
2013-08-03 14:39:09 UTC
Permalink
A faked and a fully native disk aren't the same at the device layer,
in that a faked one will still address the disks at 512b offsets so
said requests can be satisfied.

A native one however simply can't access 512b offsets without a translation
layer, e.g. for a read it may need to have its initial offset adjusted,
read a min of 4k and then only return a portion of that; for a write you
may well need to do a read so the contents of the sector can have a portion
of it updated.
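
Roughly speaking, the read half of such a translation layer has to do
something like the following. This is only a sketch of the shape of it,
not code from any actual driver; dev_read() and the other names are made
up, and it assumes the device can only be addressed in whole, aligned
4096b sectors:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NATIVE_SECSIZE  4096

/* Hypothetical helper: reads whole, aligned native sectors only. */
extern int dev_read(uint64_t sector, void *buf, size_t nsectors);

int
emulated_read(uint64_t off, void *buf, size_t len)
{
        uint64_t first = off / NATIVE_SECSIZE;          /* align start down */
        uint64_t last = (off + len + NATIVE_SECSIZE - 1) / NATIVE_SECSIZE;
        size_t nsec = (size_t)(last - first);
        char *tmp = malloc(nsec * NATIVE_SECSIZE);
        int error;

        if (tmp == NULL)
                return (ENOMEM);
        /* Read all the native sectors covering the request... */
        error = dev_read(first, tmp, nsec);
        if (error == 0) {
                /* ...and hand back only the slice that was asked for. */
                memcpy(buf, tmp + (off - first * NATIVE_SECSIZE), len);
        }
        free(tmp);
        return (error);
}

The write half is the read-modify-write case described above.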

Does that make things clearer?

I have seen a version of Justin's ideas / implementation and it was
almost exactly the same as what we currently run under FreeBSD, i.e.
the concept of two "ashift" components, which map to logical and
physical sector size.

My only addition to this would be to consider different terms to avoid
confusion, as logical and physical can be quite counter-intuitive for
those that aren't totally familiar with the scsi / ata layer. I also know
there's been some talk about changing the name "ashift" to something
more logical.

I've attached a copy of the working version we use against FreeBSD
8.2. As I'm sure you'll see it's very similar to Justin's work, or
at least the version I last saw. Terms are different in that we
use dashift (desired ashift) for the value the device would like
to use. This is allowed to be larger than the top level vdev ashift,
whereas the ashift isn't, which prevents the issue reported.
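
In essence it boils down to the distinction sketched below. This is only
an illustration of the idea, not the attached diff; the helper and field
names are made up:

#include <stdint.h>

/* Illustrative only: log2 of a power-of-two sector size. */
static uint64_t
sectorsize_to_shift(uint64_t secsize)
{
        uint64_t shift = 0;

        while ((1ULL << shift) < secsize)
                shift++;
        return (shift);
}

/*
 * ashift:  hard minimum, from the logical sector size; I/O smaller or
 *          less aligned than this cannot be issued to the device at all.
 * dashift: desired value, from the physical sector size; purely an
 *          optimisation hint, allowed to exceed the top level vdev ashift.
 */
struct vdev_shifts {
        uint64_t ashift;
        uint64_t dashift;
};

static struct vdev_shifts
vdev_compute_shifts(uint64_t logical_secsize, uint64_t physical_secsize)
{
        struct vdev_shifts s;

        s.ashift = sectorsize_to_shift(logical_secsize);
        s.dashift = sectorsize_to_shift(physical_secsize);
        if (s.dashift < s.ashift)
                s.dashift = s.ashift;
        return (s);
}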

I also wanted to add the ability to configure this from the zpool
cmd line, as has been discussed on the list, which is the main
reason I haven't got round to creating an issue / webrev against
illumos yet ;-)

Regards
Steve

George Wilson
2013-08-03 15:01:16 UTC
Permalink
On 8/3/13 10:39 AM, Steven Hartland wrote:
> A faked and a fully native disk aren't the same at the device layer,
> in that a faked one will still address the disks at 512b offsets so
> said requests can be satisfied.
The zio pipeline will adjust reads and writes to be 4k aligned. Are you
not seeing that? I wouldn't think that we would send any non-aligned
read/write requests to the device. Although we are looking at making a
change to allow unaligned access to the label of the devices.

>
> A native one however simply can't access 512b offsets without a translation
> layer, e.g. for a read it may need to have its initial offset adjusted,
> read a min of 4k and then only return a portion of that; for a write you
> may well need to do a read so the contents of the sector can have a portion
> of it updated.

I don't have a native 4k device to try this on but what happens if you
try to do a 'dd' from the raw device at a 512b offset. Does it give an
error?

I will take a closer look at your patch.

Thanks,
George

Steven Hartland
2013-08-03 15:48:12 UTC
Permalink
----- Original Message -----
From: "George Wilson" <***@delphix.com>

> On 8/3/13 10:39 AM, Steven Hartland wrote:
>> A faked and a fully native disk aren't the same at the device layer,
>> in that a faked one will still address the disks at 512b offsets so
>> said requests can be satisfied.
> The zio pipeline will adjust reads and writes to be 4k aligned. Are you
> not seeing that? I wouldn't think that we would send any non-aligned
> read/write requests to the device. Although we are looking at making a
> change to allow unaligned access to the label of the devices.

It's my understanding this will only happen if the top level vdev was
4k on creation, as it's this persistent ashift value which determines
how the requests are sent to the lower level subsystems.

What this change allows is a subordinate vdev with a higher ashift
requirement than the top level vdev, which I believe will result in ZFS
still trying to access said device according to the top level ashift
and hence result in invalid offsets and sizes.

I may have missed something, but I've not seen any stage of the pipeline
which dynamically transforms the IOs for each leaf vdev's ashift.

I believe this is why the original ashift hard fail was there in the
first place.

>> A native one however simply can't access 512b offsets without a translation
>> layer, e.g. for a read it may need to have its initial offset adjusted,
>> read a min of 4k and then only return a portion of that; for a write you
>> may well need to do a read so the contents of the sector can have a portion
>> of it updated.
>
> I don't have a native 4k device to try this on but what happens if you
> try to do a 'dd' from the raw device at a 512b offset. Does it give an
> error?

Neither do I unfortunately, but having worked on the lower layers (geom
& scsi) for FreeBSD I do know that they will fail such IO requests as it
knows they can't be satisfied by the HW. e.g.

if ((bp->bio_offset % cp->provider->sectorsize) != 0 ||
    (bp->bio_bcount % cp->provider->sectorsize) != 0) {
        bp->bio_resid = bp->bio_bcount;
        biofinish(bp, NULL, EINVAL);
        return;
}

Thinking out loud this very code may allow this to be tested on FreeBSD
using a nop device, I'll give it a shot and report back.

Regards
Steve

Steven Hartland
2013-08-03 20:49:39 UTC
Permalink
----- Original Message -----
From: "Steven Hartland"
>> I don't have a native 4k device to try this on but what happens if you
>> try to do a 'dd' from the raw device at a 512b offset. Does it give an
>> error?
>
> Neither do I unfortunately, but having worked on the lower layers (geom
> & scsi) for FreeBSD I do know that they will fail such IO requests as it
> knows they can't be satisfied by the HW. e.g.
>
> if ((bp->bio_offset % cp->provider->sectorsize) != 0 ||
> (bp->bio_bcount % cp->provider->sectorsize) != 0) {
> bp->bio_resid = bp->bio_bcount;
> biofinish(bp, NULL, EINVAL);
> return;
> }
>
> Thinking out loud this very code may allow this to be tested on FreeBSD
> using a nop device, I'll give it a shot and report back.

OK confirmed kernel panic on zpool import with:
"wrong length 1024 for sectorsize 4096"

So it tripped the following test:-
if (bp->bio_cmd & (BIO_READ|BIO_WRITE|BIO_DELETE)) {
        KASSERT(bp->bio_offset % cp->provider->sectorsize == 0,
            ("wrong offset %jd for sectorsize %u",
            bp->bio_offset, cp->provider->sectorsize));
        KASSERT(bp->bio_length % cp->provider->sectorsize == 0,
            ("wrong length %jd for sectorsize %u",
            bp->bio_length, cp->provider->sectorsize));
}

Which means ZFS tried to do incorrectly aligned IO on the 4k sector
device.

To reproduce on FreeBSD you can use the following steps:-
gnop create -S 512 ada2
zpool create tpool ada2.nop
sh -c 'for i in 0 1 2 3 4 5; do dd if=/dev/random of=/tpool/512-$i bs=512 count=1; done'
zpool export tpool
gnop destroy ada2.nop
gnop create -S 4096 ada2
zpool import
<panic>

Regards
Steve


Jim Klimov
2013-08-04 09:48:59 UTC
Permalink
This all sounds strange, i.e. I've heard statements on this list that
the "sd" driver in illumos (though allegedly not ZFS itself, so this
logic might not propagate into FreeBSD) should do RMW IOs to devices
when the IO size/alignment mismatches that of reported native sector
size. Indeed, there were discussions of this like "Why am I seeing
message XXX on console/in logs?"

So, if for any reason, a native 4k AF drive (without 512e emulation)
is used with ZFS in an ashift=9 pool, it was my understanding that
"sd" would take care of the emulation much like the 512e firmware
would have...

Am I missing something?

//Jim
Steven Hartland
2013-08-04 11:55:03 UTC
Permalink
----- Original Message -----
From: "Jim Klimov" <***@cos.ru>
Subject: Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift


> This all sounds strange, i.e. I've heard statements on this list that
> the "sd" driver in illumos (though allegedly not ZFS itself, so this
> logic might not propagate into FreeBSD) should do RMW IOs to devices
> when the IO size/alignment mismatches that of reported native sector
> size. Indeed, there were discussions of this like "Why am I seeing
> message XXX on console/in logs?"
>
> So, if for any reason, a native 4k AF drive (without 512e emulation)
> is used with ZFS in an ashift=9 pool, it was my understanding that
> "sd" would take care of the emulation much like the 512e firmware
> would have...
>
> Am I missing something?

Thanks Jim, I'm not aware of this so it would be good to get someone
familiar with sd to confirm, cc'ing developer@ for this purpose.

It does however sound odd that the device driver layer would do that,
tbh, as it could cause quite an overhead with every IO request having to
be tested and transformed if required.

Even if this is true for illumos I would suggest this change is still
removed and corrected in a more compatible and correct manner, as
illumos is the upstream ZFS for a number of projects, not all of
which have this facility in the device layer.

Regards
Steve

Hans Rosenfeld
2013-08-04 12:40:49 UTC
Permalink
On Sun, Aug 04, 2013 at 12:55:03PM +0100, Steven Hartland wrote:
> ----- Original Message ----- From: "Jim Klimov" <***@cos.ru>
> Subject: Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift
>
>
> >This all sounds strange, i.e. I've heard statements on this list that
> >the "sd" driver in illumos (though allegedly not ZFS itself, so this
> >logic might not propagate into FreeBSD) should do RMW IOs to devices
> >when the IO size/alignment mismatches that of reported native sector
> >size. Indeed, there were discussions of this like "Why am I seeing
> >message XXX on console/in logs?"
> >
> >So, if for any reason, a native 4k AF drive (without 512e emulation)
> >is used with ZFS in an ashift=9 pool, it was my understanding that
> >"sd" would take care of the emulation much like the 512e firmware
> >would have...
> >
> >Am I missing something?
>
> Thanks Jim, I'm not aware of this so it would be good to get someone
> familiar with sd to confirm, cc'ing developer@ for this purpose.

Jim is right, the illumos sd driver does RMW when necessary and warns
about it because it is slow. Does FreeBSD not support doing that using
Geom or something?

> It does however sound odd that the device driver layer would do that,
> tbh, as it could cause quite an overhead with every IO request having to
> be tested and transformed if required.
>
> Even if this is true for illumos I would suggest this change is still
> removed and corrected in a more compatible and correct manner, as
> illumos is the upstream ZFS for a number of projects, not all of
> which have this facility in the device layer.

If I understand you correctly, you attached or replaced a device with
both physical and logical block size of 4k to a top-level vdev with an
ashift of 9. I can see how that can break on systems where sd doesn't do
RMW. I don't know how well it will work on system where sd does RMW, but
I assume it will not work very well.

If there is an agreement to disallow attaching/replacing devices where
both physical and logical block sizes require a larger ashift than what
the existing vdev has, then that's just what needs to be added to the
check. The existing handling of 512e devices should be retained.
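
A rough sketch of what the adjusted check could look like (the field and
function names are made up for illustration, not the actual illumos code):

#include <stdint.h>

struct vdev_info {
        uint64_t ashift;                /* ashift in use by the vdev */
        uint64_t logical_ashift;        /* from the logical block size */
        uint64_t physical_ashift;       /* from the physical block size */
};

/*
 * Refuse an attach/replace only when the new device cannot be addressed
 * at the pool's alignment at all, i.e. when its logical block size forces
 * a larger ashift than the top-level vdev uses.  A 512e disk (logical 9,
 * physical 12) still passes on an ashift=9 top-level vdev, so the
 * existing 512e handling is retained.
 */
static int
vdev_attach_ashift_ok(const struct vdev_info *top,
    const struct vdev_info *newvd)
{
        return (newvd->logical_ashift <= top->ashift);
}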

Also, existing pools should not become unimportable because we decided
to change the ashift checking logic. If FreeBSD can't handle such pools
at all, denying the import should be specific to FreeBSD.

So whatever you do to fix that, please don't break existing pools or
functionality (that is, no backing out of 2671).


Hans



--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
Steven Hartland
2013-08-04 13:12:23 UTC
Permalink
----- Original Message -----
From: "Hans Rosenfeld" <***@nexenta.com>
To: <***@lists.illumos.org>
Cc: <***@lists.illumos.org>
Sent: Sunday, August 04, 2013 1:40 PM
Subject: Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift


> If I understand you correctly, you attached or replaced a device with
> both physical and logical block size of 4k to a top-level vdev with an
> ashift of 9. I can see how that can break on systems where sd doesn't do
> RMW. I don't know how well it will work on system where sd does RMW, but
> I assume it will not work very well.

Correct.

> If there is an agreement to disallow attaching/replacing devices where
> both physical and logical block sizes require a larger ashift than what
> the existing vdev has, then that's just what needs to be added to the
> check. The existing handling of 512e devices should be retained.

This is where the additional work comes in: teaching ZFS that there are
two values relevant to ashift calculations (physical and logical sector
sizes).

> Also, existing pools should not become unimportable because we decided
> to change the ashift checking logic. If FreeBSD can't handle such pools
> at all, denying the import should be specific to FreeBSD.
>
> So whatever you do to fix that, please don't break existing pools or
> functionality (that is, no backing out of 2671).

Sure, I'm assuming this was brought about by the illumos equivalent of
FreeBSD's disk quirks, whereby the user can override the values
reported by the device?

With regards to the warnings printed by sd, do these get triggered and
the RMW code path taken when the user manually changes the sector
size? I ask as I guess this shouldn't happen in this case and only
in the case where the disk simply can't perform said IO.

Does sd support the concept of physical and logical sector sizes?

If so then the same approach we're looking to use under FreeBSD could
be used, hence keeping the code the same while allowing optimisation for
devices which lie about their real sector sizes.

Regards
Steve

Hans Rosenfeld
2013-08-04 14:45:15 UTC
Permalink
On Sun, Aug 04, 2013 at 02:12:23PM +0100, Steven Hartland wrote:
> Sure, I'm assuming this was brought about by the illumos equivalent of
> FreeBSD's disk quirks, whereby the user can override the values
> reported by the device?

No, the problem has existed since April 2010, when ZFS learned to query the
physical block size of disks. This had the side effect that pools using
mixed physical block sizes in one vdev suddenly stopped working (as all
vdevs had an ashift=9 until then).

2671 was supposed to fix that, and it also introduced the workaround to
override the physical block size detection of sd. The latter part was
intended for folks who had disks that didn't report their physical block
size correctly.

> With regards to the warnings printed by sd, do these get triggered and
> the RMW code path taken when the user manually changes the sector
> size?

They get triggered by short or misaligned writes relative to the logical
block size of the device. I have seen this for disks with a 528b block
size, and once for iSCSI devices that used native 4k sectors (both
physical and logical).

> I ask as I guess this shouldn't happen in this case and only
> in the case where the disk simply can't perform said IO.

Exactly. It doesn't apply to those so-called 512e devices which have a
physical block size of 4k, and a logical block size of 512b.

> Does sd support the concept of physical and logical sector sizes?

Yes, it actually uses three block sizes:

un_phy_blocksize is the physical block size reported by the device.
un_tgt_blocksize is the logical block size reported by the device, and
it is the block size used in communication with the device.
un_sys_blocksize is the block size used by the system for the device; if
I'm not mistaken this is different from un_tgt_blocksize for optical
disks.

> If so then the same approach we're looking to use under FreeBSD could
> be used, hence keeping the code the same while allowing optimisation for
> devices which lie about their real sector sizes.

I don't really like all that "lying about real sector size" talk. Which
one is "real"? Is the physical block size (which may or may not be the
block size actually used by the low level format or the underlying
flash device or whatever is there) more "real" than the logical block
size that is used when communicating with the device?


Hans


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


Jim Klimov
2013-08-04 14:57:32 UTC
Permalink
On 2013-08-04 16:45, Hans Rosenfeld wrote:
> I don't really like all that "lying about real sector size" talk. Which
> one is "real"? Is the physical block size (which may or may not be the
> block size actually used by the low level format or the underlying
> flash device or whatever is there) more "real" than the logical block
> size that is used when communicating with the device?


OTOH, the "real" one in the "real" vs. "lying" distinction is the size
that requires minimal overhead processing on behalf of the device, and
may happen to be noticeably faster. Think of the measurable RMW
performance degradation with small unaligned 512-byte IOs into 4K disks
whose firmware does not report the 4K size in any field (I think some
drive models like this were reported), but uses 4K on the "physical"
layer, so RMW lags the smaller IOs despite the announced 512b sector size.

Of course, with all the anti-noise media encoding and CRCs and maybe
ECC or hardware encryption in place, we shouldn't expect (nor care
much) that the native 512b sector would only consume 512*8 bits on
the HDD either :)

So the "real" sector size is one that requires minimal conversions
while this data passes through layers :)

//Jim
Alexander Motin
2013-08-04 15:20:11 UTC
Permalink
On 04.08.2013 17:45, Hans Rosenfeld wrote:
> On Sun, Aug 04, 2013 at 02:12:23PM +0100, Steven Hartland wrote:
>> If so then the same approach we're looking to use under FreeBSD could
>> be used, hence keeping the code the same while allowing optimisation for
>> devices which lie about their real sector sizes.
>
> I don't really like all that "lying about real sector size" talk. Which
> one is "real"? Is the physical block size (which may or may not be the
> block size actually used by the low level format or the underlying
> flash device or whatever is there) more "real" than the logical block
> size that is used when communicating with the device?

The logical sector size, reported by the device, is the primary block
size of the block device. Disks are not addressed in bytes, but only in
logical sectors. That is the only constant that should never change
during the device's existence. It affects disk partitioning and many
file systems (obviously except ZFS). What is behind it (a 4K magnetic
sector, a 128K flash block or something absolutely unknown) is a problem
of the device firmware only. We had a similar discussion in FreeBSD and
I strongly believe that the OS should not attempt to swap the concept of
the logical block reported by the device (which has a strict definition
as the minimal addressable block) with anything else. Other information
(if available) can and should be used as optional hints for partition
alignment, FS tuning, etc., but never more! Otherwise, going that way
and calculating the only ashift based on that unreliable, untrusted
value is the real cause of the problems that required the fix objected
to here by Steven.

I would like to support Steven's position that ZFS should have two (or
theoretically maybe even more) values of ashift: one absolute (based on
the disk's logical sector size), which should guarantee that no short or
unaligned I/O will ever happen; and another optional (based on the
physical sector size or other magic the device hides inside), which
would help ZFS work better with the device. It should be possible to
set/modify the second value from the command line during pool creation
if the user knows more than the hardware reports, or plans a later
migration.
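
Schematically it would look something like this (names purely
illustrative, not a patch):

#include <stdint.h>

/*
 * min_ashift comes from the logical sector size and is a correctness
 * constraint; opt_ashift comes from the physical sector size (or the
 * administrator) and is only a performance hint.
 */
static uint64_t
choose_alloc_ashift(uint64_t min_ashift, uint64_t opt_ashift)
{
        /* Allocations are aligned to the larger, "optimal" value... */
        return (opt_ashift > min_ashift ? opt_ashift : min_ashift);
}

static int
io_is_issuable(uint64_t offset, uint64_t size, uint64_t min_ashift)
{
        /* ...but only min_ashift decides whether an I/O is legal. */
        uint64_t mask = (1ULL << min_ashift) - 1;

        return ((offset & mask) == 0 && (size & mask) == 0);
}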

--
Alexander Motin
George Wilson
2013-08-04 15:28:36 UTC
Permalink
On 8/4/13 9:12 AM, Steven Hartland wrote:
>> If I understand you correctly, you attached or replaced a device with
>> both physical and logical block size of 4k to a top-level vdev with an
>> ashift of 9. I can see how that can break on systems where sd doesn't do
>> RMW. I don't know how well it will work on system where sd does RMW, but
>> I assume it will not work very well.
>
> Correct.
I'm not sure this can happen. When ZFS does an attach or replace it will
check the top-level vdev's ashift value and ensure that the
attaching/replacing vdev does not have a larger ashift value. So
assuming you had a top-level vdev with an ashift of 9 and you tried to
attach a vdev with an ashift of 12 it would just fail:

        /*
         * The new device cannot have a higher alignment requirement
         * than the top-level vdev.
         */
        if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
                return (spa_vdev_exit(spa, newrootvd, txg, EDOM));

Has this been removed from FreeBSD?


- George
Steven Hartland
2013-08-04 15:54:27 UTC
Permalink
----- Original Message -----
From: "George Wilson" <***@delphix.com>
To: <***@lists.illumos.org>
Cc: "Steven Hartland" <***@multiplay.co.uk>; <***@lists.illumos.org>
Sent: Sunday, August 04, 2013 4:28 PM
Subject: Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift


> I'm not sure this can happen. When ZFS does an attach or replace it will
> check the top-level's ashift value and ensure that the
> attaching/replacing vdev does not have a larger ashift value. So
> assuming you had a top-level device with an ashift of 9 and you tried to
> attach a vdev with a ashift of 12 it would just fail:
>
> 4605 /*
> 4606 * The new device cannot have a higher alignment requirement
> 4607 * than the top-level vdev.
> 4608 */
> 4609 if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
> 4610 return (spa_vdev_exit(spa, newrootvd, txg, EDOM));
> 4611
>
> Has this been removed from FreeBSD?

This change was made while the pool was exported, to trigger the behaviour
targeted by the previous fix in question, so no replace was done.

Regards
Steve

Steven Hartland
2013-08-04 22:00:09 UTC
Permalink
----- Original Message -----
From: "George Wilson" <***@delphix.com>
To: <***@lists.illumos.org>
Cc: "Steven Hartland" <***@multiplay.co.uk>; <***@lists.illumos.org>
Sent: Sunday, August 04, 2013 4:28 PM
Subject: [developer] Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift


> I'm not sure this can happen. When ZFS does an attach or replace it will
> check the top-level's ashift value and ensure that the
> attaching/replacing vdev does not have a larger ashift value. So
> assuming you had a top-level device with an ashift of 9 and you tried to
> attach a vdev with a ashift of 12 it would just fail:
>
> 4605 /*
> 4606 * The new device cannot have a higher alignment requirement
> 4607 * than the top-level vdev.
> 4608 */
> 4609 if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
> 4610 return (spa_vdev_exit(spa, newrootvd, txg, EDOM));
> 4611
>
> Has this been removed from FreeBSD?

No, we're never attaching a disk; the pool was created with the disk and
just its parameters have changed, so there's no replace operation, hence
that code never gets run.

Regards
Steve

Justin T. Gibbs
2013-08-11 01:15:31 UTC
Permalink

I finally got around to plumbing the logical and physical information out to userland. If folks can provide some guidance on how zpool status should complain about these types of errors, I'll get this into a webrev. What I have now is a bit wordy, but I expect that a run of "zpool status" after an OS upgrade will be the first time the user becomes aware of the problem. There needs to be clear and easily accessible guidance for the user about what to do when they have this problem.

--
Justin

[***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
  pool: foo
 state: ONLINE
  scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
config:

        NAME          STATE     READ WRITE CKSUM
        foo           ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da8       ONLINE       0     0     0
            da9       ONLINE       0     0     0  suboptimal block size: 512b configured, 4096b optimal

************************ WARNING ************************
Expect reduced performance.

One or more devices are configured to use a block size
smaller than their native block size. This can occur if
the native size was improperly reported by the version of
the system software used to add the device to the pool, or
when an existing pool device is replaced with a device that
does not support the native block size of the original.

For cache and log devices, removing and re-adding the device,
or mirror containing the device, will clear the error.

For data devices, either replace them with devices that
support the configured block size, or create and migrate
your data to a new pool. The devices from the original
pool can then be added to the new pool.

errors: No known data errors
Steven Hartland
2013-08-11 01:42:27 UTC
Permalink

I like the idea of reporting this clearly in zpool status.

I would prefer something more familiar in layout and a little less
verbose, maybe something like:-

  pool: tpool
 state: ONLINE
status: One or more devices are configured to use a block size
        smaller than their native block size.
action: Replace device or migrate pool to use native block size.
   see: http://illumos.org/msg/ZFS-xxxxxx
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada1      ONLINE       0     0     0
            ada4      ONLINE       0     0     0
            ada6      DEGRADED     0     0     0  block size: 512b configured, 4096b optimal

We could use the link to provide more detailed information on
the various processes, as well as providing different "action"
and "see" links depending on the use of the disk, as per your
current verbose description.

Regards
Steve


Richard Laager
2013-08-11 02:19:29 UTC
Permalink
On Sun, 2013-08-11 at 02:42 +0100, Steven Hartland wrote:
> see: http://illumos.org/msg/ZFS-xxxxxx

I like the suggestion to use a link instead of the long description.

> ada6 DEGRADED 0 0 0 block size: 512b configured, 4096b optima

I don't like the idea of showing the disk as DEGRADED. I feel like
DEGRADED on a disk should (only) be things like "the disk is showing
high latency but hasn't outright died." I've heard talk of that being
something that should exist in the future. Then, like DEGRADED top-level
vdevs, it may make sense to avoid accessing DEGRADED disk vdevs when
possible. In the case of disk vdevs, that may or may not include
writing, but (at least in the latency case) probably means avoiding
reading.

One option (that would fit in the existing space) is UNALIGNED.

--
Steven Hartland
2013-08-11 02:47:53 UTC
Permalink
> On Sun, 2013-08-11 at 02:42 +0100, Steven Hartland wrote:
> > see: http://illumos.org/msg/ZFS-xxxxxx
>
> I like the suggestion to use a link instead of the long description.
>
> > ada6 DEGRADED 0 0 0 block size: 512b configured, 4096b optima
>
> I don't like the idea of showing the disk as DEGRADED. I feel like
> DEGRADED on a disk should (only) be things like "the disk is showing
> high latency but hasn't outright died." I've heard talk of that being
> something that should exist in the future. Then, like DEGRADED top-level
> vdevs, it may make sense to avoid accessing DEGRADED disk vdevs when
> possible. In the case of disk vdevs, that may or may not include
> writing, but (at least in the latency case) probably means avoiding
> reading.

My thinking for using DEGRADED was degraded performance; this could
be used for a number of performance-related items, be that increased
latency due to unaligned reads / writes or due to platter retries.

> One option (that would fit in the existing space) is UNALIGNED.

That sounds like a sensible suggestion too; it depends whether we want
something more generic that can be reused, or something more specific.

My question would be: what happens when you have the combination of
unaligned access and poor latency, then what do we display?

Regards
Steve
Jim Klimov
2013-08-11 08:26:40 UTC
Permalink
On 2013-08-11 04:47, Steven Hartland wrote:
> My question would be: what happens when you have the combination of
> unaligned access and poor latency, then what do we display?

I'd guess that an error overrides a warning. If you only have space
to show one value (be it the string near the device, or a status
value in a structure unless that's a bitmap), pick the worst.
* UNALIGNED is a warning that things may be suboptimal.
* DEGRADED is a call to action - to fix or replace the device, and
to avoid putting load on it in the meantime.

The more verbose description at the zpool output header can include
several notices I guess - i.e. to upgrade the pool and to fix it or
something like that.

My 2c,
//Jim
Richard Laager
2013-08-11 19:49:00 UTC
Permalink
On Sun, 2013-08-11 at 10:26 +0200, Jim Klimov wrote:
> On 2013-08-11 04:47, Steven Hartland wrote:
> > My question would be: what happens when you have the combination of
> > unaligned access and poor latency, then what do we display?
>
> I'd guess that an error overrides a warning. If you only have space
> to show one value (be it the string near the device, or a status
> value in a structure unless that's a bitmap), pick the worst.
> * UNALIGNED is a warning that things may be suboptimal.
> * DEGRADED is a call to action - to fix or replace the device, and
> to avoid putting load on it in the meantime.

This is exactly what I had in mind.

This is the current state ranking (see zpool_state_to_name() in
libzfs_pool.c):

OFFLINE > REMOVED > FAULTED > ((SPLIT > UNAVAIL) | DEGRADED) > ONLINE

I'm suggesting that UNALIGNED be a state below DEGRADED and above
ONLINE. And like any state, it would bubble up to the top-level vdevs
and the pool itself.
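
Purely as a standalone illustration (this is not libzfs code, and an
UNALIGNED rank does not exist today), the "worst state wins" roll-up
would behave along these lines:

/*
 * Illustrative sketch only -- not libzfs code.  RANK_UNALIGNED is a
 * hypothetical rank sitting between ONLINE and DEGRADED; the roll-up
 * simply reports the worst-ranked child.
 */
#include <stdio.h>

typedef enum {
        RANK_ONLINE = 0,
        RANK_UNALIGNED,   /* hypothetical: suboptimal block size */
        RANK_DEGRADED,
        RANK_FAULTED,
        RANK_OFFLINE
} rank_t;

static const char *rank_name[] = {
        "ONLINE", "UNALIGNED", "DEGRADED", "FAULTED", "OFFLINE"
};

/* A top-level vdev (or the pool) reports the worst state of its children. */
static rank_t
rollup(const rank_t *children, int n)
{
        rank_t worst = RANK_ONLINE;

        for (int i = 0; i < n; i++)
                if (children[i] > worst)
                        worst = children[i];
        return (worst);
}

int
main(void)
{
        rank_t mirror0[] = { RANK_ONLINE, RANK_UNALIGNED };
        rank_t mirror1[] = { RANK_DEGRADED, RANK_UNALIGNED };

        printf("mirror-0: %s\n", rank_name[rollup(mirror0, 2)]); /* UNALIGNED */
        printf("mirror-1: %s\n", rank_name[rollup(mirror1, 2)]); /* DEGRADED  */
        return (0);
}

The point is only the ordering; the real change would live in the vdev
state machinery and zpool_state_to_name().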


da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal

I didn't say it before, but this also feels too long. Maybe just limit
it to "512B (4096B optimal)".

--
Richard Elling
2013-08-11 20:46:34 UTC
Permalink
On Aug 11, 2013, at 12:49 PM, Richard Laager <***@wiktel.com> wrote:

> On Sun, 2013-08-11 at 10:26 +0200, Jim Klimov wrote:
>> On 2013-08-11 04:47, Steven Hartland wrote:
>>> My question would be: what happens when you have the combination of
>>> unaligned access and poor latency, then what do we display?
>>
>> I'd guess that an error overrides a warning. If you only have space
>> to show one value (be it the string near the device, or a status
>> value in a structure unless that's a bitmap), pick the worst.
>> * UNALIGNED is a warning that things may be suboptimal.
>> * DEGRADED is a call to action - to fix or replace the device, and
>> to avoid putting load on it in the meantime.
>
> This is exactly what I had in mind.
>
> This is the current state ranking (see zpool_state_to_name() in
> libzfs_pool.c):
>
> OFFLINE > REMOVED > FAULTED > ((SPLIT > UNAVAIL) | DEGRADED) > ONLINE
>
> I'm suggesting that UNALIGNED be a state below DEGRADED and above
> ONLINE. And like any state, it would bubble up to the top-level vdevs
> and the pool itself.

UNALIGNED is the wrong word. It is possible to have mismatched block sizes while
aligned. For 10+ years we've been using RMW as the acronym to describe how we
deal with mismatched block sizes (read-modify-write), but it doesn't feel satisfying in
this context. Perhaps something less cryptic, BSIZEMISMATCH?

NB, RAID-5/6 does RMW for a high percentage of write cases. It seems unwise to alert
this as a problem in ZFS when it is common and expected behavior for RAID-5/6.

We currently expect the state to roll up. In other words, the pool's state is a roll-up of
the states of the vdevs. It seems alarming to not have ONLINE be the pool state in
these cases.
-- richard

Justin T. Gibbs
2013-08-11 20:03:46 UTC
Permalink
On Aug 10, 2013, at 8:19 PM, Richard Laager <***@wiktel.com> wrote:

> On Sun, 2013-08-11 at 02:42 +0100, Steven Hartland wrote:
>> see: http://illumos.org/msg/ZFS-xxxxxx
>
> I like the suggestion to use a link instead of the long description.

I've always felt that the system should be self-documenting. A web link is useless if the system you are repairing *is* your gateway to the Internet, the target site is down, you're performing service inside a dark site, or the link is stale. Can't the link to the long description target a locally installed man page?

--
Justin
Justin T. Gibbs
2013-08-11 19:59:57 UTC
Permalink
On Aug 10, 2013, at 7:42 PM, Steven Hartland <***@multiplay.co.uk> wrote:

>>
>> ----- Original Message ----- From: "Justin T. Gibbs" <***@scsiguy.com>
>>
>> I finally got around to plumbing the logical and physical information out to userland. If folks can provide some guidance on how zpool status should complain about these types of errors, I'll get this into a webrev. What I have now is a bit wordy, but I expect that a run of "zpool status" after an OS upgrade will be the first time the user becomes aware of the problem. There needs to be clear and easily accessible guidance for the user about what to do when they have this problem.
>>
>> --
>> Justin
>>
>> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>> pool: foo
>> state: ONLINE
>> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> foo ONLINE 0 0 0
>> mirror-0 ONLINE 0 0 0
>> da8 ONLINE 0 0 0
>> da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>>
>> ************************ WARNING ************************
>> Expect reduced performance.
>>
>> One or more devices are configured to use a block size
>> smaller than their native block size. This can occur if
>> the native size was improperly reported by the version of
>> the system software used to add the device to the pool, or
>> when an existing pool device is replaced with a device that
>> does not support the native block size of the original.
>>
>> For cache and log devices, removing and re-adding the device,
>> or mirror containing the device, will clear the error.
>>
>> For data devices, either replace them with devices that
>> support the configured block size, or create and migrate
>> your data to a new pool. The devices from the original
>> pool can then be added to the new pool.
>>
>> errors: No known data errors
>>
>
> I like the idea of reporting this clearly in zpool status.
>
> I would prefer something more familiar in layout and a little less
> verbose, maybe something like:
>
> pool: tpool
> state: ONLINE
> status: One or more devices are configured to use a block size
> smaller than their native block size.
> action: Replace device or migrate pool to use native block size.
> see: http://illumos.org/msg/ZFS-xxxxxx
> scan: none requested
> config:
>
> NAME STATE READ WRITE CKSUM
> tpool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> ada1 ONLINE 0 0 0
> ada4 ONLINE 0 0 0
> ada6 DEGRADED 0 0 0 block size: 512b configured, 4096b optimal
>
> We could use the link to provide more detailed information on
> the various processes as well as providing different "action"
> and see links depending on the use of disk as per your current
> verbose description.
>
> Regards
> Steve

This is what I have now:

[***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
pool: foo
state: ONLINE
status: One or more devices are configured to use a non-native block size.
Expect reduced performance.
action: Replace affected devices with devices that support the
configured block size, or migrate data to a properly configured
pool.
scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
config:

NAME STATE READ WRITE CKSUM
foo ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
da8 ONLINE 0 0 0
da9 ONLINE 0 0 0 block size: 512B configured, 4096B native

errors: No known data errors

I don't think we should change the state of the vdev or pool unless we expect the system to perform some type of automatic action to correct the issue.

Right now, the priority for this pool status is just above "pool version is down rev". Is this the correct priority? Do we want to report all status that is available for a pool instead of just the highest priority status? Should we only report the per-vdev block size data when the pool status indicates a non-native block size?

We also have more information available to validate spares. Should we indicate when members of a pool are not protected by the attached spares due to ashift or capacity issues? How about spares that are suboptimal due to their native block-size? Right now, I don't print out block size information for spares (da9 is a native 4KiB device attached to a 512B pool):

[***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
pool: foo
state: ONLINE
scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
config:

NAME STATE READ WRITE CKSUM
foo ONLINE 0 0 0
da8 ONLINE 0 0 0
spares
da9 AVAIL

errors: No known data errors
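
For what it's worth, here is a minimal standalone sketch of the
per-vdev check behind that extra column; configured_ashift and
physical_ashift are hypothetical stand-ins for whatever the patch
actually plumbs out of the vdev config, not real field names:

/*
 * Standalone sketch only; the field names are hypothetical stand-ins
 * for whatever the real patch exposes to userland.
 */
#include <inttypes.h>
#include <stdio.h>

struct vdev_info {
        const char *name;
        uint64_t    configured_ashift;  /* ashift the pool uses for this vdev */
        uint64_t    physical_ashift;    /* ashift the device actually wants */
};

/* Print the extra column only when the device is being driven below
 * its native block size. */
static void
print_blocksize_note(const struct vdev_info *vd)
{
        if (vd->physical_ashift > vd->configured_ashift) {
                printf("%s  block size: %" PRIu64 "B configured, "
                    "%" PRIu64 "B native\n", vd->name,
                    (uint64_t)1 << vd->configured_ashift,
                    (uint64_t)1 << vd->physical_ashift);
        }
}

int
main(void)
{
        struct vdev_info da8 = { "da8", 9, 9 };   /* nothing to report */
        struct vdev_info da9 = { "da9", 9, 12 };  /* 512B configured, 4KiB native */

        print_blocksize_note(&da8);
        print_blocksize_note(&da9);
        return (0);
}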

--
Justin
Richard Laager
2013-08-11 20:34:03 UTC
Permalink
On Sun, 2013-08-11 at 14:03 -0600, Justin T. Gibbs wrote:
> Can't the link to long description target a locally installed man
> page?

I would support that, but that's orthogonal to the ashift issue. Such a
change should change all troubleshooting links to reference man pages
instead of URLs, not just this new one.

> status: One or more devices are configured to use a non-native block size.
...
> da9 ONLINE 0 0 0 block size: 512B configured, 4096B native

I like the change from "optimal" to "native". It mirrors the "512
Native" and "4K Native" marketing of the drives, and gets away from
"optimal" which has a different meaning in
Linux's /sys/block/*/queue/*size.

It's still long, though. Would "512B blocks, 4096B native" be enough?
Keep in mind that on Linux, drive names are MUCH longer because we
recommend the use of /dev/disk/by-id. Even on Illumos, SAS disk names
get quite long. Here are a couple of examples from systems of mine:

Illumos:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c4t5000C50026249533d0 ONLINE 0 0 0

Linux:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-1AMCC_3QD04BRW000000000000 ONLINE 0 0 0
scsi-SATA_ST3750640AS_3QD0494F ONLINE 0 0 0

That said, almost anything is going to wrap if you assume 80-character
terminals and disk names this long. My Linux example could only handle
"512/4096".

> I don't think we should change the state of the vdev or pool unless we
> expect the system to perform some type of automatic action to correct
> the issue.

This seems like the exact opposite of what it means. Pool states other
than ONLINE typically mean the *administrator* needs to do some *manual*
action. The only exception is that the system can (except on Linux)
automatically replace a failed disk IFF there's a spare available. The
system being in the process of "doing something" doesn't change the pool
from what it would otherwise be; for example, when a non-faulted device
is being replaced, the pool state remains ONLINE.

--
Steven Hartland
2013-08-11 20:53:13 UTC
Permalink
> On Sun, 2013-08-11 at 14:03 -0600, Justin T. Gibbs wrote:
> > Can't the link to long description target a locally installed man
> > page?
>
> I would support that, but that's orthogonal to the ashift issue. Such a
> change should change all troubleshooting links to reference man pages
> instead of URLs, not just this new one.

I like this idea; something like the following would be great:
see: man zpool-status(8) blocksize

Regards
Steve

Justin T. Gibbs
2013-08-11 21:13:24 UTC
Permalink
On Aug 11, 2013, at 2:34 PM, Richard Laager <***@wiktel.com> wrote:

> On Sun, 2013-08-11 at 14:03 -0600, Justin T. Gibbs wrote:
>> Can't the link to long description target a locally installed man
>> page?
>
> I would support that, but that's orthogonal to the ashift issue. Such a
> change should change all troubleshooting links to reference man pages
> instead of URLs, not just this new one.

Sure.

>> status: One or more devices are configured to use a non-native block size.
> ...
>> da9 ONLINE 0 0 0 block size: 512B configured, 4096B native
>
> I like the change from "optimal" to "native". It mirrors the "512
> Native" and "4K Native" marketing of the drives, and gets away from
> "optimal" which has a different meaning in
> Linux's /sys/block/*/queue/*size.
>
> It's still long, though. Would "512B blocks, 4096B native" be enough?

I use an 80col terminal and completely understand the desire to keep the output short. But I don't think the shorter version is as clear.

> Keep in mind that on Linux, drive names are MUCH longer because we
> recommend the use of /dev/disk/by-id. Even on Illumos, SAS disk names
> get quite long. Here are a couple of examples from systems of mine:
>
> Illumos:
> NAME STATE READ WRITE CKSUM
> pool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> c4t5000C50026249533d0 ONLINE 0 0 0
>
> Linux:
> NAME STATE READ WRITE CKSUM
> pool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> scsi-1AMCC_3QD04BRW000000000000 ONLINE 0 0 0
> scsi-SATA_ST3750640AS_3QD0494F ONLINE 0 0 0
>
> That said, almost anything is going to wrap if you assume 80-character
> terminals and disk names this long. My Linux example could only handle
> "512/4096".

Yes. Most of the existing status data that can be placed to the right of a vdev would cause a wrap in this case.

>
>> I don't think we should change the state of the vdev or pool unless we
>> expect the system to perform some type of automatic action to correct
>> the issue.
>
> This seems like the exact opposite of what it means. Pool states other
> than ONLINE typically mean the *administrator* needs to do some *manual*
> action. The only exception is that the system can (except on Linux)
> automatically replace a failed disk IFF there's a spare available. The
> system being in the process of "doing something" doesn't change the pool
> from what it would otherwise be; for example, when a non-faulted device
> is being replaced, the pool state remains ONLINE.

There is already precedent in "zpool status" for administrative "notices" to be output in the status section even when all state is ONLINE. The most common example is a down-rev pool version. In my code now, the vdev's state does not change, which means there's no FMA event, and thus no FMA action. Unless we need an FMA event to alert the admin, or plan to activate a spare in this case, I see no reason to further complicate vdev state management in the kernel or userland to handle the non-native block size case.

--
Justin
Steven Hartland
2013-08-11 20:39:58 UTC
Permalink
> ----- Original Message -----
> From: "Justin T. Gibbs" <***@scsiguy.com>
> This is what I have now:
>
> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
> pool: foo
> state: ONLINE
> status: One or more devices are configured to use a non-native block size.
> Expect reduced performance.
> action: Replace affected devices with devices that support the
> configured block size, or migrate data to a properly configured
> pool.
> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
> config:
>
> NAME STATE READ WRITE CKSUM
> foo ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> da8 ONLINE 0 0 0
> da9 ONLINE 0 0 0 block size: 512B configured, 4096B native
>
> errors: No known data errors

Looks good. Are you planning on adding a "see:" link so we can provide
more to the user?

> I don't think we should change the state of the vdev or pool unless
> we expect the system to perform some type of automatic action to
> correct the issue.
>
> Right now, the priority for this pool status is just above "pool
> version is down rev". Is this the correct priority?

> Do we want to report all status that is available for a pool instead
> of just the highest priority status?

While it would be nice to have additional messages, it could get big
quickly, so I'd be happy with just the highest priority personally.
When that issue is resolved, the lower-priority one will show then.

> Should we only report the per-vdev block size data when the pool status
> indicates a non-native block size?

Not sure what you mean there; could you clarify?

> We also have more information available to validate spares. Should
> we indicate when members of a pool are not protected by the attached
> spares due to ashift or capacity issues?

I'd go with yes.

> How about spares that are suboptimal due to their native block-size?

Again yes, the reasoning being that if the box is highly loaded, having
a spare kick in after a failure could perform worse than not having it
present at all.
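
As a rough standalone sketch of the kind of classification being
discussed (the field names and the three-way result are hypothetical,
not the actual spare-validation code), checking a spare against a pool
member might look like:

/*
 * Standalone sketch only; hypothetical field names and classification.
 */
#include <stdint.h>
#include <stdio.h>

struct dev {
        const char *name;
        uint64_t    ashift;   /* log2 of the smallest sector size it can use */
        uint64_t    size;     /* usable capacity in bytes */
};

enum spare_fit { SPARE_OK, SPARE_SUBOPTIMAL, SPARE_INCOMPATIBLE };

static enum spare_fit
check_spare(const struct dev *spare, const struct dev *vdev)
{
        if (spare->size < vdev->size)
                return (SPARE_INCOMPATIBLE);   /* cannot resilver at all */
        if (spare->ashift > vdev->ashift)
                return (SPARE_SUBOPTIMAL);     /* e.g. 4KiB spare, 512B vdev */
        return (SPARE_OK);
}

int
main(void)
{
        struct dev da8 = { "da8", 9, 1000204886016ULL };   /* 512B pool disk */
        struct dev da9 = { "da9", 12, 1000204886016ULL };  /* 4KiB spare */
        static const char *fit[] = { "ok", "suboptimal", "incompatible" };

        printf("spare %s for %s: %s\n", da9.name, da8.name,
            fit[check_spare(&da9, &da8)]);
        return (0);
}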

> Right now, I don't print out block size information for spares (da9
> is a native 4KiB device attached to a 512B pool):
>
> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
> pool: foo
> state: ONLINE
> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
> config:
>
> NAME STATE READ WRITE CKSUM
> foo ONLINE 0 0 0
> da8 ONLINE 0 0 0
> spares
> da9 AVAIL
>
> errors: No known data errors

Keep up the good work, looking forward to seeing this :)

Regards
Steve

Justin T. Gibbs
2013-08-11 21:24:07 UTC
Permalink
On Aug 11, 2013, at 2:39 PM, "Steven Hartland" <***@multiplay.co.uk> wrote:

>> ----- Original Message ----- From: "Justin T. Gibbs" <***@scsiguy.com>
>> This is what I have now:
>> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>> pool: foo
>> state: ONLINE
>> status: One or more devices are configured to use a non-native block size.
>> Expect reduced performance.
>> action: Replace affected devices with devices that support the
>> configured block size, or migrate data to a properly configured
>> pool.
>> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>> config:
>> NAME STATE READ WRITE CKSUM
>> foo ONLINE 0 0 0
>> mirror-0 ONLINE 0 0 0
>> da8 ONLINE 0 0 0
>> da9 ONLINE 0 0 0 block size: 512B configured, 4096B native
>> errors: No known data errors
>
> Looks good. Are you planning on adding a "see:" link so we can provide
> more to the user?
>
>> I don't think we should change the state of the vdev or pool unless
>> we expect the system to perform some type of automatic action to
>> correct the issue.
>> Right now, the priority for this pool status is just above "pool
>> version is down rev". Is this the correct priority?
>
>> Do we want to report all status that is available for a pool instead
>> of just the highest priority status?
>
> While it would be nice to have additional messages, it could get big
> quickly, so I'd be happy with just the highest priority personally.
> When that issue is resolved, the lower-priority one will show then.
>
>> Should we only report the per-vdev block size data when the pool status
>> indicates a non-native block size?
>
> Not sure what you mean there; could you clarify?

Just like with pool status, the per-vdev status entries are prioritized. If one vdev is DEGRADED, should we show the block size mismatch for vdevs that aren't, even though the pool status message doesn't mention the non-native block size issue?

>> We also have more information available to validate spares. Should
>> we indicate when members of a pool are not protected by the attached
>> spares due to ashift or capacity issues?
>
> I'd go with yes.
>
>> How about spares that are suboptimal due to their native block-size?
>
> Again yes, the reasoning being that if the box is highly loaded, having
> a spare kick in after a failure could perform worse than not having it
> present at all.

Ok. I may push this off to a separate webrev.

--
Justin
Richard Elling
2013-08-11 16:18:25 UTC
Permalink
On Aug 10, 2013, at 6:15 PM, Justin T. Gibbs <***@scsiguy.com> wrote:
>
> I finally got around to plumbing the logical and physical information out to userland. If folks can provide some guidance on how zpool status should complain about these types of errors, I'll get this into a webrev. What I have now is a bit wordy, but I expect that a run of "zpool status" after an OS upgrade will be the first time the user becomes aware of the problem. There needs to be clear and easily accessible guidance for the user about what to do when they have this problem.
>
> --
> Justin
>
> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
> pool: foo
> state: ONLINE
> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
> config:
>
> NAME STATE READ WRITE CKSUM
> foo ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> da8 ONLINE 0 0 0
> da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal

'b' is bits. Perhaps you meant 'B'?

Looking forward, the vast majority of devices will RMW. Is it really worth
showing this to everyone every time? I understand the concern for the
people using modern HDDs, but nobody expects performance from HDDs.

>
> ************************ WARNING ************************

IMHO, this should be notice level, not warning. There is no functional risk.
-- richard

> Expect reduced performance.
>
> One or more devices are configured to use a block size
> smaller than their native block size. This can occur if
> the native size was improperly reported by the version of
> the system software used to add the device to the pool, or
> when an existing pool device is replaced with a device that
> does not support the native block size of the original.
>
> For cache and log devices, removing and re-adding the device,
> or mirror containing the device, will clear the error.
>
> For data devices, either replace them with devices that
> support the configured block size, or create and migrate
> your data to a new pool. The devices from the original
> pool can then be added to the new pool.
>
> errors: No known data errors
>
Steven Hartland
2013-08-11 17:32:43 UTC
Permalink
> ----- Original Message -----
> From: "Richard Elling" <***@richardelling.com>
> > Justin
> > [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
> > pool: foo
> > state: ONLINE
> > scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
> > config:
> >
> > NAME STATE READ WRITE CKSUM
> > foo ONLINE 0 0 0
> > mirror-0 ONLINE 0 0 0
> > da8 ONLINE 0 0 0
> > da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>
> 'b' is bits. Perhaps you meant 'B'?
>
> Looking forward, the vast majority of devices will RMW. Is it really worth
> showing this to everyone every time? I understand the concern for the
> people using modern HDDs, but nobody expects performance from HDDs.

I'd say yes given the quite serious performance issues, otherwise
you risk the user missing it.

Regards
Steve

Richard Elling
2013-08-11 19:04:15 UTC
Permalink
On Aug 11, 2013, at 10:32 AM, "Steven Hartland" <***@multiplay.co.uk> wrote:

>> ----- Original Message ----- From: "Richard Elling" <***@richardelling.com>
>> > Justin
>> > [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>> > pool: foo
>> > state: ONLINE
>> > scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>> > config:
>> > > NAME STATE READ WRITE CKSUM
>> > foo ONLINE 0 0 0
>> > mirror-0 ONLINE 0 0 0
>> > da8 ONLINE 0 0 0
>> > da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>> 'b' is bits. Perhaps you meant 'B'?
>> Looking forward, the vast majority of devices will RMW. Is it really worth
>> showing this to everyone every time? I understand the concern for the
>> people using modern HDDs, but nobody expects performance from HDDs.
>
> I'd say yes given the quite serious performance issues, otherwise
> you risk the user missing it.

What is "quite serious"? Show me the data.
-- richard

Justin T. Gibbs
2013-08-11 20:20:08 UTC
Permalink
On Aug 11, 2013, at 1:04 PM, Richard Elling <***@richardelling.com> wrote:

> On Aug 11, 2013, at 10:32 AM, "Steven Hartland" <***@multiplay.co.uk> wrote:
>
>>> ----- Original Message ----- From: "Richard Elling" <***@richardelling.com>
>>> > Justin
>>> > [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>>> > pool: foo
>>> > state: ONLINE
>>> > scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>>> > config:
>>> > > NAME STATE READ WRITE CKSUM
>>> > foo ONLINE 0 0 0
>>> > mirror-0 ONLINE 0 0 0
>>> > da8 ONLINE 0 0 0
>>> > da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>>> 'b' is bits. Perhaps you meant 'B'?
>>> Looking forward, the vast majority of devices will RMW. Is it really worth
>>> showing this to everyone every time? I understand the concern for the
>>> people using modern HDDs, but nobody expects performance from HDDs.
>>
>> I'd say yes given the quite serious performance issues, otherwise
>> you risk the user missing it.
>
> What is "quite serious"? Show me the data.
> -- richard

Here are some old graphs I found that compare the performance of a smart optimus SSD with ashift 9 and ashift 12. There are other, unrelated performance issues revealed in these graphs (e.g. the falloff for 1MB block reads), but they should give a sense of how important getting the ashift correct is for write performance:
Richard Elling
2013-08-11 20:35:12 UTC
Permalink
On Aug 11, 2013, at 1:20 PM, "Justin T. Gibbs" <***@scsiguy.com> wrote:

> On Aug 11, 2013, at 1:04 PM, Richard Elling <***@richardelling.com> wrote:
>
>> On Aug 11, 2013, at 10:32 AM, "Steven Hartland" <***@multiplay.co.uk> wrote:
>>
>>>> ----- Original Message ----- From: "Richard Elling" <***@richardelling.com>
>>>> > Justin
>>>> > [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>>>> > pool: foo
>>>> > state: ONLINE
>>>> > scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>>>> > config:
>>>> > > NAME STATE READ WRITE CKSUM
>>>> > foo ONLINE 0 0 0
>>>> > mirror-0 ONLINE 0 0 0
>>>> > da8 ONLINE 0 0 0
>>>> > da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>>>> 'b' is bits. Perhaps you meant 'B'?
>>>> Looking forward, the vast majority of devices will RMW. Is it really worth
>>>> showing this to everyone every time? I understand the concern for the
>>>> people using modern HDDs, but nobody expects performance from HDDs.
>>>
>>> I'd say yes given the quite serious performance issues, otherwise
>>> you risk the user missing it.
>>
>> What is "quite serious"? Show me the data.
>> -- richard
>
> Here are some old graphs I found that compare the performance of a smart optimus SSD with ashift 9 and ashift 12. There are other, unrelated performance issues revealed in these graphs (e.g. the falloff for 1MB block reads), but they should give a sense of how important getting the ashift correct is for write performance:

Unfortunately, these are SSDs and they tend to have large physical pages and known
optimizations around 4k block sizes (optimized for NTFS). In any case, it beats the snot
out of HDD performance ;-)

As expected, we see no real difference in read performance. Did you measure latency by chance?
-- richard
Justin T. Gibbs
2013-08-11 20:57:01 UTC
Permalink
On Aug 11, 2013, at 2:35 PM, Richard Elling <***@richardelling.com> wrote:

>
> On Aug 11, 2013, at 1:20 PM, "Justin T. Gibbs" <***@scsiguy.com> wrote:
>
>> On Aug 11, 2013, at 1:04 PM, Richard Elling <***@richardelling.com> wrote:
>>
>>> On Aug 11, 2013, at 10:32 AM, "Steven Hartland" <***@multiplay.co.uk> wrote:
>>>
>>>>> ----- Original Message ----- From: "Richard Elling" <***@richardelling.com>
>>>>> > Justin
>>>>> > [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>>>>> > pool: foo
>>>>> > state: ONLINE
>>>>> > scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>>>>> > config:
>>>>> > > NAME STATE READ WRITE CKSUM
>>>>> > foo ONLINE 0 0 0
>>>>> > mirror-0 ONLINE 0 0 0
>>>>> > da8 ONLINE 0 0 0
>>>>> > da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>>>>> 'b' is bits. Perhaps you meant 'B'?
>>>>> Looking forward, the vast majority of devices will RMW. Is it really worth
>>>>> showing this to everyone every time? I understand the concern for the
>>>>> people using modern HDDs, but nobody expects performance from HDDs.
>>>>
>>>> I'd say yes given the quite serious performance issues, otherwise
>>>> you risk the user missing it.
>>>
>>> What is "quite serious"? Show me the data.
>>> -- richard
>>
>> Here are some old graphs I found that compare the performance of a smart optimus SSD with ashift 9 and ashift 12. There are other, unrelated performance issues revealed in these graphs (e.g. the falloff for 1MB block reads), but they should give a sense of how important getting the ashift correct is for write performance:
>
> Unfortunately, these are SSDs and they tend to have large physical pages and known
> optimizations around 4k block sizes (optimized for NTFS). In any case, it beats the snot
> out of HDD performance ;-)

Sure, but the question at hand was, "Is it really worth showing this to everyone every time?" I believe the graphs support answering, "yes".

> As expected, we see no real difference in read performance. Did you measure latency by chance?

I have a latency graph, but this one wasn't from my own experiments and I haven't yet found the original fio profile that was used. I believe the queue depth was 128, but without knowing it for sure, the data is not as useful as it would otherwise be.

--
Justin
Richard Elling
2013-08-11 21:19:49 UTC
Permalink
On Aug 11, 2013, at 1:57 PM, "Justin T. Gibbs" <***@scsiguy.com> wrote:
>
> I have a latency graph, but this one wasn't from my own experiments and I haven't yet found the original fio profile that was used. I believe the queue depth was 128, but without knowing it for sure, the data is not as useful as it would otherwise be.

These graphs raise more questions than they answer. Clearly the clumps around 1s need more
investigation, as does working out why 1MB sizes are > 8x 128k sizes. Alas, we probably
need more modern measurements and plenty of spare time to analyze.

In short, I agree that it is useful to know the logical/physical block size ratio and presenting it
in zpool status (-v?) is fine. But we need to take care to not conflate optimal performance with
functionality. And most importantly, don't scare the Ops.
-- richard

Justin T. Gibbs
2013-08-11 20:26:57 UTC
Permalink
On Aug 11, 2013, at 10:18 AM, Richard Elling <***@richardelling.com> wrote:

>
> On Aug 10, 2013, at 6:15 PM, Justin T. Gibbs <***@scsiguy.com> wrote:
>>
>> I finally got around to plumbing the logical and physical information out to userland. If folks can provide some guidance on how zpool status should complain about these types of errors, I'll get this into a webrev. What I have now is a bit wordy, but I expect that a run of "zpool status" after an OS upgrade will be the first time the user becomes aware of the problem. There needs to be clear and easily accessible guidance for the user about what to do when they have this problem.
>>
>> --
>> Justin
>>
>> [***@stealthy /usr/home/justing/perforce/SpectraBSD/cddl/sbin/zpool]# zpool status
>> pool: foo
>> state: ONLINE
>> scan: resilvered 98.5K in 0h0m with 0 errors on Sun Aug 11 01:03:51 2013
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> foo ONLINE 0 0 0
>> mirror-0 ONLINE 0 0 0
>> da8 ONLINE 0 0 0
>> da9 ONLINE 0 0 0 suboptimal block size: 512b configured, 4096b optimal
>
> 'b' is bits. Perhaps you meant 'B'?

Yes. Fixed now.

> Looking forward, the vast majority of devices will RMW. Is it really worth
> showing this to everyone every time? I understand the concern for the
> people using modern HDDs, but nobody expects performance from HDDs.

The impact is still very large on SSDs. See my other follow up.

>> ************************ WARNING ************************
>
> IMHO, this should be notice level, not warning. There is no functional risk.

Please review the output in more recent email to see if it addresses this concern.

Thanks,
Justin
George Wilson
2013-08-04 15:45:11 UTC
Permalink
On 8/3/13 11:48 AM, Steven Hartland wrote:
> ----- Original Message ----- From: "George Wilson"
> <***@delphix.com>
>
>> On 8/3/13 10:39 AM, Steven Hartland wrote:
>>> A faked and fully native disk aren't the same at the device layer
>>> in that a faked one will still address the disks at 512b offsets so
>>> said requests can be satisfied.
>> The zio pipeline will adjust reads and writes to be 4k aligned. Are
>> you not seeing that? I wouldn't think that we would send any
>> non-aligned read/write requests to the device. Although we are
>> looking at making a change to allow unaligned access to the label of
>> the devices.
>
> It's my understanding this will only happen if the top level vdev was
> 4k on creation, as it's this persistent ashift value which determines
> how the requests are sent to the lower level subsystems.
>
> What this change allows is a subordinate vdev with a higher ashift
> requirement than the top level vdev, which I believe will result in ZFS
> still trying to access said device according to the top level ashift
> and hence result in invalid offsets and sizes.
>
> I may have missed something but I've not seen any stage of the pipeline
> which dynamically transforms the IO's for each leaf vdev ashift.
>
> I believe this is why the original ashift hard fail was there in the
> first place.
This should happen in zio_vdev_io_start():

if (P2PHASE(zio->io_size, align) != 0) {
        uint64_t asize = P2ROUNDUP(zio->io_size, align);
        char *abuf = NULL;
        if (zio->io_type == ZIO_TYPE_READ ||
            zio->io_type == ZIO_TYPE_WRITE)
                abuf = zio_buf_alloc(asize);
        ASSERT(vd == vd->vdev_top);
        if (zio->io_type == ZIO_TYPE_WRITE) {
                bcopy(zio->io_data, abuf, zio->io_size);
                bzero(abuf + zio->io_size, asize - zio->io_size);
        }
        zio_push_transform(zio, abuf, asize, abuf ? asize : 0,
            zio_subblock);
}
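
For illustration, here is a small standalone version of that rounding;
the macro definitions below are power-of-two equivalents written out
for clarity rather than copied from sys/sysmacros.h:

#include <inttypes.h>
#include <stdio.h>

/* Power-of-two equivalents of the kernel macros, for illustration only. */
#define P2PHASE(x, align)   ((x) & ((align) - 1))
#define P2ROUNDUP(x, align) (((x) + (align) - 1) & ~((align) - 1))

int
main(void)
{
        uint64_t align = UINT64_C(1) << 12;   /* top-level vdev ashift = 12 */
        uint64_t io_size = 512;               /* e.g. a small metadata read */

        if (P2PHASE(io_size, align) != 0) {
                uint64_t asize = P2ROUNDUP(io_size, align);

                /*
                 * In the zio layer the padded buffer is pushed as a
                 * transform; per the ASSERT above, this path runs for the
                 * top-level vdev, so align reflects the top-level ashift
                 * rather than any leaf's.
                 */
                printf("%" PRIu64 "-byte I/O padded to %" PRIu64 " bytes\n",
                    io_size, asize);
        }
        return (0);
}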

- George
Steven Hartland
2013-08-04 15:56:17 UTC
Permalink
----- Original Message -----
From: "George Wilson" <***@delphix.com>
To: "Steven Hartland" <***@multiplay.co.uk>
Cc: <***@lists.illumos.org>
Sent: Sunday, August 04, 2013 4:45 PM
Subject: Re: [zfs] zpool should not accept devices with a larger (incompatible) ashift


> On 8/3/13 11:48 AM, Steven Hartland wrote:
>> ----- Original Message ----- From: "George Wilson"
>> <***@delphix.com>
>>
>>> On 8/3/13 10:39 AM, Steven Hartland wrote:
>>>>> A faked and fully native disk aren't the same at the device layer
>>>>> in that a faked one will still address the disks at 512b offsets so
>>>>> said requests can be satisfied.
>>> The zio pipeline will adjust reads and writes to be 4k aligned. Are
>>> you not seeing that? I wouldn't think that we would send any
>>> non-aligned read/write requests to the device. Although we are
>>> looking at making a change to allow unaligned access to the label of
>>> the devices.
>>
>> It's my understanding this will only happen if the top level vdev was
>> 4k on creation, as it's this persistent ashift value which determines
>> how the requests are sent to the lower level subsystems.
>>
>> What this change allows is a subordinate vdev with a higher ashift
>> requirement than the top level vdev, which I believe will result in ZFS
>> still trying to access said device according to the top level ashift
>> and hence result in invalid offsets and sizes.
>>
>> I may have missed something but I've not seen any stage of the pipeline
>> which dynamically transforms the IO's for each leaf vdev ashift.
>>
>> I believe this is why the original ashift hard fail was there in the
>> first place.
> This should happen in zio_vdev_io_start():
>
> if (P2PHASE(zio->io_size, align) != 0) {
>         uint64_t asize = P2ROUNDUP(zio->io_size, align);
>         char *abuf = NULL;
>         if (zio->io_type == ZIO_TYPE_READ ||
>             zio->io_type == ZIO_TYPE_WRITE)
>                 abuf = zio_buf_alloc(asize);
>         ASSERT(vd == vd->vdev_top);
>         if (zio->io_type == ZIO_TYPE_WRITE) {
>                 bcopy(zio->io_data, abuf, zio->io_size);
>                 bzero(abuf + zio->io_size, asize - zio->io_size);
>         }
>         zio_push_transform(zio, abuf, asize, abuf ? asize : 0,
>             zio_subblock);
> }

Correct, but this transform is only done at the top-level vdev. The new
code allows a leaf vdev with a larger ashift than that to which the I/Os
have been transformed, hence the issue.
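
To make that concrete, the kind of guard being argued for (a sketch
only, not the actual illumos or FreeBSD diff) is a hard failure at
attach/replace time when the leaf needs a larger ashift than its
top-level vdev provides:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only -- not the actual change.  Refuse a replacement leaf that
 * needs a larger ashift than the top-level vdev was created with, since
 * the alignment transform above is sized by the top-level ashift.
 */
static int
check_replacement_ashift(uint64_t top_ashift, uint64_t newvd_ashift)
{
        if (newvd_ashift > top_ashift)
                return (EDOM);
        return (0);
}

int
main(void)
{
        /* ashift-12 (4KiB native) leaf offered to an ashift-9 (512B) vdev */
        printf("attach allowed: %s\n",
            check_replacement_ashift(9, 12) == 0 ? "yes" : "no");
        return (0);
}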

Regards
Steve
