Discussion:
SSD alignment, EFI label rpool support
Paul B. Henson
2013-08-01 03:22:02 UTC
Permalink
While putting together a Linux system, I found the current advice in
that ecosystem seems to be to align your SSD partitions not just to the
page size, but to the erase block size. Any thoughts on that
recommendation for an rpool, l2arc, or zil partition?

I'm using a couple of Crucial 256G m4 SSDs for my rpool, which I believe
have an 8k page size and 512k erase block size. Given the rpool
requirement for an SMI label, I would end up having to start the slice
on cylinder 15 (which, when added to the one cylinder used by the fdisk
label, results in cylinder 16) in order to get the alignment right,
which wastes something like 150MB :(. Not ridiculous on a 256G device,
but annoying. Has any progress been made on booting from EFI labeled
disks? I vaguely recall somebody was working on grub2 but I'm not sure
where it ended up. That would allow alignment without as much waste.

Or if erase block alignment doesn't matter that much for ZFS I could
just align to the page size…

Thanks…
Schlacta, Christ
2013-08-01 04:59:24 UTC
Permalink
Just align to 1MiB and you'll match the 512KiB erase block size and the
sector and logical block size. That's why pretty much everything has
defaulted to aligning to 1MiB boundaries.
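
As a quick check of the arithmetic behind this advice, the following Python sketch tests a 1 MiB offset against the sizes mentioned in the thread. The 8 KiB page and 512 KiB erase block figures are what the original poster believes his drives use, not verified values, and the 4 KiB sector entry and the helper names are assumptions added for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

def is_aligned(offset_bytes: int, boundary_bytes: int) -> bool:
    """Return True when offset_bytes falls exactly on a boundary."""
    return offset_bytes % boundary_bytes == 0

# Sizes taken from the thread, plus a 4 KiB sector as an assumed extra case.
BOUNDARIES = {
    "512 B legacy sector": 512,
    "4 KiB sector (assumed)": 4 * KIB,
    "8 KiB page": 8 * KIB,
    "512 KiB erase block": 512 * KIB,
}

def check_offset(offset_bytes: int) -> dict:
    """Map each boundary name to whether the offset is aligned to it."""
    return {name: is_aligned(offset_bytes, size)
            for name, size in BOUNDARIES.items()}

if __name__ == "__main__":
    for name, ok in check_offset(1 * MIB).items():
        print(f"1 MiB offset vs {name}: {'aligned' if ok else 'misaligned'}")
```

Because every size in the table is a power of two no larger than 1 MiB, any multiple of 1 MiB satisfies all of them at once.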
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Cedric Tineo
2013-08-01 07:18:43 UTC
Permalink
Does that mean that we're supposed to slice and align SSDs every time we add one as l2arc, zil or storage?

So it would mean that just doing zpool add cache ssd_denum is sub-optimal?

For those not familiar with the process, what would be the steps to do that alignment under OmniOS or OI? And FreeBSD?

Thanks,

Cedric Tineo
Schlacta, Christ
2013-08-01 15:59:53 UTC
Permalink
I don't know about ZFS upstream, but zfsonlinux already has patches to
ensure that SSDs are properly aligned by default.
Paul B. Henson
2013-08-01 19:20:18 UTC
Permalink
Post by Schlacta, Christ
Just align to 1MiB and you'll match the 512KiB erase block size and the
sector and logical block size. That's why pretty much everything has
defaulted to aligning to 1MiB boundaries.
Unless I'm confused (which is quite possible), that is not possible for
an rpool partition on an illumos-based OS, as grub can currently only
boot from disks with an SMI label, and SMI labels can only align
partitions on cylinder boundaries.
Richard Elling
2013-08-01 10:39:31 UTC
Permalink
Post by Paul B. Henson
Given the rpool requirement for an SMI label, I would end up having to start the slice on cylinder 15 (which, when added to the one cylinder used by the fdisk label, results in cylinder 16) in order to get the alignment right, which wastes something like 150MB :(.
Eh? Cylinders?
The ZFS label already reserves 8KB of space at the front so that it will not clobber an SMI label.
The actual data use begins at a 4MB offset, past the ZFS labels and reserved space.

In other words, why would you purposefully misalign?
-- richard

Sam Zaydel
2013-08-01 10:42:01 UTC
Permalink
Richard, thanks for clarifying. I was under the same belief and thought you
effectively should have to do nothing in this instance.
Jim Klimov
2013-08-01 12:42:35 UTC
Permalink
Post by Richard Elling
The ZFS label already reserves 8KB of space at the front so that it will
not clobber an SMI label.
The actual data use begins at a 4MB offset, past the ZFS labels and reserved space.
In other words, why would you purposefully misalign?
Well, technically - this is correct, except that this applies to
offsets within the "device" which you gave to ZFS to be a leaf
component of the pool. This device may be a classic Solaris slice
in a SMI label-table, possibly in MBR-backed partitions on x86, or
a whole partition (in either MBR or EFI definitions), or a file -
just to be complete.

What matters is that this container (slice/partition) usually does
not start at HDD sector 0 (and as history has shown, complex devices
such as "old-windows-compatible 4K AF drives" or RAID0-backed JBOD
LUNs or anything else may lead to the logical 0-offset not being
physically well aligned with hardware sectors either).

The same applies to cases where you give ZFS a "whole disk" and
it creates an EFI partition table according to its rules, and marks
"whole-disk usage" in the pool labels - but otherwise this is an
ordinary partition table (properly aligned by default, in belief
that physical 0 == logical 0).

I believe Paul's question, just like any question of this sort,
regarded the possible need to realign his partitions - or perhaps
a way to verify that they are aligned. In case of SSD, there is a
fresh twist regarding page size vs. erase-block size (and not yet
asked - a recommendation about recommended ZFS minblocksize for
such devices).

Now, since 512k is divisible by 8k, an offset of 512k or 1024k
for the partition which should contain the rpool should be good
for both types of alignment in question (note the next paragraph
though). While it may indeed be problematic to carve disks with
such precision via fdisk/format, one can use the command-line
"parted" to manage disk partitions, including MBR-style ones.
When the MBR partition for the rpool with the desired offset
(and Solaris or maybe Solaris2 type) is made, it can be sliced
with "format" in order to designate a container for rpool.
I believe, manually prepared partitions like this can also be
used in the Caiman installer, so you don't have to fuss with
"format" (the installer will overwrite your slicing anyway).

Note that in my sample box which I glanced at while writing this,
the zeroth "cylinder" (16065 * 512-byte "blocks" or 7.84MB) is
reserved on x86 for "boot", and the rpool starts at cylinder
number 1. This may mean that for proper alignment of the rpool,
its MBR partition may have to start at, for example, 8MB - 7.84MB
or 16384 - 16065 = 319 512-byte "blocks" (legacy "sectors", as still
used in partitioning terminology), give or take one ;) This way the
rpool's slice 0 would start at the physical device's logical
sector 16384 which is hopefully properly aligned for the IOs,
and ZFS's 4MB offset further into that would not contradict
anything.

Note that I've picked 8MB rather arbitrarily, as the next multiple
of 1MB after this "cylinder" size. The classical MBR layout
only reserves 63 sectors (and yes, the "tracks" have odd
sizes) before the first partition, which is what bootloaders
should be able to cope with. In my example I give ample room -
over 300 sectors ;) Some software (e.g. for low-level disk
archiving) may complain about offsets which are not whole
tracks, but otherwise this is quite usable.
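
The arithmetic in the two paragraphs above can be sketched in Python as follows. This is only an illustration under the assumptions stated in the post: a 16065-sector "cylinder", slice 0 starting exactly one cylinder into the MBR partition, and a 512 KiB erase block. The variable names are made up for this sketch, and the "give or take one" caveat still applies on a real disk.

```python
SECTOR = 512                  # legacy sector size in bytes
CYLINDER_SECTORS = 16065      # 255 * 63, the geometry seen on the sample box
MIB = 1024 * 1024

# Size of the reserved zeroth cylinder.
cylinder_bytes = CYLINDER_SECTORS * SECTOR
cylinder_mib = cylinder_bytes / MIB

# Pick the next whole-MiB boundary after one cylinder as the slice 0 start.
target_slice_start = 8 * MIB // SECTOR

# The MBR partition must begin one cylinder earlier than that.
partition_start = target_slice_start - CYLINDER_SECTORS
slice0_start = partition_start + CYLINDER_SECTORS

# ZFS places its data 4 MiB into the slice it was given.
zfs_data_start = slice0_start + 4 * MIB // SECTOR

# Sectors per 512 KiB erase block (assumed size), used for the final check.
erase_block_sectors = 512 * 1024 // SECTOR

if __name__ == "__main__":
    print(f"cylinder size:        {cylinder_mib:.2f} MiB")
    print(f"MBR partition start:  sector {partition_start}")
    print(f"slice 0 start:        sector {slice0_start}")
    print(f"ZFS data start:       sector {zfs_data_start}")
    print("erase-block aligned: ", zfs_data_start % erase_block_sectors == 0)
```

The same calculation works for any other whole-MiB target; only the amount of space left unused before the partition changes.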

HTH,
//Jim Klimov
Paul B. Henson
2013-08-01 19:39:10 UTC
Permalink
Post by Jim Klimov
What matters is that this container (slice/partition) usually does
not start at HDD sector 0
Yes, you understood my question exactly, thanks for expanding upon it.
Post by Jim Klimov
fresh twist regarding page size vs. erase-block size (and not yet
asked - a recommendation about recommended ZFS minblocksize for
such devices).
Oh no, I'm almost done tuning my install, not more possible options to
have to explore ;).
Post by Jim Klimov
such precision via fdisk/format, one can use the command-line
"parted" to manage disk partitions, including MBR-style ones.
Hmm, if I understand you correctly, you are recommending using a
non-illumos tool to create a Solaris MBR partition aligned on an
arbitrary sector rather than a cylinder, which would then allow the
beginning of slice zero in that partition to be in the right place? I
didn't think of that, that seems much simpler than trying to calculate a
cylinder offset of the slice relative to the beginning of the partition
that lines up.

Thanks much…
Jim Klimov
2013-08-01 20:21:51 UTC
Permalink
Post by Paul B. Henson
Hmm, if I understand you correctly, you are recommending using a
non-illumos tool to create a Solaris MBR partition aligned on an
arbitrary sector rather than a cylinder, which would then allow the
beginning of slice zero in that partition to be in the right place? I
didn't think of that, that seems much simpler than trying to calculate a
cylinder offset of the slice relative to the beginning of the partition
that lines up.
In short - yes.

As can be seen in src.illumos.org, GNU parted (or some derivative
thereof) is even part of the illumos-gate, so it is not a non-illumos
tool now ;)
Post by Paul B. Henson
Thanks much…
You are welcome, and I hope this does help :)

//Jim
Paul B. Henson
2013-08-02 02:24:22 UTC
Permalink
Post by Jim Klimov
As can be seen in src.illumos.org, GNU parted (or some derivative
thereof) is even part of the illumos-gate, so it is not a non-illumos
tool now ;)
Cool; I didn't know that.

It looks like Solaris 11 supports booting from an EFI-labeled rpool now;
given grub is GPL, presumably they had to release the modifications to
do that? I wonder how cleanly they would integrate into illumos; it
would be nice to get away from the requirement for SMI labels on rpool.
Dan McDonald
2013-08-02 03:06:32 UTC
Permalink
Post by Paul B. Henson
It looks like Solaris 11 supports booting from an EFI-labeled rpool now; given grub is GPL, presumably they had to release the modifications to do that? I wonder how cleanly they would integrate into illumos; it would be nice to get away from the requirement for SMI labels on rpool.
ISTR S11 switched to GRUB2. That would be interesting, but a non-trivial testing task, to say the least.

Dan
Garrett D'Amore
2013-08-02 16:22:26 UTC
Permalink
Post by Dan McDonald
ISTR S11 switched to GRUB2. That would be interesting, but a non-trivial testing task, to say the least.
That's correct. I believe Seth did a large chunk of that work.

I'd like to investigate another alternative altogether -- the BSD licensed loader that FreeBSD uses. I just haven't had cycles. :-)

- Garrett
Paul B. Henson
2013-08-01 19:30:42 UTC
Permalink
Post by Richard Elling
Eh? Cylinders?
The ZFS label already reserves 8KB of space at the front so that it will
not clobber an SMI label.
The actual data use begins at a 4MB offset, past the ZFS labels and reserved space.
In other words, why would you purposefully misalign?
Well, obviously I wouldn't *purposefully* misalign, but I can't rule out
the possibility of *incompetently* misaligning ;).

I think you are talking about the case when you give zfs an entire disk?
In that case, yes, I understand there is no extra magic to be performed.

However, I am specifically talking about the case of the rpool, in which
you currently cannot give zfs the entire disk, but must give it a slice
of an SMI labeled disk.

In that case, you have the fdisk partitioning, in which from what I
understand the Solaris partition typically starts with a one cylinder
offset. So, when you give zfs the first slice of the Solaris partition,
it already has an offset. Are you saying zfs refers to the fdisk
partition, figures out where on the disk the first slice actually
starts, and then just does the right thing? I did not believe that to be
the case...
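
To make the concern concrete, here is a small Python sketch that computes where a slice lands on the raw disk when both the fdisk partition and the slice are placed on cylinder boundaries. The 16065-sector cylinder is the geometry quoted elsewhere in this thread and is only an assumption here (a given disk may report something different), and the function names are hypothetical.

```python
SECTOR = 512                 # bytes per legacy sector
CYLINDER_SECTORS = 16065     # assumed geometry (255 * 63); check what format reports
KIB = 1024

def slice_start_bytes(partition_cyl: int, slice_cyl: int) -> int:
    """Absolute byte offset of a slice whose fdisk partition starts at
    partition_cyl and which itself starts slice_cyl cylinders into it."""
    return (partition_cyl + slice_cyl) * CYLINDER_SECTORS * SECTOR

def misalignment(offset_bytes: int, boundary_bytes: int) -> int:
    """Bytes past the previous boundary; 0 means aligned."""
    return offset_bytes % boundary_bytes

if __name__ == "__main__":
    # ZFS's own 4 MiB data offset is a multiple of every boundary below,
    # so it preserves whatever alignment the slice start has.
    layouts = {"slice at cylinder 1": (1, 1), "slice at cylinder 15": (1, 15)}
    boundaries = (("4 KiB", 4 * KIB), ("8 KiB", 8 * KIB), ("512 KiB", 512 * KIB))
    for label, (part_cyl, slice_cyl) in layouts.items():
        start = slice_start_bytes(part_cyl, slice_cyl)
        for name, size in boundaries:
            print(f"{label}: {misalignment(start, size)} bytes past a {name} boundary")
```

Since the cylinder size is an odd number of sectors, whether a given cylinder count lines up depends entirely on the geometry the disk reports, so the numbers should be recomputed for the actual device.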
Matthew Ahrens
2013-08-01 20:27:23 UTC
Permalink
It's hard for me to see how aligning the partition on a greater granularity
than the physical sector size (aka "ashift", typically 4k for modern large
devices) would help anything. Regardless of partition alignment, ZFS is
going to write to whatever sectors it wants.

--matt
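
For readers unfamiliar with the term, ashift is the base-2 logarithm of the sector size ZFS uses for a vdev, so 4k sectors correspond to ashift 12. A minimal Python sketch of that relationship follows; the function names are made up for illustration and are not part of any ZFS tooling.

```python
def ashift_for(sector_bytes: int) -> int:
    """Return log2(sector_bytes); the size must be a power of two."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1

def is_sector_aligned(offset_bytes: int, ashift: int) -> bool:
    """True when offset_bytes is a multiple of 2**ashift."""
    return offset_bytes % (1 << ashift) == 0

if __name__ == "__main__":
    for size in (512, 4096, 8192):
        print(f"{size}-byte sectors -> ashift {ashift_for(size)}")
```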
Jim Klimov
2013-08-01 21:13:41 UTC
Permalink
Post by Matthew Ahrens
It's hard for me to see how aligning the partition on a greater
granularity than the physical sector size (aka "ashift", typically 4k
for modern large devices) would help anything. Regardless of partition
alignment, ZFS is going to write to whatever sectors it wants.
Well, it is my understanding (as detailed in another post) that the
partitioning tables (MBR/SMI, GPT/EFI) still count in 512-byte units
as the minimum size, over whatever hardware sector sizes.

So, on one hand, all this accounting has to be done at a finer
granularity than 4k sectors or 8k pages, and on the other hand it
is possible to make a mistake in all this. The goal of alignment
is to keep ZFS block writes from straddling partial hardware sectors.

One potential for errors, which may break ZFS whole-drive usage with
its default EFI label generation, is the alleged existence of 4k AF
drives with a DIP-switch or something like that which shifts the LBAs
by one legacy sector, so that the first partition which starts at LBA
sector number 63 would in fact start on a physical 4k sector boundary.
Apparently, this helps optimal usage of new hardware from old OSes
like Windows XP by default, which would format the disk with a 4k
clustered NTFS or FAT32 and use the hardware sectors wholly and well
aligned, to store their FS clusters, and not care or know about either
alignment or non-512b sector sizing.
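
The effect of such a shift can be illustrated with a short Python sketch. It assumes the drive adds exactly one 512-byte sector to every logical address, which is one reading of the behaviour described above and has not been verified against real hardware. The start sector 2048 stands in for any partition aligned under the belief that logical 0 equals physical 0; it is a hypothetical value, not what any particular tool writes.

```python
SECTOR = 512      # logical (legacy) sector size in bytes
PHYSICAL = 4096   # physical sector size of a 4k AF drive

def physical_offset(lba: int, shift_sectors: int = 0) -> int:
    """Byte position on the media for a logical sector number,
    given an assumed fixed shift applied by the drive."""
    return (lba + shift_sectors) * SECTOR

def starts_on_physical_sector(lba: int, shift_sectors: int = 0) -> bool:
    """True when the logical sector begins on a 4k physical boundary."""
    return physical_offset(lba, shift_sectors) % PHYSICAL == 0

if __name__ == "__main__":
    for lba in (63, 2048):
        for shift in (0, 1):
            ok = starts_on_physical_sector(lba, shift)
            print(f"start LBA {lba}, shift {shift}: "
                  f"{'aligned' if ok else 'misaligned'}")
```

Under this assumption the shift fixes a partition that starts at sector 63 and breaks one that was already aligned, which is the failure mode described for whole-disk ZFS labels.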

My 2c,
//Jim Klimov
Paul B. Henson
2013-08-02 02:30:20 UTC
Permalink
Post by Matthew Ahrens
It's hard for me to see how aligning the partition on a greater
granularity than the physical sector size (aka "ashift", typically 4k
for modern large devices) would help anything. Regardless of partition
alignment, ZFS is going to write to whatever sectors it wants.
Well, as I originally indicated, this suggestion came from the Linux
world, and was perhaps intended more for ext3/4 use cases; they discuss
setting the stride and stripe-width options for ext filesystems to
optimize for the page/erase block size. I wasn't really sure if that was
relevant for ZFS, which is why I asked :).

Thanks…