Discussion:
Call for collaborators on implementing SCSI Unmap/SATA TRIM functionality
Sašo Kiselkov
2013-02-27 00:30:07 UTC
Permalink
So I've been working on getting some initial support for SCSI Unmap and
SATA TRIM into ZFS and my work can so far be viewed here:

http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/

My changes so far consist of:

1) Altering dkioc_free_t to take a linked list of dkioc_free_extent_t's,
so that we can exploit the underlying commands to their fullest
potential (issuing a single Unmap/TRIM command for multiple extents,
not one command per extent).
2) Altering stmf_sbd/sbd.c, stmf_sbd/sbd_scsi.c and zfs/zvol.c to
exploit the above API change.
3) Implemented walking of the freed_map in metaslab_sync_done to notify
the vdevs of freed extents (via DKIOCFREE zio's).
4) Implemented support for DKIOCFREE zio's in vdev_disk_io_start to
issue the ioctl to the underlying sd layer.
5) Tested 3&4 to at least make sure it doesn't bomb on boot.

The current implementation is limited to issuing DKIOCFREE ioctl's only
for leaf disk-vdevs and mirror vdevs, since raidz vdevs can vary their
layout geometry and we'd need to do some smart extent offset
recalculation there to get it right for leaf vdevs.
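
For a rough idea of the shape of the reworked dkioc_free_t from change 1
above, this is approximately what the structures look like (a simplified
sketch; anything not shown here, e.g. flags, is omitted or assumed):

    typedef struct dkioc_free_extent {
            uint64_t dfe_start;                     /* start of extent */
            uint64_t dfe_length;                    /* length of extent */
            struct dkioc_free_extent *dfe_next;     /* next extent, or NULL */
    } dkioc_free_extent_t;

    typedef struct dkioc_free {
            /* other members omitted from this sketch */
            dkioc_free_extent_t *df_first;          /* head of extent list */
            dkioc_free_extent_t *df_last;           /* tail, for cheap appends */
    } dkioc_free_t;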

My questions to the kind people here are:

1) Won't the changes in dkioc_free_t break external applications? My
guess is no, since I'm not aware of anybody using this interface
externally (other than COMSTAR, which I updated).
2) Do you think doing unmap in metaslab_sync_done is a good way to go
about it? In particular, I'm not sure if this won't make implementing
it for raidz at a later date harder or impossible.
3) I'm a bit dubious about whether I understand space_map internals
right and am walking the extents in freed_map in metaslab_vdev_free
correctly (the code for space_maps is completely undocumented, so
I'm inferring how they work by learning from their other uses).

Specifically, I'd also like to ask for somebody to please help me
implement DKIOCFREE ioctl support in sd.c (i.e. encode a SCSI Unmap and
SATA TRIM) - I believe this requires digging around deep in driver
bowels and I don't feel like I'm skilled enough to do it just yet. At
the moment the ZFS ioctl calls simply hit ldi_ioctl which returns an
ENOTSUP, which prompts the vdev to stop making these calls (subsequent
DKIOCFREE zio's are not even attempted and simply return ENOTSUP right
away). Another issue is that while I have documentation for the SCSI
Unmap command structure, I don't have any for SATA TRIM (subscription
required - yay!). So if by some lucky coincidence I figure out how to add
support for DKIOCFREE in sd.c, and somebody could please provide me with
the correct command structure for SATA, I'd greatly appreciate it.

Please let me know what you think. Thanks!

Cheers,
--
Saso
Garrett D'Amore
2013-02-27 05:07:14 UTC
Permalink
I'd like to help with the sd.c changes. It aligns with some other work I have planned.

- Garrett
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI Unmap and
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) Altering dkioc_free_t to take a linked list of dkioc_free_extent_t's,
so that we can exploit the underlying commands to their fullest
potential (issuing a single Unmap/TRIM command for multiple extents,
not one command per extent).
2) Altering stmf_sbd/sbd.c, stmf_sbd/sbd_scsi.c and zfs/zvol.c to
exploit the above API change.
3) Implemented walking of the freed_map in metaslab_sync_done to notify
the vdevs of freed extents (via DKIOCFREE zio's).
4) Implemented support for DKIOCFREE zio's in vdev_disk_io_start to
issue the ioctl to the underlying sd layer.
5) Tested 3&4 to at least make sure it doesn't bomb on boot.
The current implementation is limited to issuing DKIOCFREE ioctl's only
for leaf disk-vdevs and mirror vdevs, since raidz vdevs can vary their
layout geometry and we'd need to do some smart extent offset
recalculation there to get it right for leaf vdevs.
1) Won't the changes in dkioc_free_t break external applications? My
guess is no, since I'm not aware of anybody using this interface
externally (other than COMSTAR, which I updated).
2) Do you think doing unmap in metaslab_sync_done is a good way to go
about it? In particular, I'm not sure if this won't make implementing
it for raidz at a later date harder or impossible.
3) I'm a bit dubious about whether I understand space_map internals
right and am walking the extents in freed_map in metaslab_vdev_free
correctly (the code for space_maps is completely undocumented, so
I'm inferring how they work by learning from their other uses).
Specifically, I'd also like to ask for somebody to please help me
implement DKIOCFREE ioctl support in sd.c (i.e. encode a SCSI Unmap and
SATA TRIM) - I believe this requires digging around deep in driver
bowels and I don't feel like I'm skilled enough to do it just yet. At
the moment the ZFS ioctl calls simply hit ldi_ioctl which returns an
ENOTSUP, which prompts the vdev to stop making these calls (subsequent
DKIOCFREE zio's are not even attempted and simply return ENOTSUP right
away). Another issue is that while I have documentation for the SCSI
Unmap command structure, I don't for SATA TRIM (subscription required -
yay!). If by some lucky coincidence I figure out how to add support for
DKIOCFREE in sd.c, if somebody could please provide me the correct
command structure for SATA, I'd greatly appreciate.
Please let me know what you think. Thanks!
Cheers,
--
Saso
Sašo Kiselkov
2013-02-27 09:25:52 UTC
Permalink
Post by Garrett D'Amore
I'd like to help with the sd.c changes. It aligns with some other work I have planned.
Wonderful! Could you please have a look at it then? I'm not quite
familiar yet with how SCSI commands are actually generated on the wire
(I know it's got something to do with uscsi_cmd) and how this interacts
with SATA. I know exactly where to put the ioctl in sd.c, I'm just too
dumb to implement its handler routine yet (plus the SATA equivalents
totally escape me so far).

Hope we can get this show on the road.

Cheers,
--
Saso
Keith Wesolowski
2013-02-27 16:23:35 UTC
Permalink
Post by Sašo Kiselkov
Wonderful! Could you please have a look at it then? I'm not quite
familiar yet with how SCSI commands are actually generated on the wire
(I know it's got something to do with uscsi_cmd) and how this interacts
with SATA. I know exactly where to put the ioctl in sd.c, I'm just too
dumb to implement its handler routine yet (plus the SATA equivalents
totally escape me so far).
Recall that sata.c contains a SATL, and SATA devices are presented to
userland as if they were SCSI. So there are really 2 paths for commands
to take to a SATA target, depending on the type of controller to which
they're attached. If they're attached to a native SATA controller like
AHCI, they'll go through the sata.c SATL. If they're attached to a
controller like LSI's that support both SAS and SATA end devices,
they'll go to the controller as SCSI commands and then through the SATL
in the card's firmware. Be careful with this, though; not every
controller supports translation of this command. See for instance
http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?sSessionID=&inc=8039&caller=%7e%2fFindAnswers.aspx%3ftxtCriteria%3dtrim%26sSessionid%3d.
I would suggest that since UNMAP support is optional anyway, we should
simply treat a SATA device behind a controller that cannot translate
UNMAP as if it does not support TRIM. While one could instead attempt
to send an ATA PASSTHROUGH command, the reality is that the controllers
that can't translate UNMAP are either RAID controllers, for which the
effect would be undefined anyway, or are very old and probably not worth
trying to use what is effectively a performance optimisation.

In any case, everything upstack should consider this the UNMAP command,
including sd. It will almost certainly be necessary to add this to our
software SATL if you wish it to work with SATA end devices. The
translation is defined in SAT-3; see the current draft at
http://www.t10.org/cgi-bin/ac.pl?t=f&f=sat3r04.pdf. You'll want to
start at sata_scsi_start() in the code.
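
(For reference, on the ATA side DATA SET MANAGEMENT with the TRIM bit set
takes a 512-byte payload of 8-byte LBA range entries, up to 64 per block.
A minimal sketch of the packing - the helper name is made up, so verify
against the ACS and SAT-3 drafts before relying on it:)

    #include <sys/types.h>

    /*
     * Each DSM/TRIM range entry packs a 48-bit starting LBA in bits 0-47
     * and a 16-bit sector count in bits 48-63 (stored little-endian on
     * the wire); a count of 0 marks an unused entry, so ranges longer
     * than 65535 sectors must be split across multiple entries.
     */
    static uint64_t
    dsm_trim_range(uint64_t lba, uint64_t nsectors)
    {
            return (((nsectors & 0xffffULL) << 48) |
                (lba & 0x0000ffffffffffffULL));
    }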
Sašo Kiselkov
2013-02-27 16:31:07 UTC
Permalink
Post by Keith Wesolowski
Post by Sašo Kiselkov
Wonderful! Could you please have a look at it then? I'm not quite
familiar yet with how SCSI commands are actually generated on the wire
(I know it's got something to do with uscsi_cmd) and how this interacts
with SATA. I know exactly where to put the ioctl in sd.c, I'm just too
dumb to implement its handler routine yet (plus the SATA equivalents
totally escape me so far).
Recall that sata.c contains a SATL, and SATA devices are presented to
userland as if they were SCSI. So there are really 2 paths for commands
to take to a SATA target, depending on the type of controller to which
they're attached. If they're attached to a native SATA controller like
AHCI, they'll go through the sata.c SATL. If they're attached to a
controller like LSI's that support both SAS and SATA end devices,
they'll go to the controller as SCSI commands and then through the SATL
in the card's firmware. Be careful with this, though; not every
controller supports translation of this command. See for instance
http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?sSessionID=&inc=8039&caller=%7e%2fFindAnswers.aspx%3ftxtCriteria%3dtrim%26sSessionid%3d.
I would suggest that since UNMAP support is optional anyway, we should
simply treat a SATA device behind a controller that cannot translate
UNMAP as if it does not support TRIM. While one could instead attempt
to send an ATA PASSTHROUGH command, the reality is that the controllers
that can't translate UNMAP are either RAID controllers, for which the
effect would be undefined anyway, or are very old and probably not worth
trying to use what is effectively a performance optimisation.
In any case, everything upstack should consider this the UNMAP command,
including sd. It will almost certainly be necessary to add this to our
software SATL if you wish it to work with SATA end devices. The
translation is defined in SAT-3; see the current draft at
http://www.t10.org/cgi-bin/ac.pl?t=f&f=sat3r04.pdf. You'll want to
start at sata_scsi_start() in the code.
Thanks for the helpful pointers and reading, I'll look into it.

Cheers,
--
Saso
Garrett D'Amore
2013-02-28 10:44:44 UTC
Permalink
Post by Keith Wesolowski
Post by Sašo Kiselkov
Wonderful! Could you please have a look at it then? I'm not quite
familiar yet with how SCSI commands are actually generated on the wire
(I know it's got something to do with uscsi_cmd) and how this interacts
with SATA. I know exactly where to put the ioctl in sd.c, I'm just too
dumb to implement its handler routine yet (plus the SATA equivalents
totally escape me so far).
Recall that sata.c contains a SATL, and SATA devices are presented to
userland as if they were SCSI. So there are really 2 paths for commands
to take to a SATA target, depending on the type of controller to which
they're attached. If they're attached to a native SATA controller like
AHCI, they'll go through the sata.c SATL. If they're attached to a
controller like LSI's that support both SAS and SATA end devices,
they'll go to the controller as SCSI commands and then through the SATL
in the card's firmware. Be careful with this, though; not every
controller supports translation of this command. See for instance
http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?sSessionID=&inc=8039&caller=%7e%2fFindAnswers.aspx%3ftxtCriteria%3dtrim%26sSessionid%3d.
I would suggest that since UNMAP support is optional anyway, we should
simply treat a SATA device behind a controller that cannot translate
UNMAP as if it does not support TRIM. While one could instead attempt
to send an ATA PASSTHROUGH command, the reality is that the controllers
that can't translate UNMAP are either RAID controllers, for which the
effect would be undefined anyway, or are very old and probably not worth
trying to use what is effectively a performance optimisation.
In any case, everything upstack should consider this the UNMAP command,
including sd. It will almost certainly be necessary to add this to our
software SATL if you wish it to work with SATA end devices. The
translation is defined in SAT-3; see the current draft at
http://www.t10.org/cgi-bin/ac.pl?t=f&f=sat3r04.pdf. You'll want to
start at sata_scsi_start() in the code.
Pawel Jakub Dawidek
2013-02-27 12:30:47 UTC
Permalink
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI Unmap and
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) Altering dkioc_free_t to take a linked list of dkioc_free_extent_t's,
so that we can exploit the underlying commands to their fullest
potential (issuing a single Unmap/TRIM command for multiple extents,
not one command per extent).
2) Altering stmf_sbd/sbd.c, stmf_sbd/sbd_scsi.c and zfs/zvol.c to
exploit the above API change.
3) Implemented walking of the freed_map in metaslab_sync_done to notify
the vdevs of freed extents (via DKIOCFREE zio's).
4) Implemented support for DKIOCFREE zio's in vdev_disk_io_start to
issue the ioctl to the underlying sd layer.
5) Tested 3&4 to at least make sure it doesn't bomb on boot.
The current implementation is limited to issuing DKIOCFREE ioctl's only
for leaf disk-vdevs and mirror vdevs, since raidz vdevs can vary their
layout geometry and we'd need to do some smart extent offset
recalculation there to get it right for leaf vdevs.
Not sure if you are aware, but I implemented TRIM support for ZFS in
FreeBSD, which was then ported to ZFSOnLinux and improved.

I took a different approach, but it seems to work, and I know it works on
several hundred production machines at this point that run ZFS pools on
SSDs.

Having said that, your approach of using metaslab_sync_done() seems
much easier. I'm not sure what the potential problems with the RAIDZ
layout are; my approach works fine there. Another concern is that I delay
the actual TRIM by a configured number of TXGs (64 by default) just so
'zpool import -F' keeps working to some reasonable extent. Not sure if
with metaslab_sync_done() you have this flexibility.

Does your approach guarantee that every freed block will be TRIMmed?
Mine doesn't, but it would be a nice property to have, as with such
guarantees it would be possible to implement TRIM on the underlying device
as secure delete (where we overwrite the given blocks with random data
several times). This is something I have wanted to implement in GELI (disk
encryption software I wrote for FreeBSD) for a long time, but it needs
proper support from upper layers, like the file system.
Post by Sašo Kiselkov
1) Won't the changes in dkioc_free_t break external applications? My
guess is no, since I'm not aware of anybody using this interface
externally (other than COMSTAR, which I updated).
2) Do you think doing unmap in metaslab_sync_done is a good way to go
about it? In particular, I'm not sure if this won't make implementing
it for raidz at a later date harder or impossible.
3) I'm a bit dubious about whether I understand space_map internals
right and am walking the extents in freed_map in metaslab_vdev_free
correctly (the code for space_maps is completely undocumented, so
I'm inferring how they work by learning from their other uses).
Specifically, I'd also like to ask for somebody to please help me
implement DKIOCFREE ioctl support in sd.c (i.e. encode a SCSI Unmap and
SATA TRIM) - I believe this requires digging around deep in driver
bowels and I don't feel like I'm skilled enough to do it just yet. At
the moment the ZFS ioctl calls simply hit ldi_ioctl which returns an
ENOTSUP, which prompts the vdev to stop making these calls (subsequent
DKIOCFREE zio's are not even attempted and simply return ENOTSUP right
away). Another issue is that while I have documentation for the SCSI
Unmap command structure, I don't for SATA TRIM (subscription required -
yay!). If by some lucky coincidence I figure out how to add support for
DKIOCFREE in sd.c, if somebody could please provide me the correct
command structure for SATA, I'd greatly appreciate.
Please let me know what you think. Thanks!
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-02-27 13:28:20 UTC
Permalink
Post by Pawel Jakub Dawidek
Not sure if you are aware, but I implemented TRIM support for ZFS
in FreeBSD, which was ported then to ZFSOnLinux and improved.
I took different approach, but it seems to work and I know it works
on several hundred production machines at this point that run ZFS
pools on SSDs.
Interesting, I'll take a look.
Post by Pawel Jakub Dawidek
Having said that, your approach with using metaslab_sync_done()
seems much easier. Not sure what are the potential problems with
RAIDZ layout, my approach works fine there. Another concern is that
I delay actual TRIM by configured number of TXGs (64 by default)
just so 'zpool import -F' keeps working to some reasonable extend.
Not sure if with metaslab_sync_done() you have this flexibility.
Not at this moment - once the metaslab syncs, the data is gone. Of
course introducing a commit delay is possible, though it would require
tracking further updates to extents in a space map (so that we don't
TRIM extents that have been reallocated to hold new objects). Not
terribly difficult and might be something I'll look into once we get
Unmap/TRIM support in Illumos' sd, but until then I wanted to avoid
the complexities of caching TRIM requests in an early implementation
that's not really verified to be correct.
Post by Pawel Jakub Dawidek
Does your approach guarantee that every freed block will be
TRIMmed?
Yes, we closely track the metaslab allocator, so the only way we can
fail to TRIM extents is if the allocator is leaking (and in places
which don't use the metaslab allocator; IIRC that's only the L2ARC at
this moment).
Post by Pawel Jakub Dawidek
Mine doesn't, but it would be a nice property to have, as with
such guarantees it would be possible to implement TRIM on
underlying device as secure delete (where we overwrite the given
blocks with random data several times). This is something I wanted
to implement in GELI (disk encryption software I implemented for
FreeBSD) for a long time, but it needs proper support from upper
layers, like file system.
Well, with TRIM support in metaslab, SSDs will simply shred (erase)
the flash cells of deleted blocks. That being said, I wouldn't trust
firmware to "do the right thing" if I have data of sufficiently high
value - it all comes down to a cost-benefit tradeoff (even overwriting
an SSD via "dd" isn't guaranteed to work 100% due to flash cell
reallocation).

As for raidz, I *think* it should be possible to get it working right
(though I haven't looked at it in detail). What this essentially
entails is that we need to get rid of the leaf-vdev-fanout code in
zio_ioctl and relocate the vdev-specific bits into a new vdev_ops_t
operation, or perhaps split it up (e.g. add a "dva_to_phys" operation
to recompute DVAs into physical component device offsets). For most vdev
types, this would simply be a "return what you called me with" (since
DVAs correspond directly to physical disk addresses), but raidz might
need to do some more trickery there - again, I haven't looked at the
code, so I don't know yet. My guess is that it should be doable, since
(at least as far as I know) this condition holds true:

if BP1.DVA < BP2.DVA then
    foreach (raidz_COL1 \in BP1), (raidz_COL2 \in BP2)
        raidz_COL1.phys_addr < raidz_COL2.phys_addr
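
A purely illustrative sketch of the kind of vdev_ops_t addition I have
in mind (the operation name, signature and callback are all made up):

    /*
     * Hypothetical new vdev op: translate an extent in this vdev's
     * address space into (child vdev, offset, size) tuples.  Plain
     * disks and mirrors would pass the extent through unchanged;
     * raidz would split it into per-column extents.
     */
    typedef void (*vdev_dva_to_phys_func_t)(vdev_t *vd,
        uint64_t offset, uint64_t size,
        void (*func)(vdev_t *cvd, uint64_t c_offset, uint64_t c_size,
        void *arg), void *arg);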

Hope this makes sense.

Cheers,

- --
Saso
Pawel Jakub Dawidek
2013-02-28 09:21:21 UTC
Permalink
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
Does your approach guarantee that every freed block will be
TRIMmed?
Yes, we closely track the metaslab allocator, so the only way we can
fail to TRIM extents is if the allocator is leaking (and in places
which don't use the metaslab allocator; IIRC that's only the L2ARC at
this moment).
Post by Pawel Jakub Dawidek
Mine doesn't, but it would be a nice property to have, as with
such guarantees it would be possible to implement TRIM on
underlying device as secure delete (where we overwrite the given
blocks with random data several times). This is something I wanted
to implement in GELI (disk encryption software I implemented for
FreeBSD) for a long time, but it needs proper support from upper
layers, like file system.
Well, with TRIM support in metaslab, SSDs will simply shred (erase)
the flash cells of deleted blocks. That being said, I wouldn't trust
firmware to "do the right thing" if I have data of sufficiently high
value - it all comes down to a cost-benefit tradeoff (even overwriting
an SSD via "dd" isn't guaranteed to work 100% due to flash cell
reallocation).
Secure delete wouldn't really work on SSDs, because of transparent
relocation during writes, but it can still be used for regular disks.

On my company's appliance we store very sensitive data and we use ZFS on
top of GELI on top of GPT partition on top of regular disks:

pool: data
state: ONLINE
scan: none requested
config:

NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gpt/data0.eli ONLINE 0 0 0
gpt/data1.eli ONLINE 0 0 0
gpt/data2.eli ONLINE 0 0 0
gpt/data3.eli ONLINE 0 0 0
gpt/data4.eli ONLINE 0 0 0
gpt/data5.eli ONLINE 0 0 0

ZFS sends TRIM to GELI, not to the disks directly. GELI can choose how
to interpret TRIM: it can either pass it on or convert it to writes with
random data.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Garrett D'Amore
2013-02-28 10:51:29 UTC
Permalink
Post by Pawel Jakub Dawidek
Secure delete wouldn't really work on SSDs, because of transparent
relocation during writes, but it can still be used for regular disks.
*Wrong*. It *may* work for some drives, but spinning rust can relocate logical blocks just as easily as SSDs. It is the case that historically drives had little or no firmware logic, and so they didn't do this. But modern drives have increasingly sophisticated firmwares, and start to exhibit behaviors that are more traditionally associated with SSDs. (For example, data blocks on a drive may be duplicated, relocated for performance reasons, or located in non-volatile caches on hybrid drives.)

The only way to be sure data is physically removed on a drive in a way that is irrecoverable is:

a) Use a manufacturer-supplied secure erase function (AFAIK there is no "standard" for this, but different drive manufacturers have different ways of doing this).

b) Physically *destroy* the drive.

Militaries and government organizations that deal with highly classified data always opt for "b", possibly after first performing "a".

- Garrett
Pawel Jakub Dawidek
2013-02-28 20:37:56 UTC
Permalink
Post by Garrett D'Amore
Post by Pawel Jakub Dawidek
Secure delete wouldn't really work on SSDs, because of transparent
relocation during writes, but it can still be used for regular disks.
*Wrong*. It *may* work for some drives, but spinning rust can relocate logical blocks just as easily as SSDs. It is the case that historically drives had little or no firmware logic, and so they didn't do this. But modern drives have increasingly sophisticated firmwares, and start to exhibit behaviors that are more traditionally associated with SSDs. (For example, data blocks on a drive may be duplicated, relocated for performance reasons, or located in non-volatile caches on hybrid drives.)
Regular spinning disks (not SSDs and not hybrids) can relocate blocks on
write when the write fails, which is a problem of course, but a much
smaller one than in the case of SSDs, where it is simply totally unpredictable.
Post by Garrett D'Amore
a) Use a manufacturer supplied secure erase function (AFAIK there is no "standard" for this, but different drive manufacturers have different ways of doing this.
b) Physically *destroy* the drive.
Militaries and government organizations that deal with highly classified data always opt for "b", possibly after first performing "a".
Sure, for top secret and secret levels destroying the disks according to
the given procedure is of course a must, but for lower levels it isn't.
Using encryption itself might be enough, but having an option to
overwrite freed regions with random data is also a nice option to have.

Note that using block-level encryption software doesn't protect your
data when the system is running and the encrypted device is "unlocked",
so by breaking in you can get old data by copying the entire block device
sector by sector. One of our customers is a printing company that has to
guarantee data retention to some of its customers - a bank delivers
passwords that have to be printed on secure envelopes and the printing
company can keep the data on its systems only for X days. After that the
data has to be removed. Destroying the disks is out of the question on one
hand, but on the other hand a simple rm(1) isn't enough in case of a system
compromise some time later. Overwriting freed blocks fits nicely into
that picture.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-02-27 17:42:21 UTC
Permalink
Post by Pawel Jakub Dawidek
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI
Unmap and SATA TRIM into ZFS and my work can so far be viewed
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) Altering dkioc_free_t to take a linked list of
dkioc_free_extent_t's, so that we can exploit the underlying
commands to their fullest potential (issuing a single Unmap/TRIM
command for multiple extents, not one command per extent). 2)
Altering stmf_sbd/sbd.c, stmf_sbd/sbd_scsi.c and zfs/zvol.c to
exploit the above API change. 3) Implemented walking of the
freed_map in metaslab_sync_done to notify the vdevs of freed
extents (via DKIOCFREE zio's). 4) Implemented support for
DKIOCFREE zio's in vdev_disk_io_start to issue the ioctl to the
underlying sd layer. 5) Tested 3&4 to at least make sure it
doesn't bomb on boot.
The current implementation is limited to issuing DKIOCFREE
ioctl's only for leaf disk-vdevs and mirror vdevs, since raidz
vdevs can vary their layout geometry and we'd need to do some
smart extent offset recalculation there to get it right for leaf
vdevs.
Not sure if you are aware, but I implemented TRIM support for ZFS
in FreeBSD, which was ported then to ZFSOnLinux and improved.
I took different approach, but it seems to work and I know it works
on several hundred production machines at this point that run ZFS
pools on SSDs.
Having said that, your approach with using metaslab_sync_done()
seems much easier. Not sure what are the potential problems with
RAIDZ layout, my approach works fine there. Another concern is that
I delay actual TRIM by configured number of TXGs (64 by default)
just so 'zpool import -F' keeps working to some reasonable extend.
Not sure if with metaslab_sync_done() you have this flexibility.
Does your approach guarantee that every freed block will be
TRIMmed? Mine doesn't, but it would be a nice property to have, as
with such guarantees it would be possible to implement TRIM on
underlying device as secure delete (where we overwrite the given
blocks with random data several times). This is something I wanted
to implement in GELI (disk encryption software I implemented for
FreeBSD) for a long time, but it needs proper support from upper
layers, like file system.
So I've had a look over how you did it and I find it to be, well,
somewhat counterintuitive. From what I understand, you're managing
free space regions on a per-leaf-vdev basis, but instead of using a
top-down approach from the allocator (which already has this
information), you are working from the bottom-up trying to figure it
out post-hoc from the distribution of zio's - is this correct?

Cheers,
- --
Saso
Pawel Jakub Dawidek
2013-02-27 21:25:27 UTC
Permalink
Post by Sašo Kiselkov
So I've had a look over how you did it and I find it to be, well,
somewhat counterintuitive. From what I understand, you're managing
free space regions on a per-leaf-vdev basis, but instead of using a
top-down approach from the allocator (which already has this
information), you are working from the bottom-up trying to figure it
out post-hoc from the distribution of zio's - is this correct?
Correct. In the way I did it, I still have to track write zios in a
similar fashion. I didn't want to lengthen the syncing phase and block
writes any longer. I use a separate thread that can run at any time and
send TRIMs while other pool activity is going on. In your implementation
you have to be sure no writes are going on in parallel. TRIMs are
usually very fast even for large regions, so it might not be a big deal.

BTW, other interesting things we are doing are fully TRIMming cache vdevs
on pool import and fully TRIMming vdevs on add/replace.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-02-27 21:38:31 UTC
Permalink
Post by Pawel Jakub Dawidek
Correct. In the way I did it, I still have to track write zios in
similar fashion. I didn't want to increase syncing phase and block
writes any longer. I use separate thread that can run at any time
and send TRIMs while other pool activity is going on. In your
implementation you have to be sure no writes are going on in
parallel. TRIMs are usually very fast even for large regions, so it
might not be a big deal.
Well, since we're doing the ioctl's as zio_nowait(zio_ioctl(...)), the
caller won't get blocked anyway. The UNMAP/TRIM commands will get
sandwiched in between regular SCSI traffic to the underlying layers,
and unless the device is reordering SCSI commands on its own (which is
already a "Bad Thing(tm)" for ZFS), there's no way we're going to ever
WRITE to an extent before we UNMAP it. Please correct me if I'm wrong.
Post by Pawel Jakub Dawidek
BTW. Other interesting things we are doing is to fully TRIM cache
vdevs on pool import
This is going to be a bummer with persistent L2ARC (which is already
under review, though it needs some more love).
Post by Pawel Jakub Dawidek
fully TRIM vdevs on add/replace.
Interesting, might be worth looking at, if only to facilitate good
clean housekeeping.

Cheers,
- --
Saso
Etienne Dechamps
2013-02-27 21:55:49 UTC
Permalink
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
Correct. In the way I did it, I still have to track write zios in
similar fashion. I didn't want to increase syncing phase and block
writes any longer. I use separate thread that can run at any time
and send TRIMs while other pool activity is going on. In your
implementation you have to be sure no writes are going on in
parallel. TRIMs are usually very fast even for large regions, so it
might not be a big deal.
Well, since we're doing the ioctl's as zio_nowait(zio_ioctl(...)), the
caller won't get blocked anyway. The UNMAP/TRIM commands will get
sandwiched in between regular SCSI traffic to the underlying layers,
and unless the device is reordering SCSI commands on its own (which is
already a "Bad Thing(tm)" for ZFS), there's no way we're going to ever
WRITE to an extent before we UNMAP it. Please correct me if I'm wrong.
I'm afraid you're wrong. The device is free to reorder commands as it
sees fit, usually for performance reasons. See:

http://en.wikipedia.org/wiki/NCQ

The reason why ZFS (and filesystems in general) can guarantee on-disk
consistency is that they don't rely on write ordering, instead they use
so-called "write barriers" (which are implemented by first waiting for
all pending writes to complete, then issuing a cache flush command) to
make sure all data written until that point is consistent.

Of course, this means that mixing discard and write operations for the
same range is a recipe for disaster in the form of data corruption,
because the discard could end up being applied after the write. Pawel's
patch protects against this by delaying conflicting writes until after
the conflicting discard request is complete.
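
(The core of that protection is just an extent-overlap test applied before
a write is issued; a minimal, illustrative sketch, not code from Pawel's
patch:)

    #include <sys/types.h>

    /*
     * A write for [w_off, w_off + w_len) must be held back for as long
     * as any still-pending discard overlaps the same range.
     */
    static boolean_t
    write_conflicts_with_discard(uint64_t w_off, uint64_t w_len,
        uint64_t d_off, uint64_t d_len)
    {
            return (w_off < d_off + d_len && d_off < w_off + w_len);
    }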
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Pawel Jakub Dawidek
2013-02-27 22:19:23 UTC
Permalink
Post by Etienne Dechamps
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
Correct. In the way I did it, I still have to track write zios in
similar fashion. I didn't want to increase syncing phase and block
writes any longer. I use separate thread that can run at any time
and send TRIMs while other pool activity is going on. In your
implementation you have to be sure no writes are going on in
parallel. TRIMs are usually very fast even for large regions, so it
might not be a big deal.
Well, since we're doing the ioctl's as zio_nowait(zio_ioctl(...)), the
caller won't get blocked anyway. The UNMAP/TRIM commands will get
sandwiched in between regular SCSI traffic to the underlying layers,
and unless the device is reordering SCSI commands on its own (which is
already a "Bad Thing(tm)" for ZFS), there's no way we're going to ever
WRITE to an extent before we UNMAP it. Please correct me if I'm wrong.
I'm afraid you're wrong. The device is free to reorder commands as it
http://en.wikipedia.org/wiki/NCQ
The reason why ZFS (and filesystems in general) can guarantee on-disk
consistency is that they don't rely on write ordering, instead they use
so-called "write barriers" (which are implemented by first waiting for
all pending writes to complete, then issuing a cache flush command) to
make sure all data written until that point is consistent.
Of course, this means that mixing discard and write operations for the
same range is a recipe for disaster in the form of data corruption,
because the discard could end up being applied after the write. Pawel's
patch protects against this by delaying conflicting writes until after
the conflicting discard request is complete.
This may not be a problem in Sašo's approach, though, as all we need to
do is ensure no conflicting writes and TRIMs will be sent during one
syncing phase. Once this phase is done we send cache flush requests on
the uberblock update, so even if the given region is reallocated and
written to during the next txg sync, we are sure that all previous writes
and TRIMs have already been executed thanks to the flush.

Having said that, this also means that we have to be very sure that the
flush actually works, because if it doesn't we can corrupt the data
badly, and as we know, flushes do not always work as expected.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-02-27 22:26:37 UTC
Permalink
Post by Etienne Dechamps
I'm afraid you're wrong. The device is free to reorder commands as it
http://en.wikipedia.org/wiki/NCQ
My bad, you're of course right.

--
Saso
George Wilson
2013-02-27 17:13:54 UTC
Permalink
Saso,

I can help on the ZFS side. The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.

- George
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI Unmap and
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) Altering dkioc_free_t to take a linked list of dkioc_free_extent_t's,
so that we can exploit the underlying commands to their fullest
potential (issuing a single Unmap/TRIM command for multiple extents,
not one command per extent).
2) Altering stmf_sbd/sbd.c, stmf_sbd/sbd_scsi.c and zfs/zvol.c to
exploit the above API change.
3) Implemented walking of the freed_map in metaslab_sync_done to notify
the vdevs of freed extents (via DKIOCFREE zio's).
4) Implemented support for DKIOCFREE zio's in vdev_disk_io_start to
issue the ioctl to the underlying sd layer.
5) Tested 3&4 to at least make sure it doesn't bomb on boot.
The current implementation is limited to issuing DKIOCFREE ioctl's only
for leaf disk-vdevs and mirror vdevs, since raidz vdevs can vary their
layout geometry and we'd need to do some smart extent offset
recalculation there to get it right for leaf vdevs.
1) Won't the changes in dkioc_free_t break external applications? My
guess is no, since I'm not aware of anybody using this interface
externally (other than COMSTAR, which I updated).
2) Do you think doing unmap in metaslab_sync_done is a good way to go
about it? In particular, I'm not sure if this won't make implementing
it for raidz at a later date harder or impossible.
3) I'm a bit dubious about whether I understand space_map internals
right and am walking the extents in freed_map in metaslab_vdev_free
correctly (the code for space_maps is completely undocumented, so
I'm inferring how they work by learning from their other uses).
Specifically, I'd also like to ask for somebody to please help me
implement DKIOCFREE ioctl support in sd.c (i.e. encode a SCSI Unmap and
SATA TRIM) - I believe this requires digging around deep in driver
bowels and I don't feel like I'm skilled enough to do it just yet. At
the moment the ZFS ioctl calls simply hit ldi_ioctl which returns an
ENOTSUP, which prompts the vdev to stop making these calls (subsequent
DKIOCFREE zio's are not even attempted and simply return ENOTSUP right
away). Another issue is that while I have documentation for the SCSI
Unmap command structure, I don't for SATA TRIM (subscription required -
yay!). If by some lucky coincidence I figure out how to add support for
DKIOCFREE in sd.c, if somebody could please provide me the correct
command structure for SATA, I'd greatly appreciate.
Please let me know what you think. Thanks!
Cheers,
--
Saso
Sašo Kiselkov
2013-02-27 17:47:26 UTC
Permalink
Post by George Wilson
I can help on the ZFS side.
Thanks!
Post by George Wilson
The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.
Done, webrev updated. This will give a 1-txg delay in unmaps. If this is
too little, maybe we can implement multiple defer maps, or commit unmaps
less often (not every txg). What do you think?

Cheers,
--
Saso
Jim Klimov
2013-02-27 20:33:07 UTC
Permalink
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.
Does the deferred release of blocks now preserve the possibility of
rollbacks (i.e. blocks referenced from the uberblock ring of 32 or 128
recent TXGs)?

I.e. do we have any "guarantee" that recently released blocks won't
in fact be instantly overwritten, and will remain available for rollback?

On the other hand, would this in any way conflict with very full pools
(i.e. do we free the oldest of the deferred blocks if we otherwise have
nowhere to write)?

Thanks :)
//Jim
Sašo Kiselkov
2013-02-27 20:40:10 UTC
Permalink
Post by Jim Klimov
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.
Does the deferred-release of blocks now protect the possibility of
rollbacks (i.e. the referenced from the uberblock ring of 32 or 128
recent TXGs)?
I.e. do we have any "guarantee" that recently released blocks won't
in fact be instantly overwritten, and will be available for rollback?
Nope, blocks may get reallocated as soon as a single txg completes. It's
only unmaps which will be held back, but only a single txg at this
moment (freed extents are moved to "freed_map" on the first
metaslab_sync and to "defer_map" on the second - it's this "defer_map"
which we unmap).
Post by Jim Klimov
On the other hand, would this anyhow conflict with very full pools
(i.e. do we free the oldest of deferred blocks if we otherwise have
nowhere to write)?
It would only impact performance if:

    per_txg_update_volume * num_txgs_to_keep > available_free_space

Unless you have a very weird storage setup, you'll only hit this if your
pool is >98-99% full. And even so the worst case is that blocks simply
won't get trimmed and you'll incur SSD erase-program latency. I'm sure
you realize that at >98% full, degraded sync write performance really is
the least of your problems.

Cheers,
--
Saso
Sašo Kiselkov
2013-03-01 19:02:56 UTC
Permalink
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.
So I've reimplemented portions of the trim logic in metaslab to add all
of the features of the current FreeBSD & Linux implementations (save for
raidz support, which is next on my list).

http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/

Here's how it works:

1) We keep a pair of space_maps (in metaslab->ms_cur_trims and
metaslab->ms_prev_trims) which record free operations happening
on the metaslab.

2) New free's (freed_map) are committed to ms_cur_trims, whereas
allocations (allocmap) are applied to both ms_cur_trims and
ms_prev_trims (to make sure no trims take place on space that has
been allocated).

3) Once zfs_txgs_per_trim time passes, we attempt to swap the
ms_cur_trims space map into ms_prev_trims. If ms_prev_trims is
non-NULL, we TRIM the extents recorded in it first.

4) All of the above takes place in metaslab_sync to take advantage
of the write barrier at the end of metaslab_sync.

This implementation gives us several important properties:

A) We aggregate trims into larger ops.
B) We utilize existing write barriers to protect from trims.
C) Disaster recovery is unaffected and we can even configure the
number of txgs we want for disaster recovery at runtime (using
the zfs_txgs_per_trim tunable).
D) It's super simple; we use existing space_map code and almost all
of the interesting bits span just 40 lines at metaslab.c:1204.
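
To make step 3 concrete, the swap boils down to roughly the following
fragment (a sketch using the names from the webrev's prototypes; the
mti_birth/mti_map members and the exact freeing of the old trimset are
assumptions):

    /*
     * Every zfs_txgs_per_trim txgs, TRIM whatever the previous trimset
     * accumulated, then rotate the sets so the current one starts aging.
     */
    if (txg >= msp->ms_cur_trims->mti_birth + zfs_txgs_per_trim) {
            if (msp->ms_prev_trims != NULL) {
                    /*
                     * Issues the DKIOCFREE zios; the done callback
                     * eventually releases the trimset.
                     */
                    metaslab_vdev_trim(msp, &msp->ms_prev_trims->mti_map);
            }
            msp->ms_prev_trims = msp->ms_cur_trims;
            msp->ms_cur_trims = metaslab_new_trimset(txg, sm);
    }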

The changes in space_map.c are simply comments I added to document what
the functions do, so that people after me don't have to go through the
code trying to figure out what the hell's going on. Please review to
make sure they don't contain lies.

As before, this is super-early code, totally untested (all I know is
that it doesn't panic on boot - yay!). The reason I'm putting this up is
for the gurus to please take a look at it and judge whether I haven't
totally messed something up. Thanks!

Cheers,
--
Saso
Pawel Jakub Dawidek
2013-03-01 19:45:21 UTC
Permalink
Post by Sašo Kiselkov
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use the
defer_map in metaslab_sync_done() instead of the freed_map. This will
allow 'zpool import -F' to continue to work. I'll look at some of your
other changes.
So I've reimplemented portions of the trim logic in metaslab to add all
of the features of the current freebsd & linux implementation (save for
raidz support that's next on my list).
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) We keep a pair of space_maps (in metaslab->ms_cur_trims and
metaslab->ms_prev_trims) which record free operations happening
on the metaslab.
2) New free's (freed_map) are committed to ms_cur_trims, whereas
allocations (allocmap) are applied to both ms_cur_trims and
ms_prev_trims (to make sure no trims take place on space that has
been allocated).
3) Once zfs_txgs_per_trim time passes, we attempt to swap the
ms_cur_trims space map into ms_prev_trims. If ms_prev_trims is
non-NULL, we TRIM the extents recorded in it first.
4) All of the above takes place in metaslab_sync to take advantage
of the write barrier at the end of metaslab_sync.
A) We aggregate trims into larger ops.
B) We utilize existing write barriers to protect from trims.
C) Disaster recovery is unaffected and we can even configure the
number of txgs we want for disaster recovery at runtime (using
the zfs_txgs_per_trim tunable).
D) It's super simple; we use existing space_map code and almost all
of the interesting bits span just 40 lines at metaslab.c:1204.
The changes in space_map.c are simply comments I added to document what
the functions do, so that people after me don't have to go through the
code trying to figure out what the hell's going on. Please review to
make sure they don't contain lies.
As before, this is super-early code, totally untested (all I know is
that it doesn't panic on boot - yay!). The reason I'm putting this up is
for the gurus to please take a look at it and judge whether I haven't
totally messed something up. Thanks!
The code looks promising and indeed much simpler than mine. Some
comments on the implementation:

If zfs_nounmap is set, maybe we can avoid TRIM much earlier, i.e. not even
manage the additional space maps? Currently we pay the price of managing
space maps, etc. just to discard the I/Os later. Is that in case someone
would like to change it while the pool is running?

Why not use the standard list API for dkioc_free_t instead of rolling
your own?

I don't see a point in ms_cur_trims and ms_prev_trims being pointers
for which memory has to be allocated and freed. Why not inline them and
check whether the previous map is initialized based on mti_birth being
non-zero, or something along those lines? Not a big deal, though.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-03-01 21:15:39 UTC
Permalink
Post by Pawel Jakub Dawidek
Post by Sašo Kiselkov
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use
the defer_map in metaslab_sync_done() instead of the freed_map.
This will allow 'zpool import -F' to continue to work. I'll
look at some of your other changes.
So I've reimplemented portions of the trim logic in metaslab to
add all of the features of the current freebsd & linux
implementation (save for raidz support that's next on my list).
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) We keep a pair of space_maps (in metaslab->ms_cur_trims and
metaslab->ms_prev_trims) which record free operations happening
on the metaslab.
2) New free's (freed_map) are committed to ms_cur_trims, whereas
allocations (allocmap) are applied to both ms_cur_trims and
ms_prev_trims (to make sure no trims take place on space that
has been allocated).
3) Once zfs_txgs_per_trim time passes, we attempt to swap the
ms_cur_trims space map into ms_prev_trims. If ms_prev_trims is
non-NULL, we TRIM the extents recorded in it first.
4) All of the above takes place in metaslab_sync to take
advantage of the write barrier at the end of metaslab_sync.
A) We aggregate trims into larger ops. B) We utilize existing
write barriers to protect from trims. C) Disaster recovery is
unaffected and we can even configure the number of txgs we want
for disaster recovery at runtime (using the zfs_txgs_per_trim
tunable). D) It's super simple; we use existing space_map code
and almost all of the interesting bits span just 40 lines at
metaslab.c:1204.
The changes in space_map.c are simply comments I added to
document what the functions do, so that people after me don't
have to go through the code trying to figure out what the hell's
going on. Please review to make sure they don't contain lies.
As before, this is super-early code, totally untested (all I know
is that it doesn't panic on boot - yay!). The reason I'm putting
this up is for the gurus to please take a look at it and judge
whether I haven't totally messed something up. Thanks!
The code looks promissing and indeed much simpler than mine. Some
Glad to hear from you!
Post by Pawel Jakub Dawidek
If zfs_nounmap is set maybe we can avoid TRIM much earlier, ie. not
even manage additional space maps? Currently we pay the price of
managing space maps, etc. just to discard I/Os later. Is it in case
someone would like to change it while the pool is running?
I'll add some glue code to handle this situation - shouldn't be hard.
Post by Pawel Jakub Dawidek
Why not to use standard list API for dkioc_free_t instead of
rolling your own?
I wasn't aware dkioc_free_t was some sort of a "standard"... I altered
the Illumos API itself (not just created my own) so that we can take
advantage of SCSI Unmap's and SATA TRIM's variable parameter list,
which can include multiple extents per command. I also modified
comstar to use the new API.
Post by Pawel Jakub Dawidek
I don't see a point for ms_cur_trims and ms_prev_trims to be
pointers and memory has to be allocated and freed for them. Why not
to inline them and check if previous map is initialized based on
mti_birth being non-zero or somethine alone these lines. Not a big
deal, though.
The original reason is that we'd have to make the space maps pointers
anyway (in order to swap them around), otherwise we'd have to use two
space_map_vacate calls to move the current trim set over to the
previous trim set. Plus, it makes the whole thing fairly bug-resistant
(panic on uninitialized use). Considering that memory allocation is a
fairly cheap operation, I'm not sure we should worry about it too
much. We do it very infrequently anyway (it's not like it's a
super-hot code path), so I'd like to avoid premature optimization.
Anyway, I don't dwell on it very much, so if others request, I'll
rework it to just be part of the metaslab_t allocation.

Cheers,
- --
Saso
Pawel Jakub Dawidek
2013-03-01 21:28:50 UTC
Permalink
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
Why not to use standard list API for dkioc_free_t instead of
rolling your own?
I wasn't aware dkioc_free_t was some sort of a "standard"... I altered
the Illumos API itself (not just created my own) so that we can take
advantage of SCSI Unmap's and SATA TRIM's variable parameter list,
which can include multiple extents per command. I also modified
comstar to use the new API.
I mean why don't you use list_create(), list_insert_tail(), etc. instead of:

for (space_seg_t *ss = avl_first(&trim_map->sm_root); ss != NULL;
    ss = AVL_NEXT(&trim_map->sm_root, ss)) {
        dkioc_free_extent_t *ext;

        ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
        ext->dfe_start = ss->ss_start;
        ext->dfe_length = ss->ss_end - ss->ss_start;
        if (df->df_first == NULL) {
                df->df_first = df->df_last = ext;
        } else {
                df->df_last->dfe_next = ext;
                df->df_last = ext;
        }
}
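
I.e. something along these lines (a sketch only; it assumes dkioc_free_t
grows a list_t df_extents and dkioc_free_extent_t a list_node_t dfe_node,
which is not what the current webrev has):

    list_create(&df->df_extents, sizeof (dkioc_free_extent_t),
        offsetof(dkioc_free_extent_t, dfe_node));
    for (space_seg_t *ss = avl_first(&trim_map->sm_root); ss != NULL;
        ss = AVL_NEXT(&trim_map->sm_root, ss)) {
            dkioc_free_extent_t *ext;

            ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
            ext->dfe_start = ss->ss_start;
            ext->dfe_length = ss->ss_end - ss->ss_start;
            list_insert_tail(&df->df_extents, ext);
    }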
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
I don't see a point for ms_cur_trims and ms_prev_trims to be
pointers and memory has to be allocated and freed for them. Why not
to inline them and check if previous map is initialized based on
mti_birth being non-zero or somethine alone these lines. Not a big
deal, though.
The original reason is that we'd have to make the space maps pointers
anyway (in order to swap them around), otherwise we'd have to use two
space_map_vacate calls to move the current trim set over to the
previous trim set. Plus, it makes the whole thing fairly bug-resistant
(panic on uninitialized use). Considering that memory allocation is a
fairly cheap operation, I'm not sure we should worry about it too
much. We do it very infrequently anyway (it's not like it's a
super-hot code path), so I'd like to avoid premature optimization.
Anyway, I don't dwell on it very much, so if others request, I'll
rework it to just be part of the metaslab_t allocation.
You could swap the space maps even if ms_cur_trims/ms_prev_trims were
not pointers, but indeed it can make the code less robust.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-03-01 21:33:15 UTC
Permalink
Post by Pawel Jakub Dawidek
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
Why not to use standard list API for dkioc_free_t instead of
rolling your own?
I wasn't aware dkioc_free_t was some sort of a "standard"... I
altered the Illumos API itself (not just created my own) so that
we can take advantage of SCSI Unmap's and SATA TRIM's variable
parameter list, which can include multiple extents per command. I
also modified comstar to use the new API.
for (space_seg_t *ss = avl_first(&trim_map->sm_root); ss != NULL;
    ss = AVL_NEXT(&trim_map->sm_root, ss)) {
        dkioc_free_extent_t *ext;

        ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
        ext->dfe_start = ss->ss_start;
        ext->dfe_length = ss->ss_end - ss->ss_start;
        if (df->df_first == NULL) {
                df->df_first = df->df_last = ext;
        } else {
                df->df_last->dfe_next = ext;
                df->df_last = ext;
        }
}
Ah, that. I considered that. The answer why not lies in the fact that
some places in the kernel must copy the dkioc_free_t structure around
(namely see zvol.c:1714) - this breaks list_t. I documented this in
another webrev (after a day of hunting around for why my lists were
breaking):
http://cr.illumos.org/~webrev/skiselkov/3525/usr/src/uts/common/sys/list.h.udiff.html.
Post by Pawel Jakub Dawidek
You could swap space map even if ms_cur_trims/ms_prev_trims are
not pointers, but indeed it can make the code less robust.
Due to the above experiences with list_t I wasn't prepared to take any
chances. If it is indeed possible to swap space maps around with a
direct structure assign, then I stand corrected.

Cheers,
- --
Saso
Sašo Kiselkov
2013-03-01 22:07:15 UTC
Permalink
Post by Pawel Jakub Dawidek
Post by Sašo Kiselkov
Post by George Wilson
Saso,
I can help on the ZFS side. The first thing to change is to use
the defer_map in metaslab_sync_done() instead of the freed_map.
This will allow 'zpool import -F' to continue to work. I'll
look at some of your other changes.
So I've reimplemented portions of the trim logic in metaslab to
add all of the features of the current freebsd & linux
implementation (save for raidz support that's next on my list).
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
1) We keep a pair of space_maps (in metaslab->ms_cur_trims and
metaslab->ms_prev_trims) which record free operations happening
on the metaslab.
2) New free's (freed_map) are committed to ms_cur_trims, whereas
allocations (allocmap) are applied to both ms_cur_trims and
ms_prev_trims (to make sure no trims take place on space that
has been allocated).
3) Once zfs_txgs_per_trim time passes, we attempt to swap the
ms_cur_trims space map into ms_prev_trims. If ms_prev_trims is
non-NULL, we TRIM the extents recorded in it first.
4) All of the above takes place in metaslab_sync to take
advantage of the write barrier at the end of metaslab_sync.
A) We aggregate trims into larger ops. B) We utilize existing
write barriers to protect from trims. C) Disaster recovery is
unaffected and we can even configure the number of txgs we want
for disaster recovery at runtime (using the zfs_txgs_per_trim
tunable). D) It's super simple; we use existing space_map code
and almost all of the interesting bits span just 40 lines at
metaslab.c:1204.
The changes in space_map.c are simply comments I added to
document what the functions do, so that people after me don't
have to go through the code trying to figure out what the hell's
going on. Please review to make sure they don't contain lies.
As before, this is super-early code, totally untested (all I know
is that it doesn't panic on boot - yay!). The reason I'm putting
this up is for the gurus to please take a look at it and judge
whether I haven't totally messed something up. Thanks!
The code looks promissing and indeed much simpler than mine. Some
If zfs_nounmap is set maybe we can avoid TRIM much earlier, ie. not
even manage additional space maps? Currently we pay the price of
managing space maps, etc. just to discard I/Os later. Is it in case
someone would like to change it while the pool is running?
Here we go, done: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/

My changes:

*) I renamed zfs_nounmap to zfs_notrim and made it an extern in vdev.h
*) Added zfs_notrim check in metaslab_sync to avoid trimset processing
and added glue code to correctly handle cases when users change this
setting at runtime.
*) Renamed trim_info and friends to "trimset", standardized the naming
and moved all trimset processing out of metaslab_sync into their own
functions.

Hope this looks better now.

Cheers,
- --
Saso
Sašo Kiselkov
2013-03-01 22:30:40 UTC
Permalink
Post by Sašo Kiselkov
*) I renamed zfs_nounmap to zfs_notrim and made it an extern in vdev.h
I had actually realized that having a zfs_notrim check in vdev_disk.c
didn't really make sense anymore, since we want to control that from
metaslab anyway (we have better knowledge of pool state there). So I
moved the tunable to metaslab.c and extern'ed it in sys/metaslab.h (in
case other places need to check it, like arc.c etc.). This also allows
metaslab to correctly commit the last open trimset when a user changes
zfs_notrim to B_TRUE (otherwise metaslab would simply issue the trimset
to the vdev, which would drop it because zfs_notrim is already in
effect, defeating the whole point of issuing that last trimset).
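A rough sketch of that idea, with stand-in types so it compiles on its own; this is illustrative only and not the code in the webrev:

#include <stdio.h>

typedef int boolean_t;
#define	B_FALSE	0
#define	B_TRUE	1

typedef struct trimset { int unused; } trimset_t;
typedef struct metaslab { trimset_t *ms_cur_trimset; } metaslab_t;

boolean_t zfs_notrim = B_FALSE;		/* the tunable being discussed */

static void
issue_trimset(metaslab_t *msp)		/* stand-in for the DKIOCFREE path */
{
	printf("issuing final trimset for metaslab %p\n", (void *)msp);
}

/*
 * With the check living here (rather than in vdev_disk.c), flipping
 * zfs_notrim at runtime lets the metaslab push out the trimset it has
 * already collected once, before the trim machinery goes quiet.
 */
static void
metaslab_sync_trim_check(metaslab_t *msp)
{
	if (!zfs_notrim)
		return;			/* trimming still enabled */

	if (msp->ms_cur_trimset != NULL) {
		issue_trimset(msp);
		msp->ms_cur_trimset = NULL;
	}
}

int
main(void)
{
	trimset_t ts;
	metaslab_t ms = { &ts };

	zfs_notrim = B_TRUE;		/* user disables trim at runtime */
	metaslab_sync_trim_check(&ms);
	return (0);
}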

(Original webrev updated)

Cheers,
--
Saso
Pawel Jakub Dawidek
2013-03-02 20:45:23 UTC
Permalink
Post by Sašo Kiselkov
*) I renamed zfs_nounmap to zfs_notrim and made it an extern in vdev.h
I had actually realized that having a zfs_notrim check in vdev_disk.c
didn't really make sense anymore, since we would want to control that
from metaslab anyway (we have better knowledge of pool state there),
so I moved the tunable to metaslab.c and extern'ed it in
sys/metaslab.h (in case other places need to check it, like arc.c
etc.). This also allows metaslab to correctly commit the last open
trimset when a user changes zfs_notrim to B_TRUE (otherwise metaslab
would simply issue the trimset to the vdev, which would shred it due
to a zfs_notrim being already in effect - kind of defeating the
purpose of the whole exercise in metaslab of trying to issue the last
trimset).
(Original webrev updated)
More detailed review below (ZFS parts only) and an idea how to handle
RAIDZ vdevs at the end.

/*
+ * Tunable to allow for debugging SCSI Unmap/SATA TRIM calls. Setting
+ * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
+ * underlying storage. Synchronous write performance might degrade over
+ * time with zfs_notrim set.
+ */
+boolean_t zfs_notrim = B_FALSE;

In FreeBSD we do prefer to give sysctls and loader tunables positive
names, eg. vfs.zfs.trim_enabled rather than vfs.zfs.trim_disabled and
ZFS is one of those examples in FreeBSD that has many negative names.
Because I'm not the only one pointing that out, maybe it would be a good
time to introduce such rule, even if not very strict to Illumos?

+/*
+ * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+ * issued to the underlying vdev. This serves to fulfill two functions:
+ * aggregate many small frees into fewer larger ones (which should help
+ * with devices which do not take so kindly to them) and to allow for
+ * disaster recovery (extents won't get trimmed immediately, but
+ * instead only after passing this fixed timeout).
+ */
+unsigned int zfs_txgs_per_trim = 32;

I'm not entirely sure if this is a good idea, to have one variable
serve those two purposes. We do want to delay TRIMs by at least those
32 txgs, but maybe gathering TRIMs from 32 txgs will introduce noticeable
peaks in load every 32 txgs? Probably not worth the complexity at this
point, just noting that having one variable to configure two largely
unrelated things is something worth avoiding in general.

+static void metaslab_trim(metaslab_t *msp, uint64_t txg,
+ space_map_t *allocmap, space_map_t *freed_map);
+static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg,
+ const space_map_t *sm);
+static void metaslab_free_trimset(metaslab_trimset_t *ts);
+static void metaslab_vdev_trim(metaslab_t *msp, space_map_t *trim_map);
+static void metaslab_vdev_trim_done(zio_t *zio);

I know this is following current variable names in ZFS, but 'allocmap'
isn't consistent with 'freed_map' and 'trim_map' naming.

+static void
+metaslab_trim(metaslab_t *msp, uint64_t txg, space_map_t *allocmap,
+ space_map_t *freed_map)
+{
+ if (msp->ms_cur_trimset == NULL) {
+ msp->ms_cur_trimset = metaslab_new_trimset(txg, allocmap);
+ }

Redundant brackets.

+ /*
+ * Remove any segments which have been allocated in the mean
+ * time from both trim maps.
+ */

s/mean time/meantime/ ?

+/*
+ * Allocates and initializes a new trimset structure. The `txg' argument
+ * indicates when this trimset was born and `sm' is used as the template
+ * for the structure's internal space map (the start, size and shifts are
+ * copied from it).
+ */

s/shifts/shift/ ?

+ ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
[...]
+ df = kmem_zalloc(sizeof (dkioc_free_t), KM_SLEEP);

Sometimes you use sizeof(*var) and sometimes sizeof(struct name).
I personally prefer the former, as it allows to change variable type
without changing any other code, but either way would be nice to use one
method consistently.

+ for (space_seg_t *ss = avl_first(&trim_map->sm_root); ss != NULL;
+ ss = AVL_NEXT(&trim_map->sm_root, ss)) {
+ dkioc_free_extent_t *ext;
+
+ ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
+ ext->dfe_start = ss->ss_start;
+ ext->dfe_length = ss->ss_end - ss->ss_start;
+ if (df->df_first == NULL) {
+ df->df_first = df->df_last = ext;
+ } else {
+ df->df_last->dfe_next = ext;
+ df->df_last = ext;
+ }
+ }

I wonder if you could just drop df_last and if you need those extents
sorted you could do:

for (space_seg_t *ss = avl_last(&trim_map->sm_root); ss != NULL;
ss = AVL_PREV(&trim_map->sm_root, ss)) {
dkioc_free_extent_t *ext;

ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
ext->dfe_start = ss->ss_start;
ext->dfe_length = ss->ss_end - ss->ss_start;
ext->dfe_next = df->df_first;
df->df_first = ext;
}

+ if (df->df_first) {
[...]
+ ASSERT(df != NULL);

Pointers are not booleans - in the ASSERT() above you compare it with
NULL, which gives boolean and is nice, but in the condition above you
treat pointer as boolean instead of checking it against NULL. I know it
compiles, but it is not consistent and it is not pretty:)

--- old/usr/src/uts/common/fs/zfs/space_map.c Fri Mar 1 23:30:30 2013
+++ new/usr/src/uts/common/fs/zfs/space_map.c Fri Mar 1 23:30:30 2013

Nice of you to add comments here.

+/*
+ * Checks whether space map `sm' contains a segment starting at `start'
+ * of length `size'. Returns 1 if it does, 0 otherwise.
+ */
boolean_t
space_map_contains(space_map_t *sm, uint64_t start, uint64_t size)

I'd say that it returns B_TRUE or B_FALSE.

+ /*
+ * DKIOCFREE ioctl's need some special handling on interior
+ * vdevs. Specifically, they are only safe on mirror vdevs.
+ * We can't simply pass the same disk offsets to free on a
+ * raidz, since there the vdev offsets don't correspond to
+ * physical disk addresses.
+ * TODO: move this ioctl distribution machinery into the
+ * vdev implementations details (vdev_ops_t) to allow each
+ * vdev type to implement custom ioctl pre-processing.
+ */
+ zio = zio_null(pio, spa, vd, done, private, flags);
+ if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_MIRROR) == 0) {
+ for (c = 0; c < vd->vdev_children; c++)
+ zio_nowait(zio_ioctl(zio, spa,
+ vd->vdev_child[c], cmd,
+ NULL, NULL, priority, flags));
+ }

Or you could not pass it as ioctl, but as ZIO_TYPE_FREE and just allow
vdevs to do the math. You don't need to introduce additional vdev method
then and BTW all code is already there in my implementation to handle
ZIO_TYPE_FREE in RAIDZ vdevs (in addition to mirror vdevs).
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-03-03 02:20:17 UTC
Permalink
Post by Pawel Jakub Dawidek
Post by Sašo Kiselkov
*) I renamed zfs_nounmap to zfs_notrim and made it an extern
in vdev.h
I had actually realized that having a zfs_notrim check in
vdev_disk.c didn't really make sense anymore, since we would want
to control that from metaslab anyway (we have better knowledge of
pool state there), so I moved the tunable to metaslab.c and
extern'ed it in sys/metaslab.h (in case other places need to
check it, like arc.c etc.). This also allows metaslab to
correctly commit the last open trimset when a user changes
zfs_notrim to B_TRUE (otherwise metaslab would simply issue the
trimset to the vdev, which would shred it due to a zfs_notrim
being already in effect - kind of defeating the purpose of the
whole exercise in metaslab of trying to issue the last trimset).
(Original webrev updated)
More detailed review below (ZFS parts only) and an idea how to
handle RAIDZ vdevs at the end.
Cool, thanks. Responses below (updated webrev at the original address).
Post by Pawel Jakub Dawidek
/*
+ * Tunable to allow for debugging SCSI Unmap/SATA TRIM calls. Setting
+ * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
+ * underlying storage. Synchronous write performance might degrade over
+ * time with zfs_notrim set.
+ */
+boolean_t zfs_notrim = B_FALSE;
In FreeBSD we do prefer to give sysctls and loader tunables
positive names, eg. vfs.zfs.trim_enabled rather than
vfs.zfs.trim_disabled and ZFS is one of those examples in FreeBSD
that has many negative names. Because I'm not the only one pointing
that out, maybe it would be a good time to introduce such rule,
even if not very strict to Illumos?
Point taken, name and behavior inverted.
Post by Pawel Jakub Dawidek
+/*
+ * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+ * issued to the underlying vdev. This serves to fulfill two functions:
+ * aggregate many small frees into fewer larger ones (which should help
+ * with devices which do not take so kindly to them) and to allow for
+ * disaster recovery (extents won't get trimmed immediately, but
+ * instead only after passing this fixed timeout).
+ */
+unsigned int zfs_txgs_per_trim = 32;
I'm not entirely sure if this is good idea to have one variable to
serve those two purposes. We do want to delay TRIMs by at least
those 32 txgs, but maybe gathering TRIMs from 32 txgs will
introduce noticable peaks in load every 32 txgs? Probably not worth
the complexity at this point, just noting that having one varable
to configure two largely unrelated things is something worth
avoiding in general.
This is somewhat by design. In order to avoid doing too many
space_map_walks in metaslab_sync, I limited the number of space map
manipulations for trim to two (the current one and the old one), with
the interval between them serving both functions - to aggregate and to
delay trims. Now, we could introduce an array of space maps and
aggregate into, say, 4 space maps, and only trim the last one, but
this would necessitate a space_map_walk with the allocmap for each of
them, doubling the cost.

I haven't tested what would happen with more space maps, but at the
moment I prefer a simple working solution to a super-fancy one that is
more fragile. Perhaps we could even reduce zfs_txgs_per_trim to 16 - I
doubt anybody has to go back that far to recover a lost pool (and at
that point serious data corruption is probably inevitable). I mean, 16
txgs is typically at least 1 minute of history, if not more (on some
of my machines, 16 txgs is over 5 minutes' worth of commits).
Post by Pawel Jakub Dawidek
+static void metaslab_trim(metaslab_t *msp, uint64_t txg,
+ space_map_t *allocmap, space_map_t *freed_map);
+static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg,
+ const space_map_t *sm);
+static void metaslab_free_trimset(metaslab_trimset_t *ts);
+static void metaslab_vdev_trim(metaslab_t *msp, space_map_t *trim_map);
+static void metaslab_vdev_trim_done(zio_t *zio);
I know this is following current variable names in ZFS, but
'allocmap' isn't consistent with 'freed_map' and 'trim_map'
naming.
I renamed "trim_map" to "trimmap" to jive better with the "allocmap",
"freemap" and "defermap" conventions used elsewhere. The only oddball
that remains is "freed_map", which has the underbar in it merely to
emphasize its distinction from "freemap".
Post by Pawel Jakub Dawidek
+static void
+metaslab_trim(metaslab_t *msp, uint64_t txg, space_map_t *allocmap,
+ space_map_t *freed_map)
+{
+ if (msp->ms_cur_trimset == NULL) {
+ msp->ms_cur_trimset = metaslab_new_trimset(txg, allocmap);
+ }
Redundant brackets.
And the problem is? I use it this way all over the place. While I do
agree that using it for simple stuff such as:

if (x > 5) {
y = 1;
}

is somewhat pointless, long assignments and especially multi-line
statements can benefit from this (see for instance metaslab.c:1943).
Post by Pawel Jakub Dawidek
+ /*
+ * Remove any segments which have been allocated in the mean
+ * time from both trim maps.
+ */
s/mean time/meantime/ ?
Shit my grammar's bad. Fixed.
Post by Pawel Jakub Dawidek
+/*
+ * Allocates and initializes a new trimset structure. The `txg' argument
+ * indicates when this trimset was born and `sm' is used as the template
+ * for the structure's internal space map (the start, size and shifts are
+ * copied from it).
+ */
s/shifts/shift/ ?
Fixed.
Post by Pawel Jakub Dawidek
+ ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
[...]
+ df = kmem_zalloc(sizeof (dkioc_free_t), KM_SLEEP);
Consistented (I prefer the former too). Some of the other metaslab
code still uses the type name rather than the variable itself, though
that's probably for another commit on another day.
Post by Pawel Jakub Dawidek
+ for (space_seg_t *ss = avl_first(&trim_map->sm_root); ss != NULL;
+ ss = AVL_NEXT(&trim_map->sm_root, ss)) {
+ dkioc_free_extent_t *ext;
+
+ ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
+ ext->dfe_start = ss->ss_start;
+ ext->dfe_length = ss->ss_end - ss->ss_start;
+ if (df->df_first == NULL) {
+ df->df_first = df->df_last = ext;
+ } else {
+ df->df_last->dfe_next = ext;
+ df->df_last = ext;
+ }
+ }
I wonder if you could just drop df_last and if you need those extents
sorted you could do:
for (space_seg_t *ss = avl_last(&trim_map->sm_root); ss != NULL;
ss = AVL_PREV(&trim_map->sm_root, ss)) {
dkioc_free_extent_t *ext;

ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
ext->dfe_start = ss->ss_start;
ext->dfe_length = ss->ss_end - ss->ss_start;
ext->dfe_next = df->df_first;
df->df_first = ext;
}
+ if (df->df_first) {
[...]
Maybe, but not all consumers of this API use AVL trees - this would
force them to either implement backwards feeding in their own
algorithms (which might not be possible), or to keep around a pointer
to df_last on the stack (possible, but more cumbersome, especially if
you're not filling the list up in one go). I don't know, I just think
it's good taste to have both ends of the list always at hand.
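As an aside, a self-contained toy showing the append pattern under discussion; the dfe_*/df_* field names mirror the patch, everything else here is illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct dkioc_free_extent {
	uint64_t dfe_start;
	uint64_t dfe_length;
	struct dkioc_free_extent *dfe_next;
} dkioc_free_extent_t;

typedef struct dkioc_free {
	dkioc_free_extent_t *df_first;
	dkioc_free_extent_t *df_last;
} dkioc_free_t;

/* O(1) append regardless of how many separate passes fill the list. */
static void
dkioc_free_append(dkioc_free_t *df, uint64_t start, uint64_t length)
{
	/* error handling and freeing omitted for brevity */
	dkioc_free_extent_t *ext = calloc(1, sizeof (*ext));

	ext->dfe_start = start;
	ext->dfe_length = length;
	if (df->df_first == NULL) {
		df->df_first = df->df_last = ext;
	} else {
		df->df_last->dfe_next = ext;
		df->df_last = ext;
	}
}

int
main(void)
{
	dkioc_free_t df = { NULL, NULL };

	dkioc_free_append(&df, 0, 4096);
	dkioc_free_append(&df, 8192, 4096);
	for (dkioc_free_extent_t *e = df.df_first; e != NULL; e = e->dfe_next)
		printf("extent @%llu len %llu\n",
		    (unsigned long long)e->dfe_start,
		    (unsigned long long)e->dfe_length);
	return (0);
}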
Post by Pawel Jakub Dawidek
+ ASSERT(df != NULL);
Pointers are not booleans - in the ASSERT() above you compare it
with NULL, which gives boolean and is nice, but in the condition
above you treat pointer as boolean instead of checking it against
NULL. I know it compiles, but it is not consistent and it is not
pretty:)
I don't understand... at which point do I treat a pointer as a boolean
here? I don't, I'm merely checking for NULL pointers and letting the
kernel barf with a sensible "assertion failed df != NULL" rather than
a nonsensical "BAD TRAP at address 0x0" when used later - good
practice (which I rarely keep, shame on me) dictates always checking
your input values.
Post by Pawel Jakub Dawidek
+/*
+ * Checks whether space map `sm' contains a segment starting at `start'
+ * of length `size'. Returns 1 if it does, 0 otherwise.
+ */
boolean_t
space_map_contains(space_map_t *sm, uint64_t start, uint64_t size)
I'd say that it returns B_TRUE or B_FALSE.
True, fixed.
Post by Pawel Jakub Dawidek
+ /*
+ * DKIOCFREE ioctl's need some special handling on interior
+ * vdevs. Specifically, they are only safe on mirror vdevs.
+ * We can't simply pass the same disk offsets to free on a
+ * raidz, since there the vdev offsets don't correspond to
+ * physical disk addresses.
+ * TODO: move this ioctl distribution machinery into the
+ * vdev implementations details (vdev_ops_t) to allow each
+ * vdev type to implement custom ioctl pre-processing.
+ */
+ zio = zio_null(pio, spa, vd, done, private, flags);
+ if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_MIRROR) == 0) {
+ for (c = 0; c < vd->vdev_children; c++)
+ zio_nowait(zio_ioctl(zio, spa,
+ vd->vdev_child[c], cmd,
+ NULL, NULL, priority, flags));
+ }
Or you could not pass it as ioctl, but as ZIO_TYPE_FREE and just
allow vdevs to do the math. You don't need to introduce additional
vdev method then and BTW all code is already there in my
implementation to handle ZIO_TYPE_FREE in RAIDZ vdevs (in addition
to mirror vdevs).
The reason why I'm using DKIOCFREE ioctls rather than your method of
introducing a new zio type is because this allows us to explicitly
pass a list of extents, rather than just a single offset+length,
making much better use of the underlying bus commands.

I had a look and you were doing something very similar to what I was
already working on. Anyway, you did verify some of my assumptions I
had about raidz_map_alloc, so I can now put it up. See webrev for my
method. The main difference is, my method allows for multi-extent
ioctl's to be handled in one go (i.e. each leaf vdev gets a single
DKIOCFREE ioctl per metaslab, even if the leaf vdevs sit under mirrors
or raidzs).
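To illustrate the mirror half of that (the easy case), here is a toy fan-out: on a mirror the same offsets are valid on every child, so one multi-extent request per leaf suffices. The raidz column remapping the webrev does is deliberately left out, and all names here are made up:

#include <stdio.h>
#include <stdint.h>

typedef struct extent {
	uint64_t start;
	uint64_t length;
} extent_t;

static void
issue_free_to_leaf(int child, const extent_t *exts, int nexts)
{
	/* Stand-in for building a dkioc_free_t and sending one DKIOCFREE. */
	for (int i = 0; i < nexts; i++)
		printf("child %d: free %llu+%llu\n", child,
		    (unsigned long long)exts[i].start,
		    (unsigned long long)exts[i].length);
}

int
main(void)
{
	extent_t exts[] = { { 0, 131072 }, { 1048576, 262144 } };
	int children = 2;	/* a two-way mirror */

	for (int c = 0; c < children; c++)
		issue_free_to_leaf(c, exts, 2);
	return (0);
}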

Let me know what you think.

Cheers,
--
Saso
Pawel Jakub Dawidek
2013-03-03 13:12:42 UTC
Permalink
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
+/*
+ * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+ * issued to the underlying vdev. This serves to fulfill two functions:
+ * aggregate many small frees into fewer larger ones (which should help
+ * with devices which do not take so kindly to them) and to allow for
+ * disaster recovery (extents won't get trimmed immediately, but
+ * instead only after passing this fixed timeout).
+ */
+unsigned int zfs_txgs_per_trim = 32;
I'm not entirely sure if this is good idea to have one variable to
serve those two purposes. We do want to delay TRIMs by at least
those 32 txgs, but maybe gathering TRIMs from 32 txgs will
introduce noticable peaks in load every 32 txgs? Probably not worth
the complexity at this point, just noting that having one varable
to configure two largely unrelated things is something worth
avoiding in general.
This is somewhat by design. In order to avoid doing too many
space_map_walks in metaslab_sync, I limited the number of space map
manipulations for trim to two (the current one and the old one), with
the interval between them serving both functions - to aggregate and to
delay trims. Now, we could introduce an array of space maps and
aggregate into, say, 4 space maps, and only trim the last one, but
this would necessitate a space_map_walk with the allocmap for each of
them, doubling the cost.
I haven't tested what would happen with more space maps, but at the
moment I prefer a simple working solution to a super-fancy one that is
more fragile. Perhaps we could even reduce zfs_trims_per_txg to 16 - I
doubt anybody has to go back that far to recover a lost pool (and at
that point serious data corruption is probably inevitable). I mean, 16
txgs is typically at least 1 minute of history, if not more (on some
of my machines, 16 txgs is over 5 minute's worth of commits).
I agree we should start with simpler approach first and only if there is
a proof it can cause some problems switch to something more fancy.
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
+static void
+metaslab_trim(metaslab_t *msp, uint64_t txg, space_map_t *allocmap,
+ space_map_t *freed_map)
+{
+ if (msp->ms_cur_trimset == NULL) {
+ msp->ms_cur_trimset = metaslab_new_trimset(txg, allocmap);
+ }
Redundant brackets.
And the problem is? I use it this way all over the place. While I do
if (x > 5) {
y = 1;
}
is somewhat pointless, long assignments and especially multi-line
statements can benefit from this (see for instance metaslab.c:1943).
I can much more easily accept a style different from mine, but it is
harder for me to accept inconsistent style. :)

For me this line:

if (msp->ms_cur_trimset == NULL) {
msp->ms_cur_trimset = metaslab_new_trimset(txg, allocmap);
}

doesn't need brackets, because the entire body of the if fits into one
line nicely. On the other hand, here:

if (msp->ms_prev_trimset != NULL) {
space_map_walk(&msp->ms_prev_trimset->ts_map, space_map_remove,
allocmap);
}

I'd prefer to leave the brackets for readability even though they are
not needed.

As for the consistency, here are some examples where you don't use
brackets:

if (error == ENOTSUP || error == ENOTTY)
vd->vdev_notrim = B_TRUE;

if (rm->rm_col[c].rc_data != NULL)
zio_buf_free(rm->rm_col[c].rc_data,
rm->rm_col[c].rc_size);

for (c = 0; c < rm->rm_firstdatacol; c++)
rm->rm_col[c].rc_data =
zio_buf_alloc(rm->rm_col[c].rc_size);

if (io_data != NULL)
rm->rm_col[c].rc_data = io_data;

Some of the above don't use brackets to be consistent with the
surrounding code, which is a good thing, but it also shows that
excessive brackets are not welcome in ZFS code.

This is not a big deal, so let's not spend too much time on it.
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
I wonder if you could just drop df_last and if you need those extents
sorted you could do:
for (space_seg_t *ss = avl_last(&trim_map->sm_root); ss != NULL;
ss = AVL_PREV(&trim_map->sm_root, ss)) {
dkioc_free_extent_t *ext;

ext = kmem_zalloc(sizeof (dkioc_free_extent_t), KM_SLEEP);
ext->dfe_start = ss->ss_start;
ext->dfe_length = ss->ss_end - ss->ss_start;
ext->dfe_next = df->df_first;
df->df_first = ext;
}
+ if (df->df_first) {
[...]
Maybe, but not all consumers of this API use AVL trees - this would
force them to either implement backwards feeding in their own
algorithms (which might not be possible), or to keep around a pointer
to df_last on the stack (possible, but more cumbersome, especially if
you're not filling the list up in one go). I don't know, I just find
it good taste to have both ends of the list always at hand.
This all assumes that we want those ranges to be sorted by offset, which
at least for SSDs is not required. On the other hand, keeping them sorted
by offset doesn't cost much and is more flexible for whatever the future
brings, so let's leave it your way.
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
+ ASSERT(df != NULL);
Pointers are not booleans - in the ASSERT() above you compare it
with NULL, which gives boolean and is nice, but in the condition
above you treat pointer as boolean instead of checking it against
NULL. I know it compiles, but it is not consistent and it is not
pretty:)
I don't understand... at which point do I treat a pointer as a boolean
here? I don't, I'm merely checking for NULL pointers and letting the
kernel barf with a sensible "assertion failed df != NULL" rather than
a non-sensical "BAD TRAP at address 0x0" when used later - good
practice (which I rarely keep, shame on me) dictates to always check
your input values.
The ASSERT() is fine and I gave it just as an example that you check
pointers properly most of the time, but you removed the other example
that wasn't fine when quoting. I found two places where you treat
pointers as booleans:

if (df->df_first) {

if (!df)
Post by Sašo Kiselkov
Post by Pawel Jakub Dawidek
+ /*
+ * DKIOCFREE ioctl's need some special handling on interior
+ * vdevs. Specifically, they are only safe on mirror vdevs.
+ * We can't simply pass the same disk offsets to free on a
+ * raidz, since there the vdev offsets don't correspond to
+ * physical disk addresses.
+ * TODO: move this ioctl distribution machinery into the
+ * vdev implementations details (vdev_ops_t) to allow each
+ * vdev type to implement custom ioctl pre-processing.
+ */
+ zio = zio_null(pio, spa, vd, done, private, flags);
+ if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_MIRROR) == 0) {
+ for (c = 0; c < vd->vdev_children; c++)
+ zio_nowait(zio_ioctl(zio, spa,
+ vd->vdev_child[c], cmd,
+ NULL, NULL, priority, flags));
+ }
Or you could not pass it as ioctl, but as ZIO_TYPE_FREE and just
allow vdevs to do the math. You don't need to introduce additional
vdev method then and BTW all code is already there in my
implementation to handle ZIO_TYPE_FREE in RAIDZ vdevs (in addition
to mirror vdevs).
The reason why I'm using DKIOCFREE ioctls rather than your method of
introducing a new zio type is because this allows us to explicitly
pass a list of extents, rather than just a single offset+length,
making much better use of the underlying bus commands.
Ok, but note that ZIO_TYPE_FREE is already there, it is just not passed
down to vdevs.
Post by Sašo Kiselkov
I had a look and you were doing something very similar to what I was
already working on. Anyway, you did verify some of my assumptions I
had about raidz_map_alloc, so I can now put it up. See webrev for my
method. The main difference is, my method allows for multi-extent
ioctl's to be handled in one go (i.e. each leaf vdev gets a single
DKIOCFREE ioctl per metaslab, even if the leaf vdevs sit under mirrors
or raidzs).
Let me know what you think.
Unfortunately I won't be able to review it in detail, as I'll be away
next week.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Sašo Kiselkov
2013-03-03 22:12:43 UTC
Permalink
Sorry for the late response, I somehow overlooked your e-mail and skipped
it. Below are my responses.
Post by Pawel Jakub Dawidek
I much easier can accept style different than me, but it is harder
for me to accept inconsistent style:)
For me this line: [...snip...] Some of the above don't use backets
to be consistent with the surrounding code, which is a good thing,
but also shows that excessive brackets are not welcome in ZFS
code.
This is not a big deal, so let's not spend too much time on it.
I guess we're just having a case of difference of taste. I agree with
you that we shouldn't spend too much time on this - as long as people
don't complain that the code is unreadable, I suppose we can focus on
more pressing issues. I don't want to seem like imposing myself on
others here, though - if this is an eyesore to people, by all means,
let's rewrite it.
Post by Pawel Jakub Dawidek
This all assumes that we want those ranges to be sorted by offset,
which at least for SSDs is not required. On the other hand keeping
it sorted by offset doesn't cost much and is more flexible for
whatever the feature will bring, so let's leave it your way.
Hm, the commands themselves don't require sorting, true, but I guess
if we get this feature "for free" we should keep it. As you note, it
might come in handy.
Post by Pawel Jakub Dawidek
The ASSERT() is fine and I gave it just as an example that you
check pointers properly most of the time, but you removed the other
example that wasn't fine when quoting. I found two places where you
if (df->df_first) {
if (!df)
Ah, I thought by 'in the condition above' you were talking about the
condition in the ASSERT - my bad. I fixed the conditionals in the
latest webrevs. If you find more, please point them out to me.
Post by Pawel Jakub Dawidek
Ok, but note that ZIO_TYPE_FREE is already there, it is just not
passed down to vdevs.
Hm, far be it from me to introduce unnecessary cruft, but I think this
helps extend trim semantics in a meaningful way. Perhaps, on the other
hand, ZIO_TYPE_FREE is something we could lose, if it isn't really
used anywhere.
Post by Pawel Jakub Dawidek
Unfortunately I won't be able to review it in detail, as I'll be
away next week.
Me too, though I'll be on e-mail. Anyway, I greatly appreciate yours
and Etienne's many insightful comments!

Cheers,
--
Saso
Matthew Ahrens
2013-11-22 20:11:19 UTC
Permalink
Saso, what's the status of these changes? You may need to merge them with
the latest illumos code, which has rearranged some of the space_map code.

--matt
Saso Kiselkov
2013-11-23 03:09:59 UTC
Permalink
Post by Matthew Ahrens
Saso, what's the status of these changes? You may need to merge them
with the latest illumos code, which has rearranged some of the space_map
code.
Hi Matt,

I haven't had the inclination / willpower to cut into sd.c yet to get
the show on the road, but seeing as that's probably the only way to get
this done, I'm simply gonna have to muster up the mental attitude. As
for rebasing on latest bits - I already did that about two weeks ago
while trying something out. The stuff at
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/ should actually
compile fine on latest git. I'm also gonna find the time to finalize the
ZFS portions of it.

Cheers,
--
Saso
Matthew Ahrens
2013-11-23 03:22:25 UTC
Permalink
Post by Saso Kiselkov
Post by Matthew Ahrens
Saso, what's the status of these changes? You may need to merge them
with the latest illumos code, which has rearranged some of the space_map
code.
Hi Matt,
I haven't had the inclination / willpower to cut into sd.c yet to get
the show on the road, but seeing as that's probably the only way go get
this done, I'm simply gonna have to muster up the mental attitude. As
for rebasing on latest bits - I already did that about two weeks ago
while trying something out. The stuff at
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/ should actually
compile fine on latest git. I'm also gonna find my time to finalize the
ZFS-portions of it.
Great! I'll take another look at the ZFS code. Unfortunately I don't have
any experience with sd.c.

--matt



Saso Kiselkov
2013-11-23 03:27:16 UTC
Permalink
Post by Sašo Kiselkov
Post by Matthew Ahrens
Saso, what's the status of these changes? You may need to merge them
with the latest illumos code, which has rearranged some of the
space_map
Post by Matthew Ahrens
code.
Hi Matt,
I haven't had the inclination / willpower to cut into sd.c yet to get
the show on the road, but seeing as that's probably the only way go get
this done, I'm simply gonna have to muster up the mental attitude. As
for rebasing on latest bits - I already did that about two weeks ago
while trying something out. The stuff at
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/ should actually
compile fine on latest git. I'm also gonna find my time to finalize the
ZFS-portions of it.
Great! I'll take another look at the ZFS code. Unfortunately I don't
have any experience with sd.c.
If you've got a moment, can you please instead take a look at L2ARC
persistency? Not that the unmap stuff isn't worth a look, but I think
L2ARC persistency is a bit more mature and closer to being ready (and by
extension, worth your time to review).

Cheers,
--
Saso
Etienne Dechamps
2013-03-01 23:15:21 UTC
Permalink
Post by Sašo Kiselkov
Here we go, done: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
Note that, unless I missed something in your patch, you're not issuing
discards on cache devices since, IIRC, they're not using the metaslab.
Having discards on L2ARC could be quite useful because cache devices are
often SSDs.

My patch implements discards on L2ARC; you can get some inspiration from
the following commits:

https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc
https://github.com/dechamps/zfs/commit/17122c31ac7f82875e837019205c21651c05f8cd
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Sašo Kiselkov
2013-03-01 23:25:18 UTC
Permalink
Post by Etienne Dechamps
Post by Sašo Kiselkov
Here we go, done: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
Note that, unless I missed something in your patch, you're not issuing
discards on cache devices since, IIRC, they're not using the metaslab.
Having discards on L2ARC could be quite useful because cache devices are
often SSDs.
Correct, this is by design. Since cache devices are not written to in a
synchronous fashion, write latency has little effect on l2arc
throughput. Moreover, trimming them on device add/remove messes up
persistent l2arc.
Post by Etienne Dechamps
My patch implements discards on L2ARC; you can get some inspiration from
https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc
https://github.com/dechamps/zfs/commit/17122c31ac7f82875e837019205c21651c05f8cd
Interesting, you are trimming buffers when they are evicted from l2arc -
this could lead to non-trivial bottlenecks with high arc memory
pressure. I see your point - doing trims as part of good housekeeping -
but seeing as it doesn't actually accelerate any performance-critical
portion of the l2arc (keep in mind, the l2arc is read-sensitive, not
write-sensitive), I see little reason to introduce this fragility.
Perhaps I'm missing something?

Cheers,
--
Saso
Etienne Dechamps
2013-03-01 23:32:38 UTC
Permalink
Post by Sašo Kiselkov
Interesting, you are trimming buffers when they are evicted from l2arc -
this could lead to non-trivial bottlenecks with high arc memory
pressure. I see your point - doing trims as part of good housekeeping -
but seeing as it doesn't actually accelerate any performance-critical
portion of the l2arc (keep in mind, the l2arc is read-sensitive, not
write-sensitive), I see little reason to introduce this fragility.
Perhaps I'm missing something?
Actually, you may be right. I was going to say that TRIMs are also
important for wear leveling, but I guess considering that the L2ARC is
basically a giant ring buffer, writes should be evenly spread in the
first place. So yeah, maybe this code is not very useful after all. I
guess I'm so obsessed with having TRIM everywhere that it's clouding my
judgment.
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Sašo Kiselkov
2013-03-01 23:43:16 UTC
Permalink
Post by Etienne Dechamps
Post by Sašo Kiselkov
Interesting, you are trimming buffers when they are evicted from l2arc -
this could lead to non-trivial bottlenecks with high arc memory
pressure. I see your point - doing trims as part of good housekeeping -
but seeing as it doesn't actually accelerate any performance-critical
portion of the l2arc (keep in mind, the l2arc is read-sensitive, not
write-sensitive), I see little reason to introduce this fragility.
Perhaps I'm missing something?
Actually, you may be right. I was going to say that TRIMs are also
important for wear leveling, but I guess considering that the L2ARC is
basically a giant ring buffer, writes should be evenly spread in the
first place. So yeah, maybe this code is not very useful after all. I
guess I'm so obsessed with having TRIM everywhere that it's clouding my
judgment.
No problem, I was thinking just like you were, and then I read stuff like:
http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review/3
only to realize that my previous assumptions about SSDs being dumb flash
devices were completely wrong. These are some pretty smart pieces of
hardware that do much better wear-leveling and reallocation control than
we ever could (due to not having all the hardware implementation details
in our reach).

Cheers,
--
Saso
Etienne Dechamps
2013-03-02 22:30:00 UTC
Permalink
Post by Sašo Kiselkov
Post by Etienne Dechamps
Note that, unless I missed something in your patch, you're not issuing
discards on cache devices since, IIRC, they're not using the metaslab.
Having discards on L2ARC could be quite useful because cache devices are
often SSDs.
Correct, this is by design. Since cache devices are not written to in a
synchronous fashion, write latency has little effect on l2arc
throughput.
Come to think of it... you are right that write performance is not
important on cache devices, but let's not forget that the underlying
physical device might be used for other things than just L2ARC. Indeed,
a very popular way to use an SSD with ZFS is to create two partitions
and use one for the ZIL and one for the L2ARC. I guess that in this
situation, issuing discards to the cache partition could actually help
with the performance of the ZIL partition, for which synchronous write
performance is critical. That's a little far-fetched though.
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Sašo Kiselkov
2013-03-02 22:38:55 UTC
Permalink
Post by Etienne Dechamps
Post by Sašo Kiselkov
Post by Etienne Dechamps
Note that, unless I missed something in your patch, you're not issuing
discards on cache devices since, IIRC, they're not using the metaslab.
Having discards on L2ARC could be quite useful because cache devices are
often SSDs.
Correct, this is by design. Since cache devices are not written to in a
synchronous fashion, write latency has little effect on l2arc
throughput.
Come to think of it... you are right that write performance is not
important on cache devices, but let's not forget that the underlying
physical device might be used for other things than just L2ARC. Indeed,
a very popular way to use an SSD with ZFS is to create two partitions
and use one for the ZIL and one for the L2ARC. I guess that in this
situation, issuing discards to the cache partition could actually help
with the performance of the ZIL partition, for which synchronous write
performance is critical. That's a little far-fetched though.
Actually, you have a very interesting point here. Trouble is, I highly
doubt that issuing these frees in the l2arc_evict function will do any
good, seeing as that space will immediately be claimed again by
l2arc_write_buffers. This comes hand in hand with the fact that we
typically always keep the L2ARC at or near 100% full, meaning, TRIM
benefits are relatively few and far between. I'm not opposed to the idea
of doing TRIMs on l2arc_evict per se, I just don't see the point of it.

As for keeping the log device on the same device as the l2cache - it's
common practice and I wouldn't regard it as a breach of good (or
economic) pool design. Slog only needs a few GBs and modern MLC SSDs are
ridiculously large, so it would be a bit of a crime *not* to use the
excess for extra caching.

Cheers,
--
Saso
Etienne Dechamps
2013-03-02 23:06:55 UTC
Permalink
Post by Sašo Kiselkov
Post by Etienne Dechamps
Come to think of it... you are right that write performance is not
important on cache devices, but let's not forget that the underlying
physical device might be used for other things than just L2ARC. Indeed,
a very popular way to use an SSD with ZFS is to create two partitions
and use one for the ZIL and one for the L2ARC. I guess that in this
situation, issuing discards to the cache partition could actually help
with the performance of the ZIL partition, for which synchronous write
performance is critical. That's a little far-fetched though.
Actually, you have a very interesting point here. Trouble is, I highly
doubt that issuing these frees in the l2arc_evict function will do any
good, seeing as that space will immediately be claimed again by
l2arc_write_buffers. This comes hand in hand with the fact that we
typically always keep the L2ARC at or near 100% full, meaning, TRIM
benefits are relatively few and far between. I'm not opposed to the idea
of doing TRIMs on l2arc_evict per se, I just don't see the point of it.
Well, l2arc_evict() is not the only code path that removes L2ARC
entries. IIRC, there are other functions like arc_hdr_destroy() and
arc_release() that result in L2ARC entries being removed because the
data they hold is obsolete or because pointers to L2ARC entries
are being evicted from RAM as a result of memory pressure.

In these cases, the resulting "holes" are *not* immediately claimed by
l2arc_write_buffers(). The space they occupied will only be rewritten
when the L2ARC hand reaches their position, which could potentially take
a very long time on large cache devices. For this reason, it would be
wise to discard them in the meantime.
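A toy model of that bookkeeping might look like the following. All names are hypothetical (this is not the ARC code); it just shows "remember the hole now, discard it when convenient":

#include <stdio.h>
#include <stdint.h>

#define	MAX_PENDING	64

typedef struct hole {
	uint64_t daddr;	/* device offset of the dropped buffer */
	uint64_t size;
} hole_t;

static hole_t pending[MAX_PENDING];
static int npending;

/* Called where arc_hdr_destroy()/arc_release() would drop an L2ARC entry. */
static void
l2arc_note_hole(uint64_t daddr, uint64_t size)
{
	if (npending < MAX_PENDING) {
		pending[npending].daddr = daddr;
		pending[npending].size = size;
		npending++;
	}
}

/* Called later, e.g. from the feed/evict path, when the device is otherwise idle. */
static void
l2arc_flush_holes(void)
{
	for (int i = 0; i < npending; i++)
		printf("discard %llu+%llu\n",
		    (unsigned long long)pending[i].daddr,
		    (unsigned long long)pending[i].size);
	npending = 0;
}

int
main(void)
{
	l2arc_note_hole(4096, 8192);
	l2arc_note_hole(65536, 16384);
	l2arc_flush_holes();
	return (0);
}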
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Sašo Kiselkov
2013-03-03 10:39:08 UTC
Permalink
Post by Etienne Dechamps
Post by Sašo Kiselkov
Post by Etienne Dechamps
Come to think of it... you are right that write performance is not
important on cache devices, but let's not forget that the underlying
physical device might be used for other things than just L2ARC. Indeed,
a very popular way to use an SSD with ZFS is to create two partitions
and use one for the ZIL and one for the L2ARC. I guess that in this
situation, issuing discards to the cache partition could actually help
with the performance of the ZIL partition, for which synchronous write
performance is critical. That's a little far-fetched though.
Actually, you have a very interesting point here. Trouble is, I highly
doubt that issuing these frees in the l2arc_evict function will do any
good, seeing as that space will immediately be claimed again by
l2arc_write_buffers. This comes hand in hand with the fact that we
typically always keep the L2ARC at or near 100% full, meaning, TRIM
benefits are relatively few and far between. I'm not opposed to the idea
of doing TRIMs on l2arc_evict per se, I just don't see the point of it.
Well, l2arc_evict() is not the only code path that removes L2ARC
entries. IIRC, there are other functions like arc_hdr_destroy() and
arc_release() that result in L2ARC entries being removed because the
data the data they hold is obsolete or because pointers to L2ARC entries
are being evicted from RAM as a result of memory pressure.
In these cases, the resulting "holes" are *not* immediately claimed by
l2arc_write_buffers(). The space they occupied will only be rewritten
when the L2ARC hand reaches their position, which could potentially take
a very long time on large cache devices. For this reason, it would be
wise to discard them in the mean time.
You are right, L2ARC evicts happen ahead of time when an ARC buffer is
dirtied (due to being written to). On systems with a significant
proportion of write activity, this can, over time, indeed amount to
quite large data volumes. I'll investigate whether we could exploit this
to some benefit.

Cheers,
--
Saso
Jim Klimov
2013-03-02 23:40:45 UTC
Permalink
Post by Sašo Kiselkov
As for keeping the log device on the same device as the l2cache - it's
common practice and I wouldn't regard it as a breach of good (or
economic) pool design. Slog only needs a few GBs and modern MLC SSDs are
ridiculously large, so it would be a bit of a crime *not* to use the
excess for extra caching.
This is an interesting sub-thread of the overall discussion.
All sides have good arguments, so I still can't pick the "right" side ;)

In regard to TRIM and L2ARC, I wanted to propose trimming of the blocks
read-in from the L2ARC - since they won't be used again. But this is
exactly what we'd do during persistent L2ARC import. Basically, we'd
need most of this ring buffer again (except for those blocks that would
become obsolete because of being overwritten on the pool, so old DVA's
and data become irrelevant - but I guess it is too expensive to track
these obsoletions to TRIM only those ranges of the L2ARC SSD device).

However in regard to SLOG - do we trim obsolete entries? Possibly,
keeping some SSD space unused (manual over-provisioning) plus TRIMming
old ZIL entries would help ensure better performance of ZIL even if it
coexists with an L2ARC slice? I know the recommended SLOG size is to
hold about 3 TXGs at your pool's bandwidth, but in off-peak times this
is an overkill, and there may be some obsolete entries (older than 3TXG)
right?

//Jim
Etienne Dechamps
2013-03-03 10:23:43 UTC
Permalink
Post by Jim Klimov
(except for those blocks that would
become obsolete because of being overwritten on the pool, so old DVA's
and data become irrelevant - but I guess it is too expensive to track
these obsoletions to TRIM only those ranges of the L2ARC SSD device).
I don't think so. In fact, my patch actually does precisely that (it
schedules discards on arc_hdr_destroy() and arc_release()), although
it's using the TRIM range map code from Pawel.
Post by Jim Klimov
However in regard to SLOG - do we trim obsolete entries? Possibly,
keeping some SSD space unused (manual over-provisioning) plus TRIMming
old ZIL entries would help ensure better performance of ZIL even if it
coexists with an L2ARC slice? I know the recommended SLOG size is to
hold about 3 TXGs at your pool's bandwidth, but in off-peak times this
is an overkill, and there may be some obsolete entries (older than 3TXG)
right?
Yes. ZIL uses the metaslab, so discarding old ZIL entries should already
work using Sašo's patch. Pawel's patch (and so mine) supports it too. It
doesn't really require additional code. Still, I wonder if issuing
discards to the L2ARC slice could improve performance with the ZIL
slice. As I said, it's a little far-fetched.
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Sašo Kiselkov
2013-03-03 10:34:22 UTC
Permalink
Post by Etienne Dechamps
Still, I wonder if issuing
discards to the L2ARC slice could improve performance with the ZIL
slice. As I said, it's a little far-fetched.
We don't currently have in-kernel support for TRIM in Illumos, so it's
difficult for me to verify this, but if you could pull over my changes
to Linux, perhaps you could test this out. I'm curious myself.

Cheers,
--
Saso
Sašo Kiselkov
2013-03-03 20:48:14 UTC
Permalink
Post by Etienne Dechamps
Post by Jim Klimov
(except for those blocks that would
become obsolete because of being overwritten on the pool, so old DVA's
and data become irrelevant - but I guess it is too expensive to track
these obsoletions to TRIM only those ranges of the L2ARC SSD device).
I don't think so. In fact, my patch actually does precisely that (it
schedules discards on arc_hdr_destroy() and arc_release()), although
it's using the TRIM range map code from Pawel.
Post by Jim Klimov
However in regard to SLOG - do we trim obsolete entries? Possibly,
keeping some SSD space unused (manual over-provisioning) plus TRIMming
old ZIL entries would help ensure better performance of ZIL even if it
coexists with an L2ARC slice? I know the recommended SLOG size is to
hold about 3 TXGs at your pool's bandwidth, but in off-peak times this
is an overkill, and there may be some obsolete entries (older than 3TXG)
right?
Yes. ZIL uses the metaslab, so discarding old ZIL entries should already
work using Sašo's patch. Pawel's patch (and so mine) supports it too. It
doesn't really require additional code. Still, I wonder if issuing
discards to the L2ARC slice could improve performance with the ZIL
slice. As I said, it's a little far-fetched.
Just a quickie, I implemented TRIM for L2ARC now. What we do there is we
track extents that are flushed out of L2ARC (either due to an
arc_release, arc_hdr_destroy, l2arc write error or just regular
l2arc_evict), put them in a space map (to coalesce adjacent extents) and
then trim them at the end of each l2arc_evict.
The l2arc_trim itself is done in the l2arc_feed_thread, so it doesn't
block normal ARC operation and it is a write barrier (the trim is issued
'sync'), so as to protect the subsequent l2arc_write_buffers from
reordering.
I purposely didn't add a l2arc_trim at device addition or removal to
allow for recovering these buffers once #3525 gets integrated.
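For illustration, the coalescing step boils down to something like the toy below; the real code gets this for free from space_map_t and, as described above, issues the resulting trim synchronously from l2arc_feed_thread as a barrier. Nothing here is the webrev code.

#include <stdio.h>
#include <stdint.h>

typedef struct range {
	uint64_t start;
	uint64_t end;
} range_t;

/*
 * Merge a new evicted extent into a sorted, non-overlapping array.
 * For simplicity this only handles extents that touch the last range,
 * which is the common case for a ring-buffer style eviction pattern.
 */
static int
coalesce(range_t *r, int n, uint64_t start, uint64_t len)
{
	if (n > 0 && start <= r[n - 1].end) {
		if (start + len > r[n - 1].end)
			r[n - 1].end = start + len;
		return (n);
	}
	r[n].start = start;
	r[n].end = start + len;
	return (n + 1);
}

int
main(void)
{
	range_t r[16];
	int n = 0;

	n = coalesce(r, n, 0, 8192);
	n = coalesce(r, n, 8192, 8192);		/* touches previous: merges */
	n = coalesce(r, n, 65536, 4096);	/* gap: new range */
	for (int i = 0; i < n; i++)
		printf("trim %llu-%llu\n",
		    (unsigned long long)r[i].start,
		    (unsigned long long)r[i].end);
	return (0);
}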
P.S. code here: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/

--
Saso
Tomas Forsman
2013-03-02 17:13:31 UTC
Permalink
Post by Sašo Kiselkov
Here we go, done: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
*) I renamed zfs_nounmap to zfs_notrim and made it an extern in vdev.h
A personal quirk, having negations in the name can be confusing..
"unset nodisable_blargh=false" ..

"Do you not want to use trim?" -> "yes, I do" or "yes, that's correct, I
do not" ..

zfs_dotrim = yes/no is quite clear what it does.

/Tomas
--
Tomas Forsman, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Etienne Dechamps
2013-02-27 21:45:09 UTC
Permalink
Hi Sašo,
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI Unmap and
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
I am the one who ported Pawel's patch to ZFS On Linux. You will find
more details at the following URL:

https://github.com/zfsonlinux/zfs/pull/1016

I didn't take an in-depth look at your patch yet, and I probably won't
have time to do so in the near future. That said, I would like to bring
to your attention the following features that I would find useful in any
potential "discard" (which I use as a generic term for UNMAP, TRIM &
co.) implementation:

- Discard should not break disaster recovery (i.e. zpool import -F).
George apparently has a suggestion on how to make sure your patch
complies with this requirement. Pawel's patch is already compliant.

- Discards should be batched. For example in my proposed patch for ZFS
On Linux this is done by issuing discards every N TXGs (with e.g. N =
32) instead of each TXG. This is important because on most devices, a
few large discards will be done much faster and more efficiently than a
lot of small discards.

- Care should be taken with regard to proper scheduling of discard
requests vs. read/write requests (i.e. proper queuing and priority). The
reason for this is that, AFAIK, on some devices (especially SATA
devices), issuing discard requests effectively kills the performance of
the device for read/write operations. For this reason, discard requests
should only be issued when the device is otherwise idle. My patch
doesn't fulfill this requirement yet, and as a result, one of the
testers is complaining that his system is basically unusable when my
patch is issuing discard requests.

- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux. Same thing when
exporting, removing or destroying cache vdevs. My patch implements this.

- If discards are delayed (which I think is unavoidable to comply with
requirements 1 and 2), then it would be nice to make sure pending
discards don't get lost in case the pool is exported, gracefully or not
(e.g. power failure). My patch doesn't implement this.

We now have two competing ZFS TRIM/UNMAP implementations in the wild. I
don't know if I should be happy or sad about this :/
--
Etienne Dechamps
Phone: +44 74 50 65 82 17
Nico Williams
2013-02-27 22:01:14 UTC
Permalink
Post by Etienne Dechamps
We now have two competing ZFS TRIM/UNMAP implementations in the wild. I
don't know if I should be happy or sad about this :/
"Yes".

In general it's OK to have two competing implementations of any one
feature, as long as a) on-disk formats are interoperable, b)
preferably we settle for just one implementation in the long term (to
keep diffs to a minimum and thus merges easier).

Nico
--
Sašo Kiselkov
2013-02-27 21:58:25 UTC
Permalink
Post by Etienne Dechamps
Hi Sašo,
Hi Etienne,
Post by Etienne Dechamps
I am the one who ported Pawel's patch to ZFS On Linux. You will find
https://github.com/zfsonlinux/zfs/pull/1016
Thanks, I'll take a look.
Post by Etienne Dechamps
I didn't take an in-depth look at your patch yet, and I probably won't
have time to do so in the near future. That said, I would like to bring
to your attention the following features that I would find useful in any
potential "discard" (which I use as a generic term for UNMAP, TRIM &
- Discard should not break disaster recovery (i.e. zpool import -F).
George apparently has a suggestion on how to make sure your patch
complies with this requirement.
With George's changes we delay "discards" by one txg. That being said,
there's probably room for improvement.
Post by Etienne Dechamps
- Discards should be batched. For example in my proposed patch for ZFS
On Linux this is done by issuing discards every N TXGs (with e.g. N =
32) instead of each TXG. This is important because on most devices, a
few large discards will be done much faster and more efficiently than a
lot of small discards.
The current implementation sends one discard per metaslab per txg,
containing all the extents to be freed. Of course, aggregation would
improve this, but I think we should quantify this before going on to
implementing it for its own sake.
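
For clarity, this is roughly what "one discard per metaslab per txg"
boils down to - a sketch only, where the field and helper names are
illustrative and not necessarily what the webrev uses:

/* Sketch: aggregate a metaslab's freed extents into one request. */
#include <stdint.h>
#include <stdlib.h>

typedef struct dkioc_free_extent {
    uint64_t                 dfe_start;   /* byte offset on the vdev */
    uint64_t                 dfe_length;  /* byte length */
    struct dkioc_free_extent *dfe_next;
} dkioc_free_extent_t;

typedef struct dkioc_free {
    uint64_t             df_flags;
    dkioc_free_extent_t  *df_exts;        /* linked list of extents */
} dkioc_free_t;

/* Assumed helpers: walk the freed map and issue the DKIOCFREE zio. */
extern int  freed_map_next(void *map, uint64_t *start, uint64_t *len);
extern void vdev_discard_ioctl(void *vd, dkioc_free_t *df);

void
metaslab_discard_freed(void *vd, void *freed_map)
{
    dkioc_free_t df = { 0, NULL };
    uint64_t start, len;

    /* Chain every freed extent into a single request... */
    while (freed_map_next(freed_map, &start, &len) == 0) {
        dkioc_free_extent_t *ext = malloc(sizeof (*ext));
        if (ext == NULL)
            break;
        ext->dfe_start = start;
        ext->dfe_length = len;
        ext->dfe_next = df.df_exts;
        df.df_exts = ext;
    }

    /* ...so the device sees one UNMAP/TRIM covering all of them. */
    if (df.df_exts != NULL)
        vdev_discard_ioctl(vd, &df);

    while (df.df_exts != NULL) {
        dkioc_free_extent_t *ext = df.df_exts;
        df.df_exts = ext->dfe_next;
        free(ext);
    }
}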
Post by Etienne Dechamps
- Care should be taken with regard to proper scheduling of discard
requests vs. read/write requests (i.e. proper queuing and priority). The
reason for this is that, AFAIK, on some devices (especially SATA
devices), issuing discard requests effectively kills the performance of
the device for read/write operations. For this reason, discard requests
should only be issued when the device is otherwise idle. My patch
doesn't fulfill this requirement yet, and as a result, one of the
testers is complaining that his system is basically unusable when my
patch is issuing discard requests.
Interesting, I'll look into this. Perhaps we can implement some sort of
"pressure valve" which would delay and aggregate these ops when needed.
Post by Etienne Dechamps
- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux.
As I noted before to Pawel, it's a nice thing to have, though probably
not strictly a big performance improvement.
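
If we ever do it, it should fall out of the same DKIOCFREE path almost
for free: a single extent covering the whole device, issued before any
labels are written. A throwaway sketch, again with illustrative names
rather than the real structure layout:

/* Sketch: discard the entire device once, before labels go down. */
#include <stdint.h>
#include <stddef.h>

typedef struct dkioc_free_extent {
    uint64_t                 dfe_start;
    uint64_t                 dfe_length;
    struct dkioc_free_extent *dfe_next;
} dkioc_free_extent_t;

typedef struct dkioc_free {
    uint64_t             df_flags;
    dkioc_free_extent_t  *df_exts;
} dkioc_free_t;

extern void vdev_discard_ioctl(void *vd, dkioc_free_t *df);

void
vdev_discard_whole_device(void *vd, uint64_t device_size)
{
    /* One extent spanning the device, handed to the ioctl path. */
    dkioc_free_extent_t ext = { 0, device_size, NULL };
    dkioc_free_t df = { 0, &ext };

    vdev_discard_ioctl(vd, &df);
}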
Post by Etienne Dechamps
Same thing when exporting, removing or destroying cache vdevs.
This would break https://www.illumos.org/issues/3525 so I disagree here.
Post by Etienne Dechamps
- If discards are delayed (which I think is unavoidable to comply with
requirements 1 and 2), then it would be nice to make sure pending
discards don't get lost in case the pool is exported, gracefully or not
(e.g. power failure).
The non-graceful shutdown mode would necessitate that we rework discard
support to operate at the object layer and cache them as transactions in
the ZIL. I see this as a rather egregious layering violation - the DMU
attempting to infer details about the operation of the SPA (and its
metaslabs) or the structures below it. Losing a few discards isn't the
end of the world. The worst outcome is that we don't get the speed
benefits, or that we keep a few more blocks around on thin-provisioned
storage than we absolutely need to (those will get freed as soon as
they're reused and discarded again anyway).
Post by Etienne Dechamps
We now have two competing ZFS TRIM/UNMAP implementations in the wild. I
don't know if I should be happy or sad about this :/
To ZFS users *how* we do it is irrelevant, only that we get it right. I
believe that free space management belongs in the allocator, since
that's the "right place" to do block allocation management (doh).
Doing it as a post-hoc add-on in zio's is kind of like
trying to implement some clever hardware RAID controller tricks while
lacking info about the logical structure of the data. We're ZFS, we have
the info, so we don't need to do that.

Cheers,
--
Saso
Nico Williams
2013-02-27 22:32:56 UTC
Permalink
Post by Sašo Kiselkov
Post by Etienne Dechamps
- If discards are delayed (which I think is unavoidable to comply with
requirements 1 and 2), then it would be nice to make sure pending
discards don't get lost in case the pool is exported, gracefully or not
(e.g. power failure).
The non-graceful shutdown mode would necessitate that we rework discard
support to operate at the object layer and cache them as transactions in
the ZIL. I see this as a rather egregious layering violation - the DMU
attempting to infer details about the operation of the SPA (and its
metaslabs) or the structures below it. Losing a few discards isn't the
end of the world. The worst outcome is that we don't get the speed
benefits, or that we keep a few more blocks around on thin-provisioned
storage than we absolutely need to (those will get freed as soon as
they're reused and discarded again anyway).
I don't think this is a layer violation if you're willing to lose up
to one transaction's worth of discards: SPA builds a list of discards,
at the end of a TXG commit we hoist that list into DMU where we write
it in an object and/or ZIL for the next TXG.

If we have an unclean shutdown we can recompute the list of discards
by looking at the transaction we recovered from.

Now we have the to-be-discarded block list in the DMU and we can
issue the discards periodically: either every N transactions, when
we're idle for long enough, or when we are low on free space.
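
Very roughly, and with every name below made up (none of this is an
existing DMU interface), the shape of it would be:

/* Hypothetical persistence sketch; these names do not exist in ZFS. */
#include <stdint.h>
#include <stddef.h>

typedef struct pending_discard {
    uint64_t               pd_vdev;
    uint64_t               pd_offset;
    uint64_t               pd_length;
    struct pending_discard *pd_next;
} pending_discard_t;

/* Assumed helpers wrapping the DMU: append to / walk a log object. */
extern void discard_log_append(void *mos, uint64_t obj,
    const pending_discard_t *pd, void *tx);
extern int  discard_log_next(void *mos, uint64_t obj,
    pending_discard_t *pd);
extern void vdev_issue_discard(uint64_t vdev, uint64_t off, uint64_t len);

/* End of TXG sync: persist what we intend to discard later. */
void
discard_log_sync(void *mos, uint64_t obj, pending_discard_t *list,
    void *tx)
{
    const pending_discard_t *pd;

    for (pd = list; pd != NULL; pd = pd->pd_next)
        discard_log_append(mos, obj, pd, tx);
}

/* After import, clean or not: replay whatever is still in the log. */
void
discard_log_replay(void *mos, uint64_t obj)
{
    pending_discard_t pd;

    while (discard_log_next(mos, obj, &pd) == 0)
        vdev_issue_discard(pd.pd_vdev, pd.pd_offset, pd.pd_length);
}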

Nico
--
Jim Klimov
2013-02-28 00:07:08 UTC
Permalink
Post by Sašo Kiselkov
Post by Etienne Dechamps
- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux.
As I noted before to Pawel, it's a nice thing to have, though probably
not strictly a big performance improvement.
From what I read about SSD's, their performance can be sub-optimal
if they write small pieces of data into larger flash blocks that
their controller considers as holding data - causing RMW lags and
wear on the chips.

So, issuing a big "discard" should at least let the controller know
that it can reuse all those ranges and on the background it would
start to un-program the cells involved. When you get to writing data
into the previously-used device, be it resilvering or just plain
pool component, or perhaps a ZIL device, your IOs would hopefully
complete faster and with less pain to the SSD drive's longevity.

But this is theory, I am not sure if anyone quantified and verified
it for all the different devices there are on the market ;)

//Jim
Timothy Coalson
2013-02-28 02:40:02 UTC
Permalink
Post by Jim Klimov
Post by Etienne Dechamps
- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux.
As I noted before to Pawel, it's a nice thing to have, though probably
not strictly a big performance improvement.
From what I read about SSD's, their performance can be sub-optimal
if they write small pieces of data into larger flash blocks that
their controller considers as holding data - causing RMW lags and
wear on the chips.
So, issuing a big "discard" should at least let the controller know
that it can reuse all those ranges and on the background it would
start to un-program the cells involved. When you get to writing data
into the previously-used device, be it resilvering or just plain
pool component, or perhaps a ZIL device, your IOs would hopefully
complete faster and with less pain to the SSD drive's longevity.
But this is theory, I am not sure if anyone quantified and verified
it for all the different devices there are on the market ;)
I gather it works differently - if your SSD is doing RMW cycles, it has
insufficient "free" blocks and no overprovisioning (even sustained writes
don't cause continuous RMW as long as there is some overprovisioning or
otherwise "free" blocks, though the drive may have to wait for a known
free block to be erased, which can hurt performance), your write
performance will be virtually dead, and you'll be causing a lot more wear
on the flash. As I understand it, the erase mechanism erases much larger
chunks than the sector size (afaik, all writes are a multiple of sector
size), and the SSD has no trouble writing a single sector to any
sector-sized block of flash that is empty (that is, not written to since
the last erasure), even if the other ones in the same erasure block are
not empty. However, there is really no correspondence between which LBA
was written and which flash cells the data ends up in: the SSD
controller/firmware always writes to an empty block, which can be
anywhere (usually chosen as one that has not been erased much, for wear
leveling). If that LBA was previously not "free", the block that used to
contain the data for that LBA is marked as not in use, and can later be
erased in the background once all of the other blocks in that erase
chunk are also unused (or when the drive wants to move the other blocks,
to consolidate space or for wear leveling, etc.).

As for the performance of lots of small discards versus a few large
discards, that may be mainly a firmware question, but I do not think it
is related to RMW, at least not of the written sectors (the information
the controller uses to track which LBAs went where might have some RMW,
I really don't know, and it may be highly variable by
controller/firmware). I have heard of it being a major performance issue
before; I seem to recall reading that ext4 delays discards for this
reason. I do not know of anyone quantifying this performance problem
across many devices, though. I did find this anecdote:

http://oss.sgi.com/archives/xfs/2011-11/msg00108.html

My apologies if this is off the original topic.

Tim



Timothy Coalson
2013-02-28 03:40:34 UTC
Permalink
...I seem to recall reading that ext4 delays discards for this reason.
I think I have to take this back; some searching makes it seem like
Linux/ext4 still issues TRIMs immediately. I think Windows
delays/coalesces TRIM, though, which may be what I remembered.
I do not know of anyone quantifying this performance problem on many
http://oss.sgi.com/archives/xfs/2011-11/msg00108.html
Some additional anecdotes: this problem is not present with
ext4/linux-3.0.0-31 with discard on my vertex 3, and from some comments,
the crucial m4 doesn't seem to be affected, either, while the vertex 2 is
affected to a lesser degree.

My apologies if this is off the original topic.
Tim
Freddie Cash
2013-02-28 04:05:05 UTC
Permalink
Anandtech.com has in-depth benchmarks that include:
- empty drive
- full drive
- trim'd drive
- secure erased drive

The performance difference between empty and full is quite significant,
while the difference between empty and trim'd is very small.

However, leaving 25% of the drive as unpartitioned gives similar
performance to a trim'd drive.
Post by Jim Klimov
Post by Etienne Dechamps
- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux.
As I noted before to Pawel, it's a nice thing to have, though probably
not strictly a big performance improvement.
From what I read about SSD's, their performance can be sub-optimal
if they write small pieces of data into larger flash blocks that
their controller considers as holding data - causing RMW lags and
wear on the chips.
So, issuing a big "discard" should at least let the controller know
that it can reuse all those ranges and on the background it would
start to un-program the cells involved. When you get to writing data
into the previously-used device, be it resilvering or just plain
pool component, or perhaps a ZIL device, your IOs would hopefully
complete faster and with less pain to the SSD drive's longevity.
But this is theory, I am not sure if anyone quantified and verified
it for all the different devices there are on the market ;)
//Jim
Timothy Coalson
2013-02-28 06:09:11 UTC
Permalink
Post by Jim Klimov
Post by Etienne Dechamps
- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux.
As I noted before to Pawel, it's a nice thing to have, though probably
not strictly a big performance improvement.
From what I read about SSD's, their performance can be sub-optimal
if they write small pieces of data into larger flash blocks that
their controller considers as holding data - causing RMW lags and
wear on the chips.
So, issuing a big "discard" should at least let the controller know
that it can reuse all those ranges and on the background it would
start to un-program the cells involved. When you get to writing data
into the previously-used device, be it resilvering or just plain
pool component, or perhaps a ZIL device, your IOs would hopefully
complete faster and with less pain to the SSD drive's longevity.
But this is theory, I am not sure if anyone quantified and verified
it for all the different devices there are on the market ;)
I managed to miss your point despite your careful quoting. My apologies to
the list for the noise.

Tim



Jim Klimov
2013-02-28 15:33:35 UTC
Permalink
Post by Timothy Coalson
I managed to miss your point despite your careful quoting. My apologies
to the list for the noise.
Shame on me as a writer, then.

To be concise (which I'm long trying to learn to be), the
statement I countered was that "discard request for the
whole vdev" is "a nice thing to have, though probably not
strictly a big performance improvement".

To detail it a bit more in other words:

My gut feeling is that with SSDs, such device-wide discard
*can* be a big improvement at least until the device gets
more and more full so that RMW kicks in again and wears
out the flash cells excessively.

In fact, if I understand another poster right, not only do
the larger cell blocks (pages?) store many "sectors" (which
I knew), but they might also store "sectors" whose LBA
numbers are not adjacent (which I was not aware of, OTOH).

Then, if the device is bulk-filled (as in resilvering) with
lots of data, even randomly-addressed (LBA) writes might,
from the flash chips' point of view, be coalesced into
full-page writes which just reprogram the whole page's
content from zeroes to data and don't RMW it soon. The
trick is that after the big TRIM event, the SSD knows that
all pages are to be overwritten and does not bother to save
their old partial contents (as in RMWing a sector into a
page full of other data). Also, in the background, the
SSD's controller might begin to prepare erased pages for
further reprogramming as they become needed.

If this is all correct (which I don't really know - but tests
and theory seem to support it), then the sentiment of "not
strictly a big performance improvement" is not correct. This
*would* improve both performance and reliability (longevity)
of the SSD devices reused from some obsolete task into a pool
VDEV.

If I still haven't conveyed my idea... well, probably I had
too little sleep in the past weeks, and I can't write well.
Sorry for the amplified noise, then.

Thanks,
//Jim
Timothy Coalson
2013-02-28 21:29:45 UTC
Permalink
Post by Jim Klimov
Post by Timothy Coalson
I managed to miss your point despite your careful quoting. My apologies
to the list for the noise.
Shame on me as a writer, then.
My fault, not yours. It seemed pretty clear on a second reading.

Post by Jim Klimov
To be concise (which I'm long trying to learn to be), the
statement I countered was that "discard request for the
whole vdev" is "a nice thing to have, though probably not
strictly a big performance improvement".
It depends quite a bit on how the SSD was previously used. The magnitude
of the maximum difference depends on quite a few factors, including
firmware, NAND type, and internal overprovisioning. In all cases, though,
if it is initially trimmed, it will take heavier use to hit the point
where it has to wait for an erasure before it can write new data. Here's
a fairly relevant collection of graphs on the worst case - but beware, it
uses a log scale on most of them, so the performance hit doesn't look as
big as it is:

http://www.anandtech.com/show/6725/the-full-intel-ssd-525-review-30gb-60gb-120gb-180gb-240gb-tested/4
Post by Jim Klimov
My gut feeling is that with SSDs, such device-wide discard
*can* be a big improvement at least until the device gets
more and more full so that RMW kicks in again and wears
out the flash cells excessively.
In fact, if I get another poster right, not only do the
larger cell blocks (pages?) store many "sectors" (which
I knew), but also they might store "sectors" whose LBA
numbers are not adjacent (which I was not aware OTOH).
That is my understanding, and also that the main problem for performance
is being forced to wait for a page to be erased, not that the data is
necessarily written back to the same cells (in fact, one cycle of reading
a page and writing its used blocks back out somewhere can make several
blocks available for more writing, so typical RMW is probably not the
best way to think of it). As for longevity, I think that mainly depends
on how often it has to rewrite existing data. While these two things
become more severe
for related reasons, they may not always be directly linked - as long as
there are some already erased blocks available, writes can burst at close
to full speed (though obviously for shorter periods if the drive has fewer
unused pages), even if the garbage collection during low usage periods does
a lot of shifting of data (wear on flash cells) in order to obtain more
erasable pages. Obviously, knowing that the blocks are initially unused is
a significant help to the SSD controller, for both problems.

Tim



Richard Elling
2013-02-28 05:30:33 UTC
Permalink
Post by Sašo Kiselkov
Post by Etienne Dechamps
Hi Sašo,
Hi Etienne,
Post by Etienne Dechamps
I am the one who ported Pawel's patch to ZFS On Linux. You will find
more details at the following URL:
https://github.com/zfsonlinux/zfs/pull/1016
Thanks, I'll take a look.
Nice work, Etienne!
Post by Sašo Kiselkov
Post by Etienne Dechamps
I didn't take an in-depth look at your patch yet, and I probably won't
have time to do so in the near future. That said, I would like to bring
to your attention the following features that I would find useful in any
potential "discard" (which I use as a generic term for UNMAP, TRIM &
- Discard should not break disaster recovery (i.e. zpool import -F).
George apparently has a suggestion on how to make sure your patch
complies with this requirement.
With George's changes we delay "discards" by one txg. That being said,
there's probably room for improvement.
In my experience with recoveries, one txg is not enough. I've seen several
cases where we had to go back more than one txg. In this respect, I prefer
Pawel's implementation.
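
If the deferral were a tunable number of txgs rather than exactly one,
recovery depth becomes a policy knob. A rough sketch, with every name
below made up (this is from neither patch):

/* Sketch of an N-txg deferral; every name here is made up. */
#include <stdint.h>
#include <stddef.h>

typedef struct freed_extents freed_extents_t; /* opaque per-txg free list */

extern void vdev_issue_discards(freed_extents_t *fe);

uint64_t zfs_discard_txg_delay = 32;   /* txgs to hold freed space back */

#define DISCARD_RING 64                /* must exceed any sane delay */
static freed_extents_t *discard_ring[DISCARD_RING];

/*
 * Called once per txg sync: stash this txg's frees and only discard
 * the entry that is now zfs_discard_txg_delay txgs old, so that
 * "zpool import -F" can roll back that far without landing on blocks
 * that have already been discarded.
 */
void
discard_txg_done(uint64_t txg, freed_extents_t *freed_this_txg)
{
    uint64_t slot = txg % DISCARD_RING;

    discard_ring[slot] = freed_this_txg;
    if (txg >= zfs_discard_txg_delay) {
        uint64_t old = (txg - zfs_discard_txg_delay) % DISCARD_RING;

        if (discard_ring[old] != NULL) {
            vdev_issue_discards(discard_ring[old]);
            discard_ring[old] = NULL;
        }
    }
}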
-- richard
--
ZFS and performance consulting
http://www.RichardElling.com
Etienne Dechamps
2013-02-27 21:43:18 UTC
Permalink
Hi Sašo,
Post by Sašo Kiselkov
So I've been working on getting some initial support for SCSI Unmap and
http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/
I am the one who ported Pawel's patch to ZFS On Linux. You will find
more details at the following URL:

https://github.com/zfsonlinux/zfs/pull/1016

I didn't take an in-depth look at your patch yet, and I probably won't
have time to do so in the near future. That said, I would like to bring
to your attention the following features that I would find useful in any
potential "discard" (which I use as a generic term for UNMAP, TRIM &
co.) implementation:

- Discard should not break disaster recovery (i.e. zpool import -F).
George apparently has a suggestion on how to make sure your patch
complies with this requirement. Pawel's patch is already compliant.

- Discards should be batched. For example in my proposed patch for ZFS
On Linux this is done by issuing discards every N TXGs (with e.g. N =
32) instead of each TXG. This is important because on most devices, a
few large discards will be done much faster and more efficiently than a
lot of small discards.

- Care should be taken with regard to proper scheduling of discard
requests vs. read/write requests (i.e. proper queuing and priority). The
reason for this is that, AFAIK, on some devices (especially SATA
devices), issuing discard requests effectively kills the performance of
the device for read/write operations. For this reason, discard requests
should only be issued when the device is otherwise idle. My patch
doesn't fulfill this requirement yet, and as a result, one of the
testers is complaining that his system is basically unusable when my
patch is issuing discard requests.

- It would be nice to have zpool create/add/attach issue a discard
request for the whole vdev to make sure we have a clean vdev to begin
with. This is what e.g. mke2fs does by default on Linux. Same thing when
exporting, removing or destroying cache vdevs. My patch implements this.

- If discards are delayed (which I think is unavoidable to comply with
requirements 1 and 2), then it would be nice to make sure pending
discards don't get lost in case the pool is exported, gracefully or not
(e.g. power failure). My patch doesn't implement this.

We now have two competing ZFS TRIM/UNMAP implementations in the wild. I
don't know if I should be happy or sad about this :/
--
Etienne Dechamps
Phone: +44 74 50 65 82 17