Guidance for 'ZFS L2ARC persistence' idea

Post by Prasad Joshi
http://wiki.illumos.org/display/illumos/Project+Ideas and after
talking with e^ipi over IRC, I have decided to work on "ZFS L2ARC
persistence". Would anyone be interested in guiding me out?

I believe Saso has almost finished that:

http://www.listbox.com/member/archive/182191/2013/05/search/cGVyc2lzdGVudA/sort/time/page/1/entry/7:10/20130515181637:19F41FF6-BDAD-11E2-88B3-F9DF5BC54725/

His last update was in May though, I'm not sure when he'll get back to
it.

Prasad Joshi

2013-07-30 22:58:52 UTC

Thanks.

Is there any other ZFS work (project) which is not under development?

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/24824859-3356d804
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Paul B. Henson

2013-07-31 00:51:35 UTC

Post by Prasad Joshi
Is there any other ZFS work (project) which is not under development?

Have you looked through the issues database?

https://www.illumos.org/projects/illumos-gate/issues

There are a variety of bugs and feature requests of various levels of
difficulty, perhaps one would interest you.

Personally, I'd be interested in seeing TRIM/UNMAP support integrated
into illumos, I believe both the freebsd and linux forks have developed
different implementations.

Saso Kiselkov

2013-07-30 23:15:19 UTC

Post by Paul B. Henson
http://www.listbox.com/member/archive/182191/2013/05/search/cGVyc2lzdGVudA/sort/time/page/1/entry/7:10/20130515181637:19F41FF6-BDAD-11E2-88B3-F9DF5BC54725/
His last update was in May though, I'm not sure when he'll get back to
it.

Hi Paul,

It was working in May, but unfortunately I made a few changes after that
that broke some parts. As luck would have it, though, I'm working right
now on getting that sorted out and ready for upstreaming again.

Cheers,

--
Saso

Paul B. Henson

2013-07-31 00:53:29 UTC

Post by Saso Kiselkov
It was working in May, but unfortunately I made a few changes after that
that broke some parts. As luck would have it, though, I'm working right
now on getting that sorted out and ready for upstreaming again.

Cool; I'll keep my fingers crossed it gets integrated in time for the
next omnios stable :).

Thanks…

Saso Kiselkov

2013-07-31 13:08:23 UTC

Post by Paul B. Henson
Cool; I'll keep my fingers crossed it gets integrated in time for the
next omnios stable :).
Thanks…

Ok, I think I've got it working again. If people could give it a kick:
http://cr.illumos.org/~webrev/skiselkov/3525_take3/

In order to start using the persistency you need to do an L2-cache
remove & add again to regenerate the vdev configuration to mark the
device as persistency-capable. I haven't done a lot of extensive testing
on this so far, I need to get some more hardware for that, so anybody
able to conduct some early testing on this would help a great deal in
getting this stuff accelerated. Thanks!

Cheers,

--
Saso

Paul B. Henson

2013-08-01 03:10:08 UTC

Post by Saso Kiselkov
http://cr.illumos.org/~webrev/skiselkov/3525_take3/

Cool. You don't happen to have an omnios stable compatible binary :)? I
still haven't rebuilt my illumos dev box :(, I don't really want to put
together another OI system, and after some initial discussion there
ended up being no progress towards being able to build illumos-gate
under omnios <sigh>. I suppose I should just go ahead and reinstall OI
or I'll probably be stalled indefinitely.

Thanks…

Matthew Ahrens

2013-08-02 00:08:32 UTC

Saso - I took a quick look through the high-level comments and structure
definitions. Here are some initial comments:

list.h -- that's what list_move_tail() is for.

l2uberblock -- this is an on-disk structure. We cannot allow it to have
different representations in different compilation environments. E.g. on
64-bit, the compiler will insert padding after ub_pbuf_asize so that
ub_pbuf_cksum will be 64-bit aligned. But on 32-bit it will not. Do not
allow the compiler to add padding; and do not use enums (whose size is not
defined). Add explicit padding (e.g. after ub_version and ub_pbuf_asize)
and use uint*_t rather than enums. You can still declare
l2uberblock_flags_t as an enum, just declare ub_flags as a uint32_t.

what are the possible values for ub_alloc_space?

l2pbuf -- obviously this is not an on-disk structure, because you have a
pointer embedded in it, and it is not called X_phys_t. So why do you need
a magic number (pb_magic)?

pb_buflists_list -- what are possible items? If only bufs (l2pbuf_buf_t)
then say that. In general be as explicit as possible. E.g. "This is a
list of l2pbuf_buflist_t's, which each points to a list of l2pbuf_buf_t's"

4283 - what about unscheduled downtime? Is the l2arc not persistent if we
crash?

4291 - "what what's" - extra "what"

4300 - would you consider having 2 l2uberblocks and round-robin between
them, so that we can survive a power failure in the middle of writing the
l2uber?

4338 - why are you repeating the struct definition in the comment? Refer
us to the struct, or put the actual struct definition here. Having this in
2 places is just an opportunity for it to get out of date. It already
doesn't match the l2uberblock_t -- e.g. the space before the ub_cksum,
padding after pbuf_asize, and sizeof (ub_flags).

4353 - does compressing the array of struct l2pbuf_buf_item have a
substantial benefit? Where is this struct defined?
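
The padding concern raised for l2uberblock above can be illustrated with a
minimal sketch. The struct name, field order and pad field below are
assumptions made up for this example and are not the layout from the webrev;
only the field names ub_version, ub_flags, ub_pbuf_asize and ub_pbuf_cksum
come from the review. It assumes a C11 compiler for _Static_assert.

```c
#include <stddef.h>
#include <stdint.h>

/* The flag values may still be named with an enum ... */
typedef enum example_ub_flags {
    EXAMPLE_UB_FLAG_A = 1 << 0    /* hypothetical flag */
} example_ub_flags_t;

/*
 * ... but the stored field is a fixed-width integer, and every gap the
 * compiler might otherwise insert is spelled out as an explicit pad.
 */
typedef struct example_ub_phys {
    uint64_t ub_magic;
    uint32_t ub_version;
    uint32_t ub_pad0;           /* explicit padding after ub_version */
    uint32_t ub_flags;          /* holds example_ub_flags_t values */
    uint32_t ub_pbuf_asize;
    uint64_t ub_pbuf_cksum[4];  /* starts on an 8-byte boundary */
} example_ub_phys_t;

/* Fail the build if any compilation environment lays this out differently. */
_Static_assert(sizeof (example_ub_phys_t) == 56, "unexpected padding");
_Static_assert(offsetof(example_ub_phys_t, ub_pbuf_cksum) == 24,
    "checksum is not where the on-disk format expects it");
```

With the pads explicit and the sizes asserted, a 32-bit and a 64-bit build
either agree on the layout or refuse to compile.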

--matt

Saso Kiselkov

2013-08-02 09:44:35 UTC

Post by Matthew Ahrens
Saso - I took a quick look through the high-level comments and structure

Hi Matt, responses below, updated webrev here:
http://cr.illumos.org/~webrev/skiselkov/3525_take4/

Post by Matthew Ahrens
list.h -- that's what list_move_tail() is for.

Reworded it to say:

"When copying structures with lists use list_move_tail() to move the
list from the src to dst (the source reference will then become invalid)."

I just think these kinds of gotchas should be made very clear, that's
why I added the comment.
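
The gotcha can be sketched as follows. This is a self-contained toy list
written for illustration, not the illumos list.h implementation; all ex_*
names are hypothetical. A plain struct copy would duplicate the list head
while the nodes keep pointing at the old head, so the nodes have to be
re-linked onto the destination instead.

```c
#include <stddef.h>

typedef struct ex_node {
    struct ex_node *next;
    struct ex_node *prev;
} ex_node_t;

/* Circular list: an empty list's head points at itself. */
typedef struct ex_list {
    ex_node_t head;
} ex_list_t;

void
ex_list_init(ex_list_t *l)
{
    l->head.next = l->head.prev = &l->head;
}

int
ex_list_is_empty(const ex_list_t *l)
{
    return (l->head.next == &l->head);
}

void
ex_list_insert_tail(ex_list_t *l, ex_node_t *n)
{
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
}

/* Append every node of src to the tail of dst and leave src empty. */
void
ex_list_move_tail(ex_list_t *dst, ex_list_t *src)
{
    ex_node_t *first, *last;

    if (ex_list_is_empty(src))
        return;
    first = src->head.next;
    last = src->head.prev;
    first->prev = dst->head.prev;
    dst->head.prev->next = first;
    last->next = &dst->head;
    dst->head.prev = last;
    ex_list_init(src);    /* src no longer references the moved nodes */
}
```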

Post by Matthew Ahrens
l2uberblock -- this is an on-disk structure. We cannot allow it to
have different representations in different compilation environments.
E.g. on 64-bit, the compiler will insert padding after ub_pbuf_asize so
that ub_pbuf_cksum will be 64-bit aligned. But on 32-bit it will not.
Do not allow the compiler to add padding; and do not use enums (whose
size is not defined). Add explicit padding (e.g. after ub_version and
ub_pbuf_asize) and use uint*_t rather than enums. You can still declare
l2uberblock_flags_t as an enum, just declare ub_flags as a uint32_t.

It's not used as an on-disk structure. I never ever use direct
bcopy/memcpy to read or write these. Instead, there are the
l2arc_uberblock_[en|de]code functions which take care of that. The
actual on-device data structures are described in arc.c:4316 (and there
you'll notice that in fact all fields have exact sizes).
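
A minimal sketch of this encode/decode approach, assuming a made-up
three-field layout and big-endian byte order (both are assumptions for
illustration, as are all example_* names; this is not the code from the
webrev). The in-memory struct can be padded however the compiler likes,
because only the encoder and decoder define the on-device bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* In-memory form; its padding never reaches the device. */
typedef struct example_ub {
    uint32_t ub_version;
    uint64_t ub_alloc_space;
    uint32_t ub_pbuf_asize;
} example_ub_t;

#define EXAMPLE_UB_ENCODED_SIZE 16  /* 4 + 8 + 4 bytes, no gaps */

static void
put_be(uint8_t *p, uint64_t v, size_t nbytes)
{
    /* Most significant byte first. */
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

static uint64_t
get_be(const uint8_t *p, size_t nbytes)
{
    uint64_t v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return (v);
}

/* Serialize field by field at fixed offsets. */
void
example_ub_encode(const example_ub_t *ub, uint8_t *buf)
{
    put_be(buf + 0, ub->ub_version, 4);
    put_be(buf + 4, ub->ub_alloc_space, 8);
    put_be(buf + 12, ub->ub_pbuf_asize, 4);
}

/* Rebuild the in-memory form from the same fixed offsets. */
void
example_ub_decode(example_ub_t *ub, const uint8_t *buf)
{
    ub->ub_version = (uint32_t)get_be(buf + 0, 4);
    ub->ub_alloc_space = get_be(buf + 4, 8);
    ub->ub_pbuf_asize = (uint32_t)get_be(buf + 12, 4);
}
```

The trade-off against a packed X_phys_t struct is extra code per field in
exchange for a format that does not depend on compiler layout or host
endianness.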

Post by Matthew Ahrens
what are the possible values for ub_alloc_space?

See arc.c:4330:
uint64_t alloc_space; how much space is alloc'd on the dev

Post by Matthew Ahrens
l2pbuf -- obviously this is not an on-disk structure, because you have a
pointer embedded in it, and it is not called X_phys_t. So why do you
need a magic number (pb_magic)?

Because it's an in-memory representation of the fields of the on-disk
structure, without the alignment hassles (these are taken care of in the
encoder/decoder functions). This way anybody who needs to can verify
whether a pbuf follows the correct magic format by using it in ASSERTs.
Of course you might argue that it's not used that way right now, so we
might only want to make it local to l2arc_pbuf_decode, but the cost of
carrying around 4 extra bytes is negligible (at any given point in time
there are at most ~4 pbuf structures allocated), so I didn't worry about
the wasted overhead.

Also, I wasn't aware of the requirements for the <X>_phys_t naming
convention. How does that work?

Post by Matthew Ahrens
pb_buflists_list -- what are possible items? If only bufs
(l2pbuf_buf_t) then say that. In general be as explicit as possible.
E.g. "This is a list of l2pbuf_buflist_t's, which each points to a list
of l2pbuf_buf_t's"

Added.

Post by Matthew Ahrens
4283 - what about unscheduled downtime? Is the l2arc not persistent if
we crash?

Reworded it to say "any downtime".

Post by Matthew Ahrens
4291 - "what what's" - extra "what"

Fixed.

Post by Matthew Ahrens
4300 - would you consider having 2 l2uberblocks and round-robin between
them, so that we can survive a power failure in the middle of writing
the l2uber?

Since writing of the L2 uberblock isn't done in a transactionally-safe
manner (as opposed to the main pool uberblocks), I'm not sure there's
any point in bothering with this. L2 cache devices (mostly SSDs) are
really quite complex and it's dubious any of our simple assumptions
about write ordering or cache handling (since we don't do cache flushes
on L2ARC writes anyway) would apply.
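
For reference, the round-robin scheme being discussed could look roughly
like the sketch below. Everything here is hypothetical: the slot struct,
the sequence number and the us_valid flag (which stands in for a checksum
verification) are assumptions for illustration. As noted above, the scheme
only helps if the device actually honours write ordering, which this sketch
does not address.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_UB_SLOTS 2

typedef struct example_ub_slot {
    uint64_t us_seq;    /* incremented on every uberblock write */
    int us_valid;       /* stands in for "checksum verified" */
} example_ub_slot_t;

/* Slot that the write with sequence number seq should overwrite. */
size_t
example_ub_write_slot(uint64_t seq)
{
    return ((size_t)(seq % EXAMPLE_UB_SLOTS));
}

/* Return the newest slot that verifies, or NULL if none does. */
const example_ub_slot_t *
example_ub_pick(const example_ub_slot_t *slots)
{
    const example_ub_slot_t *best = NULL;

    for (size_t i = 0; i < EXAMPLE_UB_SLOTS; i++) {
        if (!slots[i].us_valid)
            continue;
        if (best == NULL || slots[i].us_seq > best->us_seq)
            best = &slots[i];
    }
    return (best);
}
```

If power fails while slot 1 is being rewritten, slot 0 still holds the
previous, intact uberblock and example_ub_pick() falls back to it.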

Post by Matthew Ahrens
4338 - why are you repeating the struct definition in the comment?
Refer us to the struct, or put the actual struct definition here.
Having this in 2 places is just an opportunity for it to get out of
date. It already doesn't match the l2uberblock_t -- e.g. the space
before the ub_cksum, padding after pbuf_asize, and sizeof (ub_flags).

This is the actual on-disk representation documentation and is
authoritative. The in-memory struct should follow it as far as is
practical for an implementation. Think of it as the formal spec to which
the implementation is coded. That the C struct appears "before" it is an
unfortunate coincidence of how our source code is organized.

Post by Matthew Ahrens
4353 - does compressing the array of struct l2pbuf_buf_item have a
substantial benefit?

In my tests typical metadata savings are on the order of 30-50% and the
cost of compression/decompression is negligible (~600 kB worth of
metadata compressed to ~350 kB on-disk per ~100 MB of user data).
l2_asize_to_meta_ratio gives the data:metadata ratio (in my tests I
usually get around 250:1, IOW metadata takes up <0.5% in L2ARC).
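
The arithmetic behind these figures can be checked with a small sketch. The
helper names are made up for this example, and the inputs are the rough
numbers quoted above, so the results are only approximate.

```c
/* Percentage of space saved by compressing raw down to compressed. */
double
example_savings_pct(double raw, double compressed)
{
    return (100.0 * (raw - compressed) / raw);
}

/* Share of the device taken by metadata, given a data:metadata ratio. */
double
example_meta_pct(double data, double meta)
{
    return (100.0 * meta / (data + meta));
}
```

Compressing ~600 kB to ~350 kB gives example_savings_pct(600, 350) of about
41.7, inside the quoted 30-50% range, and a 250:1 ratio gives
example_meta_pct(250, 1) of about 0.4, consistent with the "<0.5%" figure.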

Post by Matthew Ahrens
Where is this struct defined?

We implement that array using an l2pbuf_buflist_t. When a pbuf is
decoded from disk, all ARC buffer references are placed in a single
l2pbuf_buflist_t, which means l2pbuf_t.pb_nbuflists = 1. When generating
new pbufs to be written to disk, each l2arc_write_buffers() call
generates a new l2pbuf_buflist_t and chains it into the l2pbuf_t. The
encoding routine then iterates over the lists and serializes all ARC
buffer references into the on-disk format.
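
The chaining described here can be sketched as below. The struct and field
names are hypothetical stand-ins for l2pbuf_t, l2pbuf_buflist_t and
l2pbuf_buf_t, and copying into a flat array stands in for the real
serialization; none of this is taken from the webrev.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct example_buf {
    uint64_t eb_daddr;  /* offset of the buffer on the cache device */
    uint32_t eb_size;
} example_buf_t;

/* One list per write pass. */
typedef struct example_buflist {
    struct example_buflist *ebl_next;
    size_t ebl_nbufs;
    const example_buf_t *ebl_bufs;
} example_buflist_t;

typedef struct example_pbuf {
    example_buflist_t *ep_head;
    example_buflist_t *ep_tail;
    size_t ep_nbuflists;
} example_pbuf_t;

/* Chain the buflist produced by one write pass onto the pbuf. */
void
example_pbuf_add(example_pbuf_t *pb, example_buflist_t *bl)
{
    bl->ebl_next = NULL;
    if (pb->ep_tail == NULL)
        pb->ep_head = bl;
    else
        pb->ep_tail->ebl_next = bl;
    pb->ep_tail = bl;
    pb->ep_nbuflists++;
}

/* Walk every chained list and emit its entries in order. */
size_t
example_pbuf_flatten(const example_pbuf_t *pb, example_buf_t *out, size_t cap)
{
    size_t n = 0;

    for (const example_buflist_t *bl = pb->ep_head; bl != NULL;
        bl = bl->ebl_next) {
        for (size_t i = 0; i < bl->ebl_nbufs && n < cap; i++)
            out[n++] = bl->ebl_bufs[i];
    }
    return (n);
}
```

A decoded pbuf would correspond to a single list holding every entry, which
matches the pb_nbuflists = 1 case described above.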

Cheers,

--
Saso

Matthew Ahrens

2013-08-02 22:46:41 UTC

It seems like l2arc devices are shared globally. I.e. arc_buf's for any
pool can be written to any cache device. In particular, one pool's blocks
can be written to another pool's cache device. Is that correct? If so,
how do you handle it? Seems like either your pbuf should only contain
pointers to blocks from the containing pool, or you should store the pool's
guid in each on-disk entry.

--matt

Paul B. Henson

2013-08-03 01:06:44 UTC

Schlacta, Christ

2013-08-03 02:29:34 UTC

I've gotten numerous confirmations that cache devices are pool specific.
That said, I would love an option to add system specific cache devices and
multi pool cache devices

Post by Paul B. Henson
Huh, dunno about the internals, but that's not what the documentation
makes it sound like. My understanding has always been that cache devices
are added to a specific pool, and my assumption was that they only cached
data for that pool. I'll let someone more knowledgeable about internals
clarify, but if any configured l2arc device can cache data from any pool,
the documentation definitely needs some work…

Saso Kiselkov

2013-08-05 10:13:24 UTC

Post by Matthew Ahrens
It seems like l2arc devices are shared globally. I.e. arc_buf's for any
pool can be written to any cache device. In particular, one pool's
blocks can be written to another pool's cache device. Is that correct?
If so, how do you handle it? Seems like either your pbuf should only
contain pointers to blocks from the containing pool, or you should store
the pool's guid in each on-disk entry.

Yes, I remember that being mentioned somewhere, but I tested it just now
and it doesn't appear to work:

# mkfile -n 2g test test2 cache
# lofiadm -a cache
/dev/lofi/1
# zpool create pool_A /root/test cache /dev/lofi/1
# zpool status pool_A
  pool: pool_A
 state: ONLINE
  scan: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        pool_A         ONLINE       0     0     0
          /root/test   ONLINE       0     0     0
        cache
          /dev/lofi/1  ONLINE       0     0     0

# zpool create pool_B /root/test2 cache /dev/lofi/1
cannot create 'pool_B': one or more vdevs refer to the same device

So maybe it's libzfs checking some run-time status, let's export pool_A
and try again:

# zpool export pool_A
# zpool create pool_B /root/test2 cache /dev/lofi/1
# zpool import -d /root pool_A
# zpool status pool_A
  pool: pool_A
 state: ONLINE
status: One or more devices could not be used because the label is missing
        or invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME                   STATE     READ WRITE CKSUM
        pool_A                 ONLINE       0     0     0
          /root/test           ONLINE       0     0     0
        cache
          7651257546887543790  FAULTED      0     0     0  was /dev/lofi/1

errors: No known data errors

So it appears that once another zpool takes possession of an L2ARC
device, it overwrites the labels, rendering the device unusable on the
original pool. Also, looking at the design of struct l2arc_dev it
appears that this could never have worked. It holds both l2ad_hand,
which is our "current offset" pointer into the L2ARC device, and
l2ad_spa, so when the feed thread picks an L2ARC device to write to, it
automatically determines which SPA's buffers it will put there.

Should we fix this? My inclination is "no" - it appears this behavior
was in effect for a long time and so far nobody complained, so it would
appear nobody really needed it.

Cheers,

--
Saso

Matthew Ahrens

2013-08-05 16:12:37 UTC

Yeah, I realized after posting that l2arc_write_eligible() makes sure that
we only write blocks from the owning SPA. I don't think this needs to be
changed. We might add some more assertions along the l2arc write path to
verify that this is the case, though :)
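
A simplified sketch of such a check and of the kind of assertion that could
sit on the write path. The types and names are hypothetical and much
smaller than the real arc.c structures, and the real l2arc_write_eligible()
may test further conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct example_l2dev {
    uint64_t ed_spa_guid;   /* pool that owns this cache device */
    uint64_t ed_hand;       /* next write offset on the device */
} example_l2dev_t;

typedef struct example_hdr {
    uint64_t eh_spa_guid;   /* pool the cached block belongs to */
    bool eh_on_l2;          /* already present on a cache device */
} example_hdr_t;

/* Only blocks from the owning pool that are not cached yet qualify. */
bool
example_write_eligible(const example_l2dev_t *dev, const example_hdr_t *hdr)
{
    if (hdr->eh_spa_guid != dev->ed_spa_guid)
        return (false);
    if (hdr->eh_on_l2)
        return (false);
    return (true);
}

/* Write path: re-verify ownership before advancing the device hand. */
void
example_write(example_l2dev_t *dev, const example_hdr_t *hdr, uint64_t size)
{
    assert(hdr->eh_spa_guid == dev->ed_spa_guid);
    dev->ed_hand += size;
}
```

The assertion is redundant as long as every caller filters through the
eligibility check first, which is exactly what it is there to verify.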

--matt

Yaverot

2013-07-31 18:30:56 UTC

Post by Saso Kiselkov
In order to start using the persistency you need to
do an L2-cache remove & add again to regenerate the
vdev configuration to mark the device as
persistency-capable. I haven't done a lot of extensive
testing on this so far, I need to get some more hardware
for that, so anybody able to conduct some early testing
on this would help a great deal in getting this stuff
accelerated. Thanks!

In making sure questions get asked:
Once turned on, will it be possible to make a new non-persistent cache?
Is there a usecase for a non-persistent cache once this feature exists?

(I don't have a personal stake in these, so please reply to the list only.)

Saso Kiselkov

2013-07-31 18:54:22 UTC

Post by Yaverot
Once turned on, will it be possible to make a new non-persistent cache?

No.

Post by Yaverot
Is there a usecase for a non-persistent cache once this feature exists?

Not that I know of, hence no tunable to turn the feature off.

Cheers,

--
Saso

Schweiss, Chip

2013-08-23 12:57:37 UTC

I've been following the progress of L2ARC persistence and I'm looking
forward to it being integrated, but I have a bit of a concern on its
implementation as I understand it.

I have been evaluating RSF-1 to manage hot failover on an active-active
setup. It can migrate my pools in about 3 seconds.

It is my understanding that the L2ARC refresh happens during pool import
and that the pool will not be available until it completes. One of the
pools I'm planning to use RSF-1 with has 4 480GB SLC SSDs for L2ARC. While
they read fast, it still takes a fair amount of time to read all of them.
Won't this make failover times jump significantly?

Am I wrong on this assumption? If not, here is a clear case for needing to
be able to turn it off. If it reloads the L2ARC in the background after
pool import, my point is moot.

-Chip

Saso Kiselkov

2013-08-23 13:00:56 UTC