Discussion:
[Review] #3525 Persistent L2ARC
Saso Kiselkov
2013-10-11 01:49:18 UTC
Permalink
I've decided to resurrect my original persistent L2ARC webrev from about
a year ago and rework it with additional design advice kindly provided
by Matt Ahrens. Interested folks, please take a preliminary view of this
changeset and let's get the ball rolling on getting this integrated.

http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

This partial rewrite accomplishes several important objectives:

1) Makes the design simpler and (I hope) easier to understand. I've cut
it by ~400 lines, it now follows proper naming conventions, and I got
rid of all the encoding/decoding functions.

2) More robust. We no longer "taste" the L2ARC device to see if there's
an uberblock there. Instead we store this attribute in the vdev
config (magic numbers are kept as fail-safes, of course).

3) Faster rebuilds. As before, rebuilds are done in background threads
and in parallel on all available L2ARC devices, so it doesn't block
pool import, but thanks to some restructuring of the way we link the
metadata chain I was able to optimize this phase, so that we saturate
the devices much better (rebuilding ~100GB worth of L2ARC takes ~2
seconds on my crappy little system).

Please consider this code alpha for now, as it hasn't seen much exposure
in production yet. I'm currently beating the crap out of it on my backup
machine. Any volunteers willing to help with testing are very much
welcome, especially if you reboot your machine a lot, or have
shared-storage pools capable of simulating ungraceful takeovers. I have
reasonable confidence that it won't trash your data, so worst case is
you'll see a kernel panic (make sure you can capture that if need be).
Please do not deploy into production if you value uptime.

Cheers,
--
Saso
Saso Kiselkov
2013-10-12 23:21:05 UTC
Permalink
While benchmarking the persistent L2ARC implementation I hit upon a
"slight" snag with smaller block sizes and rebuild times. In essence, if
you have many small blocks in your L2ARC (e.g. 8k due to hosting zvols),
rebuild can take forever to complete (tens of seconds). The reason has
to do with the ARC hash table implementation, so first a bit of background:

The ARC holds blocks in a hash table. No surprises there. Aside from the
obvious small room for improvement by marking all very hot hash table
manipulation functions as "inline", there are bigger problems that need
to be addressed. First of all, how does one size the hash table? Small
hash tables take up less space, but then result in more collisions and
longer bucket chains (which hurts performance). Larger hash tables have
better performance but are (doh) larger. I even went and tried switching
the implementation to AVL trees (moderate performance improvement in
some places, moderate performance decrease in others).

The current implementation hardcodes a guess at the correct hash
table size. In principle, it works by taking the amount of physical
memory and dividing it by a 64k block size; the result is the number of
hash buckets that will be created. On 64-bit machines this comes out to
roughly 128kB of hash table for each 1 GB of physical memory. This
approach has obvious flaws:

1. What if we need to cache many smaller blocks? We'll get excessively
long bucket chains (see the hash_chain_max and hash_collisions kstats
on your systems).

2. What if we need to cache a lot more blocks than fit into main memory?
The L2ARC needs to keep its ARC headers around.

3. What happens when the hash table size is excessively large, as happens
on bigmem systems? E.g. on a system with 1TB of physical memory (which
can be easily built nowadays, and it's only going to get cheaper) the
hash table will be a contiguous chunk of memory some 128MB in size.

If we try to combat problems #1 & #2 by making the hash table larger,
we exacerbate problem #3.

My solutions to each of these problems (implemented in the webrev for
issue #3525) are:

1: Changed the sizing algorithm to use a kernel tunable. This allows
administrators with specific workloads to change the hash table size
according to their needs without having to build custom kernel
modules.

2: Increased the hash table size so as to run at 1MB for every 1GB of
main memory. We can easily afford to lose 0.1% of memory capacity
considering the performance gains (outlined below), especially
since today many systems use L2ARC as an integral part of their
storage infrastructure.

3: Switched to a 2D hash table. This means that the largest contiguous
chunk of memory will be at most sqrt(hash_table_size).

Webrev: http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

Overall, the above changes boil down to the following hash table size
numbers (from small to large systems):

Physical RAM   HT Size   HT Dimensions
-------------------------------------------------
4 GB           4 MB      2048 x 4k cols
128 GB         128 MB    4096 x 32k cols
1 TB           1 GB      32768 x 256k cols
16 TB          16 GB     262144 x 2M cols

(For the actual algorithm see arc.c in the webrev.)

The performance gains from this are pretty substantial in my testing.
The time needed to rebuild 93 GB worth of ARC buffers consisting of a
mixture of 128k and 8k blocks (total number of ARC buffers: 6.4 million,
or ~15.4k average block size) decreased from ~50s to about 9.5s (with a
small portion of that being physical L2ARC read latency).

I apologize for this long litany of an email and look forward to your
comments and reviews.

Cheers,
--
Saso
Tim Cook
2013-10-13 02:06:23 UTC
Permalink
Not in a position to review the code, but appreciate the detailed
description of the problem and fix. No need to apologize for being verbose
;)

--Tim



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Matthew Ahrens
2013-10-14 20:35:06 UTC
Permalink
I agree that assuming 64K average block size is a bad idea. We should
assume smaller average block size, and also allow for this to be overridden
in /etc/system -- as you have done.

Are these changes linked with the persistent L2ARC work? Could we first
integrate the hash table size changes and evaluate persistency separately?
Post by Saso Kiselkov
3.What happens when the hash table size is excessively large, as happens
on bigmem systems?
Indeed, what happens? I would guess that it works just fine, especially
given that this will be executed early in boot, when nearly all of the
address space is available. Do you have evidence that kmem_alloc(1GB) does
not work?

--matt
Richard Yao
2013-10-15 02:34:10 UTC
Permalink
Post by Saso Kiselkov
The current implementation is hardcoded guesswork as to the correct hash
table size. In principle, it works by taking the amount of physical
memory and dividing it by a 64k block size. The result is the amount of
hash buckets that will be created. On 64-bit machines this comes out to
roughly 128kB of hash tables for each 1 GB of physical memory. This
What happens if you export the pool on one system and then try to import
it on a system with less memory? SAS switches make that scenario more
likely than most would expect.




Saso Kiselkov
2013-10-15 09:49:57 UTC
Permalink
Post by Richard Yao
Post by Saso Kiselkov
The current implementation is hardcoded guesswork as to the correct hash
table size. In principle, it works by taking the amount of physical
memory and dividing it by a 64k block size. The result is the amount of
hash buckets that will be created. On 64-bit machines this comes out to
roughly 128kB of hash tables for each 1 GB of physical memory. This
What happens if you export the pool on one system and then try to import
it on a system with less memory? SAS switches make that scenario more
likely than most would expect.
The hash table is sized by each system individually when they load the
zfs kernel module (it's a purely in-memory data structure), so one
system will have a larger hash table than the other. See buf_init() in arc.c

Cheers,
--
Saso
Prakash Surya
2013-10-15 21:26:09 UTC
Permalink
Post by Saso Kiselkov
The performance gains from this are pretty substantial in my testing.
The time needed to rebuild 93 GB worth of ARC buffers consisting of a
mixture of 128k and 8k blocks (total number of ARC buffers: 6.4 million,
or ~15.4k average block size) decreased from ~50s to about 9.5s (with a
small portion of that being physical L2ARC read latency).
Where exactly did the performance improvement come from? I'm not
doubting the work, I'm just curious. Is it faster because the hash
chains are shorter now (i.e. less "work" done under bucket lock leading
to better concurrency)? Or something else?

If the improvement does indeed come from increased concurrency due
to shorter hash chains, I'd imagine this would benefit workloads other
than L2ARC rebuild times.

As such, it would be nice to have a before/after workload showing the
better performance numbers without having to rely on the persistent
L2ARC changes. Especially since I see this work being split into its
own independent patch, as posted to the open-zfs list.
Post by Saso Kiselkov
I apologize for this long litany of an email and look forward to your
comments and reviews.
Actually, I think it was useful context.
--
Cheers, Prakash
Saso Kiselkov
2013-10-15 21:47:22 UTC
Permalink
Post by Prakash Surya
Where exactly did the performance improvement come from? I'm not
doubting the work, I'm just curious. Is it faster because the hash
chains are shorter now (i.e. less "work" done under bucket lock leading
to better concurrency)? Or something else?
If the improvement does indeed come from increased concurrency due
to shorter hash chains, I'd imagine this would benefit workloads other
than L2ARC rebuild times.
As such, it would be nice to have a before/after workload showing the
better performance numbers without having to rely on the persistent
L2ARC changes. Especially since I see this work being split into its
own independent patch, as posted to the open-zfs list.
The performance improvement has nothing to do with locking or
concurrency, it is simply due to the fact that we have to do less work
to find a particular buffer in the hash table.

I hope you won't find it presumptuous on my part if I assume you are
not very familiar with how the ARC hash table (or perhaps any hash
table?) works, so here's a quick recap:

* All reads in ZFS go through arc_read() to cache buffers.
* The request specifies a buffer by DVA.
* Thus the ARC needs some method of locating buffers in the cache
by its DVA.

When the ARC reads in a buffer from the pool, it quickly computes a
CRC64 "hash" from the buffer's DVA (plus a few other bits of info in the
arc_read() request). This hash is then used as an index into the hash
table. The items of the hash table are linked lists of ARC buffers. If
there already is an ARC buffer at the index, the current buffer is
appended to the list and we're done.

Similarly, if we're looking for a buffer by DVA, we compute the hash
index, then walk the linked list at that index in the hash table until
we find the buffer that exactly matches the search criteria. There may
be multiple buffers in the chain due to hash collisions (i.e. multiple
buffers with different DVAs hashing to the same index) - this is normal
and expected.

For details see buf_hash_find(), buf_hash_insert() and buf_hash_remove().

What my change does is expand the size of the hash table so that
fewer hash collisions occur, leading to shorter buffer chains and faster
lookups.

Cheers,
--
Saso
Prakash Surya
2013-10-16 01:42:05 UTC
Permalink
Post by Saso Kiselkov
The performance improvement has nothing to do with locking or
concurrency, it is simply due to the fact that we have to do less work
to find a particular buffer in the hash table.
OK, that is where I assumed the speed up was coming from (shorter
chains leading to faster lookups).

I also assumed there would be a "bucket lock" that needs to be acquired
during this lookup similar to the dbuf hash (which would affect
concurrency), but I guess that isn't the case here (I haven't studied
the arc hash code as well as I have the dbuf hash code).
Post by Saso Kiselkov
I hope you won't find it presumptuous on my part if I believe you are
not very familiar with how the ARC hash table (or perhaps any hash
* All reads in ZFS go through arc_read() to cache buffers.
* The request specifies a buffer by DVA.
* Thus the ARC needs some method of locating buffers in the cache
by its DVA.
When the ARC reads in a buffer from the pool, it quickly computes a
CRC64 "hash" from the buffer's DVA (plus a few other bits of info in the
arc_read() request). This hash is then used as an index into the hash
table. The items of the hash table are linked lists of ARC buffers. If
there already is an ARC buffer at the index, the current buffer is
appended to the list and we're done.
Similarly, if we're looking for a buffer by DVA, we compute the hash
index, then walk the linked list at that index until we find the
buffer that exactly matches the search criteria. There may be multiple
buffers in the chain due to hash collisions (i.e. multiple buffers with
different DVAs hashing to the same index) - this is normal and expected.
For details see buf_hash_find(), buf_hash_insert() and buf_hash_remove().
What my change does is it expands the size of the hash table so that
fewer hash collisions occur, leading to shorter buffer chains and faster
lookups.
So, if this simply comes down to a hash collision issue, can't we take
this a bit further? Can we make the hash size completely dynamic?
Instead of using another heuristic, can we grow (and shrink?) the hash
table as the workload demands? So if the max chain depth reaches a
threshold, we increase the number of hash buckets (e.g. double it).

Of course the details of performing the grow (i.e. rehash) operation
would need to be worked out so it doesn't affect performance,
consistency, etc.. But from a theoretical stand point, moving it to be
sized dynamically seems like a much better solution, IMO.
--
Cheers, Prakash
Saso Kiselkov
2013-10-16 23:09:20 UTC
Permalink
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
table we allocate to 1MB per 1GB of physical memory.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218

Cheers,
--
Saso
Steven Hartland
2013-10-16 23:24:58 UTC
Permalink
----- Original Message -----
From: "Saso Kiselkov" <***@gmail.com>
To: <***@open-zfs.org>; "illumos-zfs" <***@lists.illumos.org>
Sent: Thursday, October 17, 2013 12:09 AM
Subject: [OpenZFS Developer] [Review] Tunable ARC buf hash sizing
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Did you mean "This index is then masked with ht_mask" (remove
the word "by"?) in the following:
+ * which spits out a 64-bit hash index. This index is then masked
+ * by with ht_mask to obtain the final index into the hash table:

It may be a silly question but should the hash sizing not be based
around zfs_arc_max instead of physmem?

Regards
Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it.

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to ***@multiplay.co.uk.
Saso Kiselkov
2013-10-16 23:34:52 UTC
Permalink
Post by Steven Hartland
Sent: Thursday, October 17, 2013 12:09 AM
Subject: [OpenZFS Developer] [Review] Tunable ARC buf hash sizing
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Did you mean "This index is then masked with ht_mask" (remove
+ * which spits out a 64-bit hash index. This index is then masked
You're right, I was hastily editing the description to correspond to the
1D hash table stuff and forgot to delete a preposition. Updated.
Post by Steven Hartland
It may be a silly question but should the hash sizing not be based
around zfs_arc_max instead of physmem?
I don't see much reason to tie these two together. The in-memory size of
the ARC (zfs_arc_max) is independent of the performance of hash lookups
on it (keep in mind that the ARC also holds references to buffers in the
L2ARC).
--
Saso
Steven Hartland
2013-10-17 00:00:16 UTC
Permalink
----- Original Message -----
Post by Saso Kiselkov
Post by Steven Hartland
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Did you mean "This index is then masked with ht_mask" (remove
+ * which spits out a 64-bit hash index. This index is then masked
You're right, I was hastily editing the description to correspond to the
1D hash table stuff and forgot to delete a preposition. Updated.
Post by Steven Hartland
It may be a silly question but should the hash sizing not be based
around zfs_arc_max instead of physmem?
I don't see much reason to tie these two together. The in-memory size of
the ARC (zfs_arc_max) is independent of the performance of hash lookups
on it (keep in mind that the ARC also holds references to buffers in the
L2ARC).
How about the case where the admin has specifically sized a smaller
zfs_arc_max to keep ZFS / ARC memory requirements down because they want
the memory for other uses, and there is no L2ARC?

In this case, sizing the hash table based off the machine's physmem
could counteract this and hence cause a problem, could it not?

I know it's extreme, but take for example a machine with 256GB of RAM
but zfs_arc_max set to 1GB: you'd be allocating 256MB of that as the
hash size, which is surely a massive waste, as you wouldn't need 256MB
of hash for just 1GB of ARC buffers.

Am I still barking up the wrong tree?

Regards
Steve

Matthew Ahrens
2013-10-17 00:03:28 UTC
Permalink
Post by Steven Hartland
Post by Saso Kiselkov
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
Post by Saso Kiselkov
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Did you mean "This index is then masked with ht_mask" (remove
+ * which spits out a 64-bit hash index. This index is then masked
You're right, I was hastily editing the description to correspond to the
1D hash table stuff and forgot to delete a preposition. Updated.
It may be a silly question but should the hash sizing not be based
Post by Saso Kiselkov
around zfs_arc_max instead of physmem?
I don't see much reason to tie these two together. The in-memory size of
the ARC (zfs_arc_max) is independent of the performance of hash lookups
on it (keep in mind that the ARC also holds references to buffers in the
L2ARC).
How about the case where the admin has specifically sized a smaller
zfs_arc_max to keep ZFS / ARC memory requirements down as they want
the memory for other uses, and there is no L2ARC.
In this case, sizing the hash table based off the machine's physmem
could counteract this and hence cause a problem, could it not?
I know it's extreme, but take for example a machine with 256GB of RAM
but zfs_arc_max set to 1GB: you'd be allocating 256MB of that as the
hash size, which is surely a massive waste, as you wouldn't need 256MB
of hash for just 1GB of ARC buffers.
Am I still barking up the wrong tree?
They can dynamically change arc_c_max after we've booted, which could leave
the hash table much too small, if it was sized based on what zfs_arc_max
was when we booted.

I'd say keep it simple until we see a problem.

--matt



Saso Kiselkov
2013-10-17 00:11:32 UTC
Permalink
On Wed, Oct 16, 2013 at 5:00 PM, Steven Hartland
How about the case where the admin has specifically sized a smaller
zfs_arc_max to keep ZFS / ARC memory requirements down as they want
the memory for other uses, and there is no L2ARC.
In this case, sizing the hash table based off the machine's physmem
could counteract this and hence cause a problem, could it not?
I know it's extreme, but take for example a machine with 256GB of RAM
but zfs_arc_max set to 1GB: you'd be allocating 256MB of that as the
hash size, which is surely a massive waste, as you wouldn't need 256MB
of hash for just 1GB of ARC buffers.
Am I still barking up the wrong tree?
They can dynamically change arc_c_max after we've booted, which could
leave the hash table much too small, if it was sized based on what
zfs_arc_max was when we booted.
I'd say keep it simple until we see a problem.
+1.
--
Saso
Prakash Surya
2013-10-18 14:47:00 UTC
Permalink
Post by Saso Kiselkov
On Wed, Oct 16, 2013 at 5:00 PM, Steven Hartland
How about the case where the admin has specifically sized a smaller
zfs_arc_max to keep ZFS / ARC memory requirements down because they want
the memory for other uses, and there is no L2ARC?
In this case, sizing the hash based on the machine's physmem could
counteract this and hence cause a problem, could it not?
I know it's extreme, but for example on a machine with 256GB of RAM but
zfs_arc_max set to 1GB, you'd be allocating 256MB of that as the hash
size, which is surely a massive waste, as you wouldn't need 256MB of
hash for just 1GB of ARC buffers.
Am I still barking up the wrong tree?
They can dynamically change arc_c_max after we've booted, which could
leave the hash table much too small, if it was sized based on what
zfs_arc_max was when we booted.
I'd say keep it simple until we see a problem.
+1.
I agree. Also, if the admin is changing the default of "arc_c_max", then
they can also change the size of the hash table (right?) if they feel
it's necessary.
--
Cheers, Prakash
Post by Saso Kiselkov
--
Saso
Richard Yao
2013-10-16 23:37:34 UTC
Permalink
Post by Steven Hartland
Sent: Thursday, October 17, 2013 12:09 AM
Subject: [OpenZFS Developer] [Review] Tunable ARC buf hash sizing
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Did you mean "This index is then masked with ht_mask" (remove
+ * which spits out a 64-bit hash index. This index is then masked
It may be a silly question but should the hash sizing not be based
around zfs_arc_max instead of physmem?
No, because zfs_arc_max does not include l2arc.
Post by Steven Hartland
Regards
Steve
Matthew Ahrens
2013-10-17 00:50:38 UTC
Permalink
I like these changes -- fix the constant, make it tunable, keep it simple.

86-90: I think for a big theory statement we should document what the
parameters are for the hash table size, rather than the exact algorithm for
achieving it. More of the "why" and less of the "how". You might mention:
- number of buckets is at least enough to have 1 bucket for every 8k of
physical memory
- number of buckets is a power of 2
- at least 4096 buckets
- results in about 1MB of hash tables for every 1GB of memory.

233: zfs_arc_ht_base_masklen - I don't understand the name of this
variable. I would recommend:

/*
 * This variable sets the assumed average block size. The ARC hash
 * table will be sized large enough to avoid excessive chaining even
 * if all of memory is used to cache blocks of this size.
 */
uint64_t zfs_arc_average_block_size = 8 * 1024;

(buf_init() will then need to do a multiply rather than <<(ht_maxlen +
this).)

632: declare each member on its own line.

748: does declaring it inline improve performance? I suspect that it does,
at least on gcc with illumos. But in the past "inline" has been ignored
given the compiler flags used, so it's worth checking.

983: doesn't ht_num_locks need to be a power of 2? physmem is not
necessarily a power of 2.

If you want to add a tunable like zfs_arc_average_block_size for the
dbuf_hash_table, that would be cool too (to replace the constant 4096 in
dbuf_init).

--matt
Post by Saso Kiselkov
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Cheers,
--
Saso
Prakash Surya
2013-10-18 15:36:26 UTC
Permalink
Other than what Matt brought up, it would be nice to have a way to
override the default values without recompiling. But we can add that to
the Linux port as needed. I don't really know how you guys support that
sort of thing.

Also, it _might_ be useful to move statements like:

Using the default settings these values translate to ~1 MB of hash
tables for each 1 GB of physical memory.

closer to the variable declarations just to try and ensure the comment
is updated if the default values are ever changed. But maybe I'm being
overly pedantic.

Do you have any before/after values for the max chain length from your
testing? It should be easy(?) to grab it from the arc stats and would be
nice to have numbers to justify the change.

Looks good, overall. Thanks.
--
Cheers, Prakash
Post by Matthew Ahrens
I like these changes -- fix the constant, make it tunable, keep it simple.
86-90:  I think for a big theory statement we should document what the
parameters are for the hash table size, rather than the exact algorithm for
achieving it. More of the "why" and less of the "how". You might mention:
 - number of buckets is at least enough to have 1 bucket for every 8k of
physical memory
 - number of buckets is a power of 2
 - at least 4096 buckets
 - results in about 1MB of hash tables for every 1GB of memory.
233: zfs_arc_ht_base_masklen - I don't understand the name of this
variable. I would recommend:
/*
 * This variable sets the assumed average block size. The ARC hash
 * table will be sized large enough to avoid excessive chaining even
 * if all of memory is used to cache blocks of this size.
 */
uint64_t zfs_arc_average_block_size = 8 * 1024;
(buf_init() will then need to do a multiply rather than <<(ht_maxlen + this).)
632: declare each member on its own line.
748: does declaring it inline improve performance?  I suspect that it does, at
least on gcc with illumos. But in the past "inline" has been ignored
given the compiler flags used, so it's worth checking.
983: doesn't ht_num_locks need to be a power of 2?  physmem is not necessarily
a power of 2.
If you want to add a tunable like zfs_arc_average_block_size for the
dbuf_hash_table, that would be cool too (to replace the constant 4096 in
dbuf_init).
--matt
Okay, I've reworked and simplified the ARC buf hash changes to just
change the way we size the hash table and bump up the amount of hash
tables we allocate to 1MB/1GB.
Webrev: http://cr.illumos.org/~webrev/skiselkov/new_buf_hash/
Issue: https://www.illumos.org/issues/4218
Cheers,
--
Saso
Saso Kiselkov
2013-10-18 17:16:21 UTC
Permalink
Post by Prakash Surya
Other than what Matt brought up, it would be nice to have a way to
override the default values without recompiling. But we can add that to
the Linux port as needed. I don't really know how you guys support that
sort of thing.
On the Linux port I think the guys add sysctls to change these values.
In Illumos we simply use mdb to modify the in-core value at runtime
(that's essentially what /etc/system does).
Post by Prakash Surya
Using the default settings these values translate to ~1 MB of hash
tables for each 1 GB of physical memory.
closer to the variable declarations just to try and ensure the comment
is updated if the default values are ever changed. But maybe I'm being
overly pedantic.
Then it would be further away from the general theory statement. I don't
know, I have no strong preference either way.
Post by Prakash Surya
Do you have any before/after values for the max chain length from your
testing? It should be easy(?) to grab it from the arc stats and would be
nice to have numbers to justify the change.
Yeah, just check the hash_ arcstats on your systems (kstat -n
arcstats | grep hash_).
Post by Prakash Surya
Looks good, overall. Thanks.
Cool, thanks!

Cheers,
--
Saso
Brian Behlendorf
2013-10-18 20:45:31 UTC
Permalink
Post by Saso Kiselkov
Post by Prakash Surya
Other than what Matt brought up, it would be nice to have a way to
override the default values without recompiling. But we can add that to
the Linux port as needed. I don't really know how you guys support that
sort of thing.
On the Linux port I think the guys add sysctl's to change these values.
In Illumos we simply use mdb and modify the in-core value at runtime
(that's essentially what /etc/system does).
Kernel module options are used under Linux. They can all be set at
module load time and some may be changed through the sysfs virtual
filesystem.
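For example, on the Linux port the ARC limit can be set either way; the value below is purely illustrative:

```shell
# Set at module load time:
modprobe zfs zfs_arc_max=1073741824

# Parameters exposed read-write can also be changed at runtime via sysfs:
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_max
```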

Additional comments:

235 * We want to allocate one hash lock for every 4GB of memory with a
236 * minimum of MIN_BUF_LOCKS.
237 */
238 uint64_t zfs_arc_ht_lock_shift = 32;
239 #define MIN_BUF_LOCKS 256

Have you considered allocating more hash locks by default? I've
suspected for a while now that this relatively small number of locks can
cause a noticeable performance penalty for highly parallel workloads.
Perhaps increase the minimum to 1024; we're only talking about a few
extra KB of space.

748 static inline uint64_t

The 'inline' keyword was added as part of C99. GNU C has supported it
as an extension forever so this isn't an issue for Linux, but it is
something to be aware of. Do you actually see any impact on performance
when passing this hint?

888 kmem_free(buf_hash_table.ht_locks, sizeof (struct ht_lock) *
889 buf_hash_table.ht_num_locks);
...
986 buf_hash_table.ht_locks = kmem_zalloc(sizeof (struct ht_lock) *
987 buf_hash_table.ht_num_locks, KM_SLEEP);

I'm all for keeping this code simple, but please use vmem_zalloc() /
vmem_free() so we can use this code unmodified on Linux. On 64-bit
systems there's plenty of virtual address space for this and since it's
a one time allocation we can afford the cost of setting up the mapping.

Aside from the above it looks good to me. It would be great if you could
propose an identical change for the dbuf hash table, which suffers from
the exact same issues.

Thanks,
Brian
Matthew Ahrens
2013-10-18 17:15:40 UTC
Permalink
Post by Prakash Surya
Other than what Matt brought up, it would be nice to have a way to
override the default values without recompiling.
Which default values in particular?

--matt



Prakash Surya
2013-10-18 17:44:14 UTC
Permalink
Well, the more I think about it, the less it worries me. Let's ignore
that statement for now.

We have users tweaking the ARC limits regularly, so I was concerned
about the case where the ARC max is set much lower than physical RAM.
But even if a user sets the ARC max to 5% of RAM, the table size is
only 2% the size of ARC max. Not bad, imo.
--
Cheers, Prakash
Post by Prakash Surya
Other than what Matt brought up, it would be nice to have a way to
override the default values without recompiling.
Which default values in particular?
--matt
Richard Elling
2013-10-19 01:07:30 UTC
Permalink
Post by Prakash Surya
Well, the more I think about it, the less it worries me. Let's ignore
that statement for now.
We have users tweaking the ARC limits regularly, so I was concerned
about the case where the ARC max is set much lower than physical RAM.
But even if a user sets the ARC max to 5% of RAM, the table size is
only 2% the size of ARC max. Not bad, imo.
Hi Prakash,
Can you share with us what ARC limits are being tuned regularly? I am
aware of arc_c_max on non-Solaris implementations. If there are others,
then perhaps we can come up with better self-tuning rules.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Prakash Surya
2013-10-21 17:22:13 UTC
Permalink
Post by Prakash Surya
Well, the more I think about it, the less it worries me. Let's ignore
that statement for now.
We have users tweaking the ARC limits regularly, so I was concerned
about the case where the ARC max is set much lower than physical RAM.
But even if a user sets the ARC max to 5% of RAM, the table size is
only 2% the size of ARC max. Not bad, imo.
Hi Prakash,
Can you share with us what ARC limits are being tuned regularly? I am
aware of arc_c_max on non-Solaris implementations. If there are others,
then perhaps we can come up with better self-tuning rules.
 -- richard
I was referring to whatever tunable sets the maximum ARC size, I believe
it's 'zfs_arc_max' but I'd have to double check to be sure. I know some
people set that much smaller than physical RAM because of their
workloads and the memory management issues on the Linux port.
--
Cheers, Prakash
Post by Prakash Surya
--
+1-760-896-4422
Richard Elling
2013-10-21 17:40:45 UTC
Permalink
Post by Prakash Surya
Post by Prakash Surya
Well, the more I think about it, the less it worries me. Let's ignore
that statement for now.
We have users tweaking the ARC limits regularly, so I was concerned
about the case where the ARC max is set much lower than physical RAM.
But even if a user sets the ARC max to 5% of RAM, the table size is
only 2% the size of ARC max. Not bad, imo.
Hi Prakash,
Can you share with us what ARC limits are being tuned regularly? I am
aware of arc_c_max on non-Solaris implementations. If there are others,
then perhaps we can come up with better self-tuning rules.
-- richard
I was referring to whatever tunable sets the maximum ARC size, I believe
it's 'zfs_arc_max' but I'd have to double check to be sure. I know some
people set that much smaller than physical RAM because of their
workloads and the memory management issues on the Linux port.
That is a trick question :-)
For illumos, the default setting override (for /etc/system) is zfs_arc_max.
When zfs is loaded, it will determine the arc_c_max from the system's
current configuration and then override the value if zfs_arc_max is set.
So when we talk about the limit, arc_c_max is the current value and
zfs_arc_max is the setting to force it at modload time.

NB, the value of arc_c_max is easily visible as the "c_max" kstat.
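Concretely, on illumos (the value in the /etc/system line is purely illustrative):

```shell
# Read the current effective limit, in bytes:
kstat -p zfs:0:arcstats:c_max

# Force it at module load time by adding a line like this to /etc/system:
#   set zfs:zfs_arc_max=0x40000000
```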
-- richard

Saso Kiselkov
2013-11-14 00:01:00 UTC
Permalink
Post by Saso Kiselkov
I've decided to resurrect my original persistent L2ARC webrev from about
a year ago and rework it with additional design advice kindly provided
by Matt Ahrens. Interested folks, please take a preliminary view of this
changeset and let's get the ball rolling on getting this integrated.
http://cr.illumos.org/~webrev/skiselkov/3525_simplified/
1) Makes the design simpler and (I hope) easier to understand. I've cut
it by ~400 lines and it's now following proper naming nomenclature,
plus I got rid of all the encoding/decoding functions.
2) More robust. We no longer "taste" the L2ARC device to see if there's
an uberblock there. Instead we store this attribute in the vdev
config (magic numbers are kept as fail safes, of course).
3) Faster rebuilds. As before, rebuilds are done in background threads
and in parallel on all available L2ARC devices, so it doesn't block
pool import, but thanks to some restructuring of the way we link the
metadata chain I was able to optimize this phase, so that we saturate
the devices much better (rebuilding ~100GB worth of L2ARC takes ~2
seconds on my crappy little system).
Please consider this code alpha for now, as it hasn't seen much exposure
in production yet. I'm currently beating the crap out of it on my backup
machine. Any volunteers willing to help with testing are very much
welcome, especially if you reboot your machine a lot, or have
shared-storage pools capable of simulating ungraceful takeovers. I have
reasonable confidence that it won't trash your data, so worst case is
you'll see a kernel panic (make sure you can capture that if need be).
Please do not deploy into production if you value uptime.
Has anybody had time to look over this webrev yet? I'm getting requests
from people to get this upstreamed.

Cheers,
--
Saso
Paul B. Henson
2013-11-15 00:24:41 UTC
Permalink
Sent: Wednesday, November 13, 2013 4:01 PM
Has anybody had time to look over this webrev yet? I'm getting requests
from people to get this upstreamed.
Unfortunately reviewing this code is out of my depth :(, but you can count
me among the people that would love to see it upstream ;). I do finally have
my storage box running though, so if you happen to have a binary compatible
with the latest omnios stable and are still interested in people pounding on
it, I could do that.
Saso Kiselkov
2013-11-15 16:50:42 UTC
Permalink
Post by Paul B. Henson
Sent: Wednesday, November 13, 2013 4:01 PM
Has anybody had time to look over this webrev yet? I'm getting requests
from people to get this upstreamed.
Unfortunately reviewing this code is out of my depth :(, but you can count
me among the people that would love to see it upstream ;). I do finally have
my storage box running though, so if you happen to have a binary compatible
with the latest omnios stable and are still interested in people pounding on
it, I could do that.
I appreciate your offer to test the code out - here's the latest copy
of both the 32- and 64-bit zfs modules:
http://37.153.99.61/zfs_l2arc_persist.tar.bz2
It should run fine on bloody (and perhaps even on stable, but I'm not
super-confident in that).

As per usual, test in a separate boot environment to make sure you won't
hose your main install:

# beadm create l2arc_persist
# zfs set mountpoint=/tmp/l2arc_persist rpool/ROOT/l2arc_persist
# zfs mount rpool/ROOT/l2arc_persist
# wget -O - http://37.153.99.61/zfs_l2arc_persist.tar.bz2 | tar xjf -
# cp zfs_l2arc_persist/obj32/zfs /tmp/l2arc_persist/kernel/fs/zfs
# cp zfs_l2arc_persist/obj64/zfs /tmp/l2arc_persist/kernel/fs/amd64/zfs
# bootadm update-archive -R /tmp/l2arc_persist
# umount /tmp/l2arc_persist
# reboot -fe l2arc_persist

If you plan on capturing crash dumps with this new change running (and
I'd appreciate it if you can), make sure your dump device is set to a
physical partition or device, NOT a zvol, since a crash in the ZFS
module precludes any dump to a zvol. That having been said, I don't
expect any crashes (haven't had any myself).

Cheers,
--
Saso
Paul B. Henson
2013-11-15 19:42:44 UTC
Permalink
Sent: Friday, November 15, 2013 8:51 AM
It should run fine on bloody (and perhaps even on stable, but I'm not
super-confident in that).
Hmm, if it's not intended for stable, it's probably not a good test to try
and use it there. If it blows up or does weird things, there'd be no way to
tell if it was an underlying issue or simply an incompatibility with the
older code base. The box isn't quite in production yet, I'll update the new
boot environment to bloody, pound on it for a week or so and then roll back
to my stable. Although, the next stable should be out any day now, so if I
get lucky I'll just be able to set my publisher back to stable in the new BE
and not have to rollback.
I'd appreciate it if you can), make sure your dump device is set to a
physical partition or device, NOT a zvol, since a crash in the ZFS
module precludes any dump to a zvol. That having been said, I don't
expect any crashes (haven't had any myself).
My dump device is currently a raw disk partition. Any thoughts on a good
test methodology? Also, if I recall correctly I need to remove and then
re-add the cache devices in order to enable persistency?
Saso Kiselkov
2013-11-15 22:52:08 UTC
Permalink
Post by Paul B. Henson
Sent: Friday, November 15, 2013 8:51 AM
It should run fine on bloody (and perhaps even on stable, but I'm not
super-confident in that).
Hmm, if it's not intended for stable, it's probably not a good test to try
and use it there. If it blows up or does weird things, there'd be no way to
tell if it was an underlying issue or simply an incompatibility with the
older code base. The box isn't quite in production yet, I'll update the new
boot environment to bloody, pound on it for a week or so and then roll back
to my stable. Although, the next stable should be out any day now, so if I
get lucky I'll just be able to set my publisher back to stable in the new BE
and not have to rollback.
The "not intended for stable" has more to do with interoperability with
libzfs.so than anything else. The interconnection there is a little
delicate. If you do encounter userland troubles (e.g. zfs(1M) not
working properly), give me a ping, I'll build an up-to-date libzfs.so
for you.
Post by Paul B. Henson
I'd appreciate it if you can), make sure your dump device is set to a
physical partition or device, NOT a zvol, since a crash in the ZFS
module precludes any dump to a zvol. That having been said, I don't
expect any crashes (haven't had any myself).
My dump device is currently a raw disk partition. Any thoughts on a good
test methodology? Also, if I recall correctly I need to remove and then
re-add the cache devices in order to enable persistency?
Just use it as normal and do the occasional reboot, watching for correct
rebuild (i.e. most of your L2 cache contents should get recovered). No
need to re-add your device, it should get picked up automatically.

Cheers,
--
Saso
Paul B. Henson
2013-11-16 02:29:36 UTC
Permalink
Post by Saso Kiselkov
The "not intended for stable" has more to do with interoperability with
Hmm, I updated to the current bloody and copied in the new module:

SunOS Release 5.11 Version omnios-e97fba8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights
reserved.
/kernel/fs/amd64/zfs: undefined symbol 'vnevent_truncate'
WARNING: mod_load: cannot load module 'zfs'

No joy :(...
Saso Kiselkov
2013-11-16 20:25:20 UTC
Permalink
Post by Paul B. Henson
Post by Saso Kiselkov
The "not intended for stable" has more to do with interoperability with
SunOS Release 5.11 Version omnios-e97fba8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights
reserved.
/kernel/fs/amd64/zfs: undefined symbol 'vnevent_truncate'
WARNING: mod_load: cannot load module 'zfs'
No joy :(...
This was introduced by a recent (10 days ago) push-back of #3968 by
Joyent (commit 72102e74), which introduced that symbol. I rebased on the
commit before that and redid the build:

http://37.153.99.61/zfs_l2arc_persist_pre_vnevent_truncate.tar.bz2

Do NOT load this into a kernel which has that commit, hilarity might ensue.

Cheers,
--
Saso
Saso Kiselkov
2013-11-21 02:23:42 UTC
Permalink
Post by Paul B. Henson
Post by Saso Kiselkov
The "not intended for stable" has more to do with interoperability with
SunOS Release 5.11 Version omnios-e97fba8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights
reserved.
/kernel/fs/amd64/zfs: undefined symbol 'vnevent_truncate'
WARNING: mod_load: cannot load module 'zfs'
No joy :(...
Hey,

Just wanted to let people who might be testing my code know that, with
the help of Gernot Straßer, I've discovered a bug in the way we handle
L2ARC metadata overhead when we get to the end of the L2ARC device -
occasionally this can trip an assertion. I've fixed it by correctly
accounting for the metadata overhead (hopefully) in all places where
it's relevant (those who are interested, see l2arc_log_overhead and
calls to it) and updated the webrev:

http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

I've also respun new OmniOS-compatible kernel modules at:

http://37.153.99.61/zfs_l2arc_persist_pre_vnevent_truncate.tar.bz2

Any reviews greatly appreciated.

Cheers,
--
Saso
Matthew Ahrens
2013-12-04 23:25:14 UTC
Permalink
Saso, thanks for your continued work to clean this up. Here are some more
comments:

- terminology: let's use similar terms as with the intent log: there are
not several logs on each l2arc device. There is one log composed of a
linked list of log blocks. (So, s/log/log block/ in the code/comments)

- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)

- L2ARC_CHECK_REBUILD_TIMEOUT - unused macro

- l2arc_rebuild_timeout: comment implies that rebuild happens before
import completes; i thought it rebuilt in background?

- calling it an uberblock is confusing (e.g. line 4415); maybe
l2arc_dev_hdr_t?

- reconstruct time is not that critical. We could simplify the code and
on-disk format, and make it easier to convince ourselves that the code is
correct, if we remove the prefetch & double pointer stuff from reconstruct
code. It should be sufficient to issue one i/o at a time (rather than 2),
perhaps issuing the i/o for the next zio while creating l2arc_hdr_t’s for
this block's entries.

- how can there be infinite loops of logs? validity should be determined
by not straddling invalid range (between end and begin)

- Consider storing the “uberblock” in the MOS - then don’t have to worry
about atomic overwrite, self-checksum, magic number. Update the in-memory
version as we do now, and just write it into the MOS every txg (if it's
dirty).
Post by Saso Kiselkov
I've decided to resurrect my original persistent L2ARC webrev from about
a year ago and rework it with additional design advice kindly provided
by Matt Ahrens. Interested folks, please take a preliminary view of this
changeset and let's get the ball rolling on getting this integrated.
http://cr.illumos.org/~webrev/skiselkov/3525_simplified/
1) Makes the design simpler and (I hope) easier to understand. I've cut
it by ~400 lines and it's now following proper naming nomenclature,
plus I got rid of all the encoding/decoding functions.
2) More robust. We no longer "taste" the L2ARC device to see if there's
an uberblock there. Instead we store this attribute in the vdev
config (magic numbers are kept as fail safes, of course).
3) Faster rebuilds. As before, rebuilds are done in background threads
and in parallel on all available L2ARC devices, so it doesn't block
pool import, but thanks to some restructuring of the way we link the
metadata chain I was able to optimize this phase, so that we saturate
the devices much better (rebuilding ~100GB worth of L2ARC takes ~2
seconds on my crappy little system).
Please consider this code alpha for now, as it hasn't seen much exposure
in production yet. I'm currently beating the crap out of it on my backup
machine. Any volunteers willing to help with testing are very much
welcome, especially if you reboot your machine a lot, or have
shared-storage pools capable of simulating ungraceful takeovers. I have
reasonable confidence that it won't trash your data, so worst case is
you'll see a kernel panic (make sure you can capture that if need be).
Please do not deploy into production if you value uptime.
Cheers,
--
Saso
Saso Kiselkov
2013-12-05 02:58:13 UTC
Permalink
Post by Matthew Ahrens
Saso, thanks for your continued work to clean this up. Here are some
Hi Matt,
Post by Matthew Ahrens
- terminology: let's use similar terms as with the intent log: there
are not several logs on each l2arc device. There is one log composed of
a linked list of log blocks. (So, s/log/log block/ in the code/comments)
Ok, will build this in.
Post by Matthew Ahrens
- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)
I'm not sure I understand. Do you mean that we should *not* rebuild the
buffer flags when rebuilding buffers? While it is possible that we can
infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
can't say for certain what the right values for ARC_L2COMPRESS and
ARC_PREFETCH should be (as these relate to the state the pool was in
when the buffer got into the ARC in the first place).
Post by Matthew Ahrens
- L2ARC_CHECK_REBUILD_TIMEOUT - unused macro
Oops, a bit of obsolete code that I forgot to cut out after the first
round of refactoring.
Post by Matthew Ahrens
- l2arc_rebuild_timeout: comment implies that rebuild happens before
import completes; i thought it rebuilt in background?
While rebuild does happen in the background, it is quite a heavy
operation that can make the system rather busy, so I wanted to make sure
it is bounded. I've coded a non-trivial amount of real-time code in the
past and it's just become kind of a reflex for me.
Post by Matthew Ahrens
- calling it an uberblock is confusing (e.g. line 4415); maybe
l2arc_dev_hdr_t?
Ok.
Post by Matthew Ahrens
- reconstruct time is not that critical. We could simplify the code and
on-disk format, and make it easier to convince ourselves that the code
is correct, if we remove the prefetch & double pointer stuff from
reconstruct code. It should be sufficient to issue one i/o at a time
(rather than 2), perhaps issuing the i/o for the next zio while creating
l2arc_hdr_t’s for this block's entries.
Unfortunately, I did experiments with that and rebuild does suffer quite
a bit (by about an order of magnitude slower) unless you get the
prefetching *just* right (i.e. keep the device busy all the time).
Reconstructing one log block's buffers in memory takes a minuscule
amount of time when compared to access latencies even on SSDs, so
without prefetching, the rebuild will be mostly read latency limited. My
tests seemed to indicate that an average-size L2ARC rebuild (with the
prefetching that's in the webrev) should take on the order of 10-20
seconds. Without that, you're looking at probably several minutes' worth
of churn on the L2ARC device.
Post by Matthew Ahrens
- how can there be infinite loops of logs? validity should be
determined by not straddling invalid range (between end and begin)
Aside from obvious manually fabricated infinite log block loops to
effect a hang or DoS, buffers can loop due to the design. The main reason is
that log blocks point backwards, while our writing goes forwards. We
thus have no way to go back and amend "outdated" buffers. So if we get
seriously unlucky, we could, potentially, write a log buffer at exactly
the sector offset pointed to by an ancient previously written log block,
which would then create the loop. It's extremely unlikely, but it's a
possibility.
Post by Matthew Ahrens
- Consider storing the “uberblock” in the MOS - then don’t have to
worry about atomic overwrite, self-checksum, magic number. Update the
in-memory version as we do now, and just write it into the MOS every txg
(if it's dirty).
While it would remove the aspects you mention, I suspect it wouldn't
really simplify the code. First, we'd need to dump a bunch of callbacks
in txg commit places to make sure we check for changes. Next, we'd need
to construct an even more complicated data structure in the MOS, simply
by virtue of having the possibility of having multiple L2ARC devices.
Finally, I'd have to hook into places where "zpool add" and "zpool
remove" add L2ARC vdevs. While none of these are particularly hard, I'm
not convinced it'd be worthwhile simply to replace two 50-odd-line
functions. We already store a bit of metadata in the vdev config tree
that tells us whether we're supposed to be looking for an uberblock or not.

Anyway, gonna give it some thought and try to estimate the complexity -
I might be completely off here.

Appreciate you taking the time to review and looking forward to your
responses.

Cheers,
--
Saso
Matthew Ahrens
2013-12-05 22:24:04 UTC
Permalink
Post by Saso Kiselkov
Post by Matthew Ahrens
Saso, thanks for your continued work to clean this up. Here are some
Hi Matt,
Post by Matthew Ahrens
- terminology: let's use similar terms as with the intent log: there
are not several logs on each l2arc device. There is one log composed of
a linked list of log blocks. (So, s/log/log block/ in the code/comments)
Ok, will build this in.
Post by Matthew Ahrens
- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)
I'm not sure I understand. Do you mean that we should *not* rebuild the
buffer flags when rebuilding buffers? While it is possible that we can
infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
can't say with certainty what the right value for ARC_L2COMPRESS and
ARC_PREFETCH should be (as these relate to the state the pool was in
when the buffer got into ARC in the first place).
I'm suggesting that we rebuild the flags, but I'm not sure the benefit of
storing these flags on disk. It seems like ARC_IN_HASH_TABLE and
ARC_L2CACHE should always be set, so don't need to be persisted. Why
should ARC_PREFETCH ever be set when we reconstruct? Can't we set
ARC_L2COMPRESS based on LE_GET_COMPRESS()?
Post by Saso Kiselkov
Post by Matthew Ahrens
- L2ARC_CHECK_REBUILD_TIMEOUT - unused macro
Oops, a bit of obsolete code that I forgot to cut out after the first
round of refactoring.
Post by Matthew Ahrens
- l2arc_rebuild_timeout: comment implies that rebuild happens before
import completes; i thought it rebuilt in background?
While rebuild does happen in the background, it is quite a heavy
operation that can make the system rather busy, so I wanted to make sure
it is bounded. I've coded a non-trivial amount of real-time code in the
past and it's just become kind of a reflex for me.
Thankfully this is not realtime code :-)
Post by Saso Kiselkov
Post by Matthew Ahrens
- calling it an uberblock is confusing (e.g. line 4415); maybe
l2arc_dev_hdr_t?
Ok.
Post by Matthew Ahrens
- reconstruct time is not that critical. We could simplify the code and
on-disk format, and make it easier to convince ourselves that the code
is correct, if we remove the prefetch & double pointer stuff from
reconstruct code. It should be sufficient to issue one i/o at a time
(rather than 2), perhaps issuing the i/o for the next zio while creating
l2arc_hdr_t’s for this block's entries.
Unfortunately, I did experiments with that and rebuild does suffer quite
a bit (by about an order of magnitude slower) unless you get the
prefetching *just* right (i.e. keep the device busy all the time).
Reconstructing one log block's buffers in memory takes a minuscule
amount of time when compared to access latencies even on SSDs, so
without prefetching, the rebuild will be mostly read latency limited. My
tests seemed to indicate that an average-size L2ARC rebuild (with the
prefetching that's in the webrev) should take on the order of 10-20
seconds. Without that, you're looking at probably several minutes' worth
of churn on the L2ARC device.
I must not be understanding the code then (which is why I would prefer it
to be simpler). It looks like there are at most 2 concurrent i/os going
on, one for each of the 2 linked lists. So how can it go more than 2x the
performance of a single linked list with 1 concurrent i/o?
Post by Saso Kiselkov
Post by Matthew Ahrens
- how can there be infinite loops of logs? validity should be
determined by not straddling invalid range (between end and begin)
Aside from obvious manually fabricated infinite log block loops to
effect a hang or DoS, buffers can loop due to the design. The main reason is
that log blocks point backwards, while our writing goes forwards. We
thus have no way to go back and amend "outdated" buffers. So if we get
seriously unlucky, we could, potentially, write a log buffer at exactly
the sector offset pointed to by an ancient previously written log block,
which would then create the loop. It's extremely unlikely, but it's a
possibility.
But can't you catch that by validating that the pointer does not cross the
invalid range of the device (l2u_start_lps[0].l2lp_daddr and
l2u_evict_tail)?
Post by Saso Kiselkov
Post by Matthew Ahrens
- Consider storing the “uberblock” in the MOS - then don’t have to
worry about atomic overwrite, self-checksum, magic number. Update the
in-memory version as we do now, and just write it into the MOS every txg
(if it's dirty).
While it would remove the aspects you mention, I suspect it wouldn't
really simplify the code. First, we'd need to dump a bunch of callbacks
in txg commit places to make sure we check for changes. Next, we'd need
to construct an even more complicated data structure in the MOS, simply
by virtue of having the possibility of having multiple L2ARC devices.
Finally, I'd have to hook into places where "zpool add" and "zpool
remove" add L2ARC vdevs. While none of these are particularly hard, I'm
not convinced it'd be worthwhile simply to replace two 50-odd-line
functions. We already store a bit of metadata in the vdev config tree
that tells us whether we're supposed to be looking for an uberblock or not.
Anyway, gonna give it some thought and try to estimate the complexity -
I might be completely off here.
You might be right. Also it may be that keeping it in the MOS would
increase complexity because then the l2uberblock (aka l2arc_dev_hdr_t) would
not reflect new writes that had happened, which potentially could overwrite
data that the MOS's uberblock thinks are valid.
Post by Saso Kiselkov
Appreciate you taking the time to review and looking forward to your
responses.
Cheers,
--
Saso
Saso Kiselkov
2013-12-09 15:25:01 UTC
Permalink
Post by Saso Kiselkov
Post by Matthew Ahrens
Saso, thanks for your continued work to clean this up. Here are some
Hi Matt,
Post by Matthew Ahrens
- terminology: let's use similar terms as with the intent log: there
are not several logs on each l2arc device. There is one log
composed of
Post by Matthew Ahrens
a linked list of log blocks. (So, s/log/log block/ in the
code/comments)
Ok, will build this in.
Post by Matthew Ahrens
- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)
I'm not sure I understand. Do you mean that we should *not* rebuild the
buffer flags when rebuilding buffers? While it is possible that we can
infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
can't say with certainty what the right value for ARC_L2COMPRESS and
ARC_PREFETCH should be (as these relate to the state the pool was in
when the buffer got into ARC in the first place).
I'm suggesting that we rebuild the flags, but I'm not sure the benefit
of storing these flags on disk. It seems like ARC_IN_HASH_TABLE and
ARC_L2CACHE should always be set, so don't need to be persisted. Why
should ARC_PREFETCH ever be set when we reconstruct? Can't we set
ARC_L2COMPRESS based on LE_GET_COMPRESS()?
Reworked, now gotten rid of this.
Post by Saso Kiselkov
Thankfully this is not realtime code :-)
I'd contend that all code is real-time code, but that the importance of
its real-time qualities is dependent on what it's being used for. One of
the main reasons I got into ZFS development was to try and understand
how to improve ZFS for video streaming uses (which is inherently real-time).

However, I don't feel particularly strongly about this functionality. If
you think we can safely lose it, I'll defer to your judgment.
Post by Saso Kiselkov
Unfortunately, I did experiments with that and rebuild does suffer quite
a bit (by about an order of magnitude slower) unless you get the
prefetching *just* right (i.e. keep the device busy all the time).
Reconstructing one log block's buffers in memory takes a minuscule
amount of time when compared to access latencies even on SSDs, so
without prefetching, the rebuild will be mostly read latency limited. My
tests seemed to indicate that an average-size L2ARC rebuild (with the
prefetching that's in the webrev) should take on the order of 10-20
seconds. Without that, you're looking at probably several minutes' worth
of churn on the L2ARC device.
I must not be understanding the code then (which is why I would prefer
it to be simpler). It looks like there are at most 2 concurrent i/os
going on, one for each of the 2 linked lists. So how can it go more
than 2x the performance of a single linked list with 1 concurrent i/o?
The point is to minimize the hole that's created when one read finishes
and only then we start issuing the next one - at which point we have to
incur another latency hit of passing through the transports to the L2ARC
device. Using this method we always keep at least 1-2 IOs queued for the
device, instead of 0-1 IOs.

A picture is worth a thousand words:

+---> (t)
CPU | /run\ /run\ /run\
xfer | /xfer/ \xfer\ /xfer/ \xfer\ /xfer/ \
SSD |read/ \read/ \read/


+---> (t)
CPU | /run\/run\ /run\/run\ /run\/
xfer | /xfer/xfer/\xfer\xfer\ /xfer/xfer/\xfer\xfer\ /xfer/xfer/\
SSD |read/read/ \read/read/ \read/read/

Now the exact proportions here will dictate how well we are going to
saturate the device, and SSDs typically prefer queues deeper than 1, but
it's better than a straight start-stop walk of the device (which is
going to be much slower than the device's maximum transfer rate).

I'll try to conduct quantitative tests to make sure I'm not talking out
of my ass here.
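The overlap in the diagram above can be put into a toy timing model. This is not ZFS code, and the latency/CPU figures are illustrative assumptions, not measurements; it just shows the shape of the argument: with the next read kept in flight while the CPU rebuilds the current block's headers, the per-block cost drops from READ + CPU toward max(READ, CPU).

```python
# Toy timing model of the rebuild walk (illustrative assumptions only).

READ_MS = 0.50   # assumed device latency to read one log block
CPU_MS = 0.05    # assumed CPU time to rebuild one block's ARC headers

def serial_walk_ms(nblocks):
    # start-stop walk: read a block, process it, only then issue the next read
    return nblocks * (READ_MS + CPU_MS)

def pipelined_walk_ms(nblocks):
    # next read is issued while the current block is processed, so the CPU
    # time is hidden behind the read whenever READ_MS >= CPU_MS
    return READ_MS + (nblocks - 1) * max(READ_MS, CPU_MS) + CPU_MS

n = 100_000
print(serial_walk_ms(n) / pipelined_walk_ms(n))  # pipeline speedup, always < 2
```

Under this model the pipelining alone buys strictly less than 2x; any larger observed gain would have to come from the device servicing queued reads in parallel or from avoiding start-stop behavior in its firmware.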
Post by Saso Kiselkov
Post by Matthew Ahrens
- how can there be infinite loops of logs? validity should be
determined by not straddling invalid range (between end and begin)
Aside from obvious manually fabricated infinite log block loops to
effect a hang or DoS, buffers can loop due to the design. The main reason is
that log blocks point backwards, while our writing goes forwards. We
thus have no way to go back and amend "outdated" buffers. So if we get
seriously unlucky, we could, potentially, write a log buffer at exactly
the sector offset pointed to by an ancient previously written log block,
which would then create the loop. It's extremely unlikely, but it's a
possibility.
But can't you catch that by validating that the pointer does not cross
the invalid range of the device (l2u_start_lps[0].l2lp_daddr and
l2u_evict_tail)?
Point taken, I've removed this from l2arc_log_blk_read() and instead
strengthened l2arc_log_blk_ptr_valid() to check for log block pointers
into the evicted region (between l2ad_hand and l2ad_evict).
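For illustration, the strengthened check could look something like the following hypothetical Python sketch (the webrev's actual l2arc_log_blk_ptr_valid() is C, and the field semantics here are my assumptions from the discussion: l2ad_hand is the current write position, l2ad_evict marks how far ahead of it eviction has progressed, both wrapping around the device).

```python
# Hypothetical sketch of the range check described above: refuse to follow
# a log block pointer that lands in the region already reclaimed for new
# writes, which also breaks any accidental pointer loop. Not the webrev's
# code; field names and semantics are assumptions.

def log_blk_ptr_valid(daddr, dev):
    start, end = dev["l2ad_start"], dev["l2ad_end"]
    hand, evict = dev["l2ad_hand"], dev["l2ad_evict"]
    # pointer must fall inside the device's usable address range
    if not (start <= daddr < end):
        return False
    # the [hand, evict) region holds overwritten/invalidated blocks; a
    # pointer into it means the chain has wrapped onto stale data, so the
    # rebuild walk stops here instead of looping
    if hand <= evict:
        in_evicted = hand <= daddr < evict
    else:  # evicted region wraps around the end of the device
        in_evicted = daddr >= hand or daddr < evict
    return not in_evicted
```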

Updated webrev with all of the changes:
http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

Cheers,
--
Saso
Matthew Ahrens
2013-12-09 18:54:39 UTC
Permalink
Post by Saso Kiselkov
Post by Matthew Ahrens
Saso, thanks for your continued work to clean this up. Here are some
Hi Matt,
Post by Matthew Ahrens
- terminology: let's use similar terms as with the intent log: there
are not several logs on each l2arc device. There is one log composed of
a linked list of log blocks. (So, s/log/log block/ in the code/comments)
Ok, will build this in.
Post by Matthew Ahrens
- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)
I'm not sure I understand. Do you mean that we should *not* rebuild the
buffer flags when rebuilding buffers? While it is possible that we can
infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
can't say with certainty what the right value for ARC_L2COMPRESS and
ARC_PREFETCH should be (as these relate to the state the pool was in
when the buffer got into ARC in the first place).
I'm suggesting that we rebuild the flags, but I'm not sure the benefit
of storing these flags on disk. It seems like ARC_IN_HASH_TABLE and
ARC_L2CACHE should always be set, so don't need to be persisted. Why
should ARC_PREFETCH ever be set when we reconstruct? Can't we set
ARC_L2COMPRESS based on LE_GET_COMPRESS()?
Reworked, now gotten rid of this.
Post by Saso Kiselkov
Thankfully this is not realtime code :-)
I'd contend that all code is real-time code, but that the importance of
its real-time qualities is dependent on what it's being used for. One of
the main reasons I got into ZFS development was to try and understand
how to improve ZFS for video streaming uses (which is inherently real-time).
However, I don't feel particularly strongly about this functionality. If
you think we can safely lose it, I'll defer to your judgment.
Post by Saso Kiselkov
Unfortunately, I did experiments with that and rebuild does suffer quite
a bit (by about an order of magnitude slower) unless you get the
prefetching *just* right (i.e. keep the device busy all the time).
Reconstructing one log block's buffers in memory takes a minuscule
amount of time when compared to access latencies even on SSDs, so
without prefetching, the rebuild will be mostly read latency limited. My
tests seemed to indicate that an average-size L2ARC rebuild (with the
prefetching that's in the webrev) should take on the order of 10-20
seconds. Without that, you're looking at probably several minutes' worth
of churn on the L2ARC device.
I must not be understanding the code then (which is why I would prefer
it to be simpler). It looks like there are at most 2 concurrent i/os
going on, one for each of the 2 linked lists. So how can it go more
than 2x the performance of a single linked list with 1 concurrent i/o?
The point is to minimize the hole that's created when one read finishes
and only then we start issuing the next one - at which point we have to
incur another latency hit of passing through the transports to the L2ARC
device. Using this method we always keep at least 1-2 IOs queued for the
device, instead of 0-1 IOs.
+---> (t)
CPU | /run\ /run\ /run\
xfer | /xfer/ \xfer\ /xfer/ \xfer\ /xfer/ \
SSD |read/ \read/ \read/
+---> (t)
CPU | /run\/run\ /run\/run\ /run\/
xfer | /xfer/xfer/\xfer\xfer\ /xfer/xfer/\xfer\xfer\ /xfer/xfer/\
SSD |read/read/ \read/read/ \read/read/
Now the exact proportions here will dictate how well we are going to
saturate the device, and SSDs typically prefer queues deeper than 1, but
it's better than a straight start-stop walk of the device (which is
going to be much slower than the device's maximum transfer rate).
Right, but how could this result in >2x the performance? As indicated by
your diagram, you are doing at most 2 reads at once (or, you are getting at
most 1 read "for free" while the CPU is busy processing the last block).
You claimed a 10-20x speedup (I am assuming that "several" means 3).

--matt
Post by Matthew Ahrens
I'll try to conduct quantitative tests to make sure I'm not talking out
of my ass here.
Post by Saso Kiselkov
Post by Matthew Ahrens
- how can there be infinite loops of logs? validity should be
determined by not straddling invalid range (between end and begin)
Aside from obvious manually fabricated infinite log block loops to
effect a hang or DoS, buffers can loop due to the design. The main reason is
that log blocks point backwards, while our writing goes forwards. We
thus have no way to go back and amend "outdated" buffers. So if we get
seriously unlucky, we could, potentially, write a log buffer at exactly
the sector offset pointed to by an ancient previously written log block,
which would then create the loop. It's extremely unlikely, but it's a
possibility.
But can't you catch that by validating that the pointer does not cross
the invalid range of the device (l2u_start_lps[0].l2lp_daddr and
l2u_evict_tail)?
Point taken, I've removed this from l2arc_log_blk_read() and instead
strengthened l2arc_log_blk_ptr_valid() to check for log block pointers
into the evicted region (between l2ad_hand and l2ad_evict).
http://cr.illumos.org/~webrev/skiselkov/3525_simplified/
Cheers,
--
Saso
Saso Kiselkov
2013-12-09 18:57:09 UTC
Permalink
Post by Matthew Ahrens
On Wed, Dec 4, 2013 at 6:58 PM, Saso Kiselkov
Post by Matthew Ahrens
Saso, thanks for your continued work to clean this up. Here are some
Hi Matt,
Post by Matthew Ahrens
- terminology: let's use similar terms as with the intent log: there
are not several logs on each l2arc device. There is one log composed of
a linked list of log blocks. (So, s/log/log block/ in the code/comments)
Ok, will build this in.
Post by Matthew Ahrens
- why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
aren't they all in l2arc)
I'm not sure I understand. Do you mean that we should *not* rebuild the
buffer flags when rebuilding buffers? While it is possible that we can
infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
can't say with certainty what the right value for ARC_L2COMPRESS and
ARC_PREFETCH should be (as these relate to the state the pool was in
when the buffer got into ARC in the first place).
I'm suggesting that we rebuild the flags, but I'm not sure the benefit
of storing these flags on disk. It seems like ARC_IN_HASH_TABLE and
ARC_L2CACHE should always be set, so don't need to be persisted. Why
should ARC_PREFETCH ever be set when we reconstruct? Can't we set
ARC_L2COMPRESS based on LE_GET_COMPRESS()?
Reworked, now gotten rid of this.
Thankfully this is not realtime code :-)
I'd contend that all code is real-time code, but that the importance of
its real-time qualities is dependent on what it's being used for. One of
the main reasons I got into ZFS development was to try and understand
how to improve ZFS for video streaming uses (which is inherently real-time).
However, I don't feel particularly strongly about this functionality. If
you think we can safely lose it, I'll defer to your judgment.
Unfortunately, I did experiments with that and rebuild does suffer quite
a bit (by about an order of magnitude slower) unless you get the
prefetching *just* right (i.e. keep the device busy all the time).
Reconstructing one log block's buffers in memory takes a minuscule
amount of time when compared to access latencies even on SSDs, so
without prefetching, the rebuild will be mostly read latency limited. My
tests seemed to indicate that an average-size L2ARC rebuild (with the
prefetching that's in the webrev) should take on the order of 10-20
seconds. Without that, you're looking at probably several minutes' worth
of churn on the L2ARC device.
I must not be understanding the code then (which is why I would prefer
it to be simpler). It looks like there are at most 2 concurrent i/os
going on, one for each of the 2 linked lists. So how can it go more
than 2x the performance of a single linked list with 1 concurrent i/o?
The point is to minimize the hole that's created when one read finishes
and only then we start issuing the next one - at which point we have to
incur another latency hit of passing through the transports to the L2ARC
device. Using this method we always keep at least 1-2 IOs queued for the
device, instead of 0-1 IOs.
+---> (t)
CPU | /run\ /run\ /run\
xfer | /xfer/ \xfer\ /xfer/ \xfer\ /xfer/ \
SSD |read/ \read/ \read/
+---> (t)
CPU | /run\/run\ /run\/run\ /run\/
xfer | /xfer/xfer/\xfer\xfer\ /xfer/xfer/\xfer\xfer\ /xfer/xfer/\
SSD |read/read/ \read/read/ \read/read/
Now the exact proportions here will dictate how well we are going to
saturate the device, and SSDs typically prefer queues deeper than 1, but
it's better than a straight start-stop walk of the device (which is
going to be much slower than the device's maximum transfer rate).
Right, but how could this result in >2x the performance? As indicated
by your diagram, you are doing at most 2 reads at once (or, you are
getting at most 1 read "for free" while the CPU is busy processing the
last block). You claimed a 10-20x speedup (I am assuming that "several"
means 3).
As I said, I'm gonna have to recheck, it's possible I might be
remembering stuff incorrectly. However, currently my possibilities for
performance testing are somewhat limited. I'll get back to you as soon
as I have more info.

Cheers,
--
Saso
Jim Klimov
2013-12-09 20:30:22 UTC
Permalink
Post by Saso Kiselkov
Post by Matthew Ahrens
Right, but how could this result in >2x the performance? As indicated
by your diagram, you are doing at most 2 reads at once (or, you are
getting at most 1 read "for free" while the CPU is busy processing the
last block). You claimed a 10-20x speedup (I am assuming that "several"
means 3).
As I said, I'm gonna have to recheck, it's possible I might be
remembering stuff incorrectly. However, currently my possibilities for
performance testing are somewhat limited. I'll get back to you as soon
as I have more info.
Speaking from the theoretical peanut gallery, the SSDs' modern
speediness is due to their number of NAND chips, each doing some
20MB/s or so (figures non-authoritative). So if multiple queued
tasks land on different chips - and there are dozens of them now -
it may indeed be much faster than single requests to a single chip.
The device can indeed issue operations to the chips in parallel
and return answers as they come into its buffers. Maybe so...

HTH,
//Jim
Matthew Ahrens
2013-12-09 21:03:43 UTC
Permalink
Post by Jim Klimov
Post by Matthew Ahrens
Right, but how could this result in >2x the performance? As indicated
Post by Matthew Ahrens
by your diagram, you are doing at most 2 reads at once (or, you are
getting at most 1 read "for free" while the CPU is busy processing the
last block). You claimed a 10-20x speedup (I am assuming that "several"
means 3).
As I said, I'm gonna have to recheck, it's possible I might be
remembering stuff incorrectly. However, currently my possibilities for
performance testing are somewhat limited. I'll get back to you as soon
as I have more info.
Speaking from the theoretical peanut gallery, the SSDs' modern
speediness is due to their amount of NAND chips, each doing some
20MB/s or so (figures non-authoritative). So if multiple queued
tasks land on different chips - and there are dozens of them now -
it may indeed be much faster than single-chip single-requests.
The device can indeed issue operations to the chips in parallel
and return answers as they come into its buffers. Maybe so...
Yes, if we issue N concurrent i/os, it could be up to N times faster than
if we issue one i/o. In this case, N=2.
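The bound Matt states can be written down as trivial arithmetic (the latency figure below is an illustrative assumption): with N reads kept in flight against a device whose per-read latency is fixed, Little's law caps throughput at N times the depth-1 throughput, so a 2-deep rebuild pipeline tops out at 2x.

```python
# Back-of-envelope model of the queue-depth bound; the latency figure is
# an illustrative assumption, not a measurement.

LATENCY_S = 0.0005  # assumed 0.5 ms per log-block read

def reads_per_sec(queue_depth):
    # queue_depth requests in flight, each completing after LATENCY_S
    return queue_depth / LATENCY_S

speedup = reads_per_sec(2) / reads_per_sec(1)
print(speedup)  # 2.0
```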

--matt



