Discussion:
spacemap/metaslab work
Christopher Siden
2013-09-04 03:44:02 UTC
http://cr.illumos.org/~webrev/csiden/illumos-sm/

This is a metaslab/spacemap refactoring by George Wilson. See the full
list of bugs in the webrev for details. The highlights are:

- The spacemap code has been refactored into separate data structures
for the on-disk representation (called spacemaps in the new code) and
the in-memory representation (called rangetrees in the new code). This
should aid in understanding the code (a rough sketch of the split
appears after this list).

- Metaslabs are preloaded asynchronously to avoid blocking on reads in
hot code paths when we discover we need to load new metaslabs. This has
shown a measurable improvement in our performance testing.

- There is a new spacemap_histogram on-disk feature flag. When it is
enabled, spacemaps store additional data about the amount of contiguous
free space in each metaslab. The current disk format stores only the
total amount of free space, which means heavily fragmented metaslabs can
look appealing and get read off disk even though they don't have enough
contiguous free space to satisfy large allocations, leading us to load
the same fragmented space maps over and over again. The allocation
algorithm that uses this information is disabled by default and can be
enabled via a tunable. It will become the default allocator once George
is satisfied with the amount of performance testing it has received. We
have extensively tested this code both with the tunable enabled and
disabled. It is not possible to separate out spacemap_histogram from
the rest of this refactoring because the two were done together.
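
As a rough sketch of the split described in the first bullet (the names
and fields below are illustrative stand-ins, not the definitions from
the webrev): the on-disk spacemap is essentially a summary plus an
append-only log of alloc/free records, while the in-memory rangetree
holds the current free segments, and loading a metaslab amounts to
replaying the log into the tree.

#include <stdint.h>

/* On-disk "spacemap" (sketch): summary of an append-only alloc/free log. */
typedef struct spacemap_sketch {
        uint64_t sm_object;   /* object that holds the log entries */
        uint64_t sm_start;    /* first offset covered by this map */
        uint64_t sm_size;     /* size of the region covered */
        uint64_t sm_alloc;    /* net allocated bytes (the only summary today) */
} spacemap_sketch_t;

/* In-memory "rangetree" (sketch): one node per contiguous free segment. */
typedef struct rangeseg_sketch {
        uint64_t rs_start;    /* start of a free segment */
        uint64_t rs_end;      /* end of the segment */
        struct rangeseg_sketch *rs_left, *rs_right;   /* tree linkage */
} rangeseg_sketch_t;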

Chris
Prakash Surya
2013-09-04 14:40:43 UTC
Post by Christopher Siden
http://cr.illumos.org/~webrev/csiden/illumos-sm/
This is a metaslab/spacemap refactoring by George Wilson. See the full
- The spacemap code has been refactored to have separate data
structures for the on-disk data structure (called spacemaps in the new
code) and the in-memory data structure (called rangetrees in the new
code). This should aid in understanding the code.
- Metaslabs are preloaded asynchronously to avoid blocking on reads in
hot code paths when we realize we need to load new metaslabs. This has
shown to be a performance improvement in our performance testing.
I haven't looked at the changes yet, so I apologize if it's obvious from
the patch, but could you elaborate on this a bit more? How/Why does this
improve performance, and for what workloads?
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
leading us to continually load the same fragmented space maps over and
over again. The allocation algorithm that uses this information is
disabled by default and can be enabled via a tunable. It will become
the default allocator once George is satisfied with the amount of
performance testing it has received. We have extensively tested this
code both with the tunable enabled and disabled. It is not possible to
separate out spacemap_histogram from the rest of this refactoring
because they were done together.
This sounds very interesting. We've run into workloads where pulling
metaslabs from disk becomes a limiting factor, so much so that we've
turned on "metaslab_debug" to work around this (for now). We believe
it's due to fragmented metaslabs, but haven't confirmed that.

Are you using anything more sophisticated than zdb for peeking into the
internals of the metaslab when testing? I'd love to have more
tools/knowledge for metaslab performance debugging.
--
Cheers, Prakash
Post by Christopher Siden
Chris
George Wilson
2013-09-04 14:58:03 UTC
Post by Prakash Surya
Post by Christopher Siden
http://cr.illumos.org/~webrev/csiden/illumos-sm/
This is a metaslab/spacemap refactoring by George Wilson. See the full
- The spacemap code has been refactored to have separate data
structures for the on-disk data structure (called spacemaps in the new
code) and the in-memory data structure (called rangetrees in the new
code). This should aid in understanding the code.
- Metaslabs are preloaded asynchronously to avoid blocking on reads in
hot code paths when we realize we need to load new metaslabs. This has
shown to be a performance improvement in our performance testing.
I haven't looked at the changes yet, so I apologize if it's obvious from
the patch, but could you elaborate on this a bit more? How/Why does this
improve performance, and for what workloads?
ZFS sorts the metaslabs by a weight (today that's mostly based on
the amount of free space in the metaslab). During allocation it will
pick a metaslab and attempt to allocate from it. If that metaslab is
loaded then there is no I/O penalty, but if ZFS can't find a suitable
region in that metaslab then it must load a new one. Depending on the
metaslab this can result in lots of random reads. I've measured
allocation times as high as 250ms+ for a single load of a space map.
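
A minimal sketch of the loop being described (hypothetical names and
fields; the real logic lives in metaslab_group_alloc() and is more
involved):

#include <stdint.h>
#include <stddef.h>

/* Stand-in type; not the real metaslab structure. */
typedef struct ms_sketch {
        int      ms_loaded;   /* has the space map been replayed into memory? */
        uint64_t ms_weight;   /* today: mostly total free space */
        uint64_t ms_maxfree;  /* largest free segment, known once loaded */
} ms_sketch_t;

/* Walk metaslabs in descending weight order; return the index that worked. */
static size_t
alloc_from_group_sketch(ms_sketch_t *ms, size_t count, uint64_t asize)
{
        for (size_t i = 0; i < count; i++) {
                if (!ms[i].ms_loaded) {
                        /*
                         * Synchronous space map load: potentially lots of
                         * random reads -- the 250ms+ stalls measured above.
                         */
                        ms[i].ms_loaded = 1;
                }
                /*
                 * With only a free-space total as the weight, a fragmented
                 * metaslab can be chosen and loaded here yet still fail the
                 * allocation, pushing us on to load the next candidate.
                 */
                if (ms[i].ms_maxfree >= asize)
                        return (i);
        }
        return (count);       /* nothing fit; caller retries elsewhere */
}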
Post by Prakash Surya
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
leading us to continually load the same fragmented space maps over and
over again. The allocation algorithm that uses this information is
disabled by default and can be enabled via a tunable. It will become
the default allocator once George is satisfied with the amount of
performance testing it has received. We have extensively tested this
code both with the tunable enabled and disabled. It is not possible to
separate out spacemap_histogram from the rest of this refactoring
because they were done together.
This sounds very interesting. We've run into workloads where pulling
metaslabs from disk becomes a limiting factor, so much so that we've
turned on "metaslab_debug" to work around this (for now). We believe
it's due to fragmented metaslabs, but haven't confirmed that.
I've done the same thing. Part of the idea behind the histogram is to
build a more intelligent heuristic for choosing which space maps to
load so that they can provide faster allocations.
Post by Prakash Surya
Are you using anything more sophisticated than zdb for peeking into the
internals of the metaslab when testing? I'd love to have more
tools/knowledge for metaslab performance debugging.
Right now I've been using zdb but am working on a new mdb command to get
in-core stats.

Thanks,
George
Prakash Surya
2013-09-04 15:39:03 UTC
Post by George Wilson
Post by Prakash Surya
Post by Christopher Siden
http://cr.illumos.org/~webrev/csiden/illumos-sm/
This is a metaslab/spacemap refactoring by George Wilson. See the full
- The spacemap code has been refactored to have separate data
structures for the on-disk data structure (called spacemaps in the new
code) and the in-memory data structure (called rangetrees in the new
code). This should aid in understanding the code.
- Metaslabs are preloaded asynchronously to avoid blocking on reads in
hot code paths when we realize we need to load new metaslabs. This has
shown to be a performance improvement in our performance testing.
I haven't looked at the changes yet, so I apologize if it's obvious from
the patch, but could you elaborate on this a bit more? How/Why does this
improve performance, and for what workloads?
ZFS sorts the metaslab based on a weight (today that's mostly based
on the amount of free space in the metaslab). During allocation it
will pick a metaslab and attempt to allocate from it. If that
metaslab is loaded then there is no I/O penalty, but if ZFS can't
find a region in that metaslab then it must load in a new one.
Depending on the metaslab this could result in lots of random reads.
I've measured allocation times as high as 250ms+ for a single load
of a space map.
Thanks, that matches up with my understanding. I'm still a little confused
as to what exactly "preloaded asynchronously" means. Doesn't the
metaslab have to be loaded to make an allocation from it? So, I'm
curious how this can be done asynchronously, since the allocation depends
on the load completing. Anyways, I'll have a look at the patch to clear
up my questions.
Post by George Wilson
Post by Prakash Surya
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
leading us to continually load the same fragmented space maps over and
over again. The allocation algorithm that uses this information is
disabled by default and can be enabled via a tunable. It will become
the default allocator once George is satisfied with the amount of
performance testing it has received. We have extensively tested this
code both with the tunable enabled and disabled. It is not possible to
separate out spacemap_histogram from the rest of this refactoring
because they were done together.
This sounds very interesting. We've run into workloads where pulling
metaslabs from disk becomes a limiting factor, so much so that we've
turned on "metaslab_debug" to work around this (for now). We believe
it's due to fragmented metaslabs, but haven't confirmed that.
I've done the same thing. Part of the idea behind the histogram is
to build more intelligent heuristic to load space maps that can
provide faster allocations.
All good! I remember having a brainstorming discussion with Brian about
doing something very similar to what you've done here. Glad to see
somebody else with the same idea, and even better, an implementation!
Post by George Wilson
Post by Prakash Surya
Are you using anything more sophisticated than zdb for peeking into the
internals of the metaslab when testing? I'd love to have more
tools/knowledge for metaslab performance debugging.
Right now I've been using zdb but am working on new mdb command to
get in-core stats.
Sigh.. I wish I had that on Linux. :(

Anyways, it sounds like good work from the description. I'm eager to look
it over and potentially get it into the Linux port. I'm curious how it
will affect our server performance.
--
Cheers, Prakash
Post by George Wilson
Thanks,
George
George Wilson
2013-09-04 16:25:05 UTC
Post by Prakash Surya
Post by George Wilson
Post by Prakash Surya
Post by Christopher Siden
http://cr.illumos.org/~webrev/csiden/illumos-sm/
This is a metaslab/spacemap refactoring by George Wilson. See the full
- The spacemap code has been refactored to have separate data
structures for the on-disk data structure (called spacemaps in the new
code) and the in-memory data structure (called rangetrees in the new
code). This should aid in understanding the code.
- Metaslabs are preloaded asynchronously to avoid blocking on reads in
hot code paths when we realize we need to load new metaslabs. This has
shown to be a performance improvement in our performance testing.
I haven't looked at the changes yet, so I apologize if it's obvious from
the patch, but could you elaborate on this a bit more? How/Why does this
improve performance, and for what workloads?
ZFS sorts the metaslab based on a weight (today that's mostly based
on the amount of free space in the metaslab). During allocation it
will pick a metaslab and attempt to allocate from it. If that
metaslab is loaded then there is no I/O penalty, but if ZFS can't
find a region in that metaslab then it must load in a new one.
Depending on the metaslab this could result in lots of random reads.
I've measured allocation times as high as 250ms+ for a single load
of a space map.
Thanks, that matches up with my understanding. I'm still a little confused
as to what exactly "preloaded asynchronously" means. Doesn't the
metaslab have to be loaded to make an allocation from it? So, I'm
curious how this can be done asynchronously, since the allocation depends
on the load completing. Anyways, I'll have a look at the patch to clear
up my questions.
Once a transaction group completes we process the metaslabs and find the
"best" metaslabs to preload. The preload happens in the context of a
taskq at the end of spa_sync(). This moves the loading from the hot path
during allocation to the end of spa_sync(). The idea is that they should
be available for the next round of allocations.
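
For the curious, a rough sketch of that flow (taskq_dispatch() is the
illumos kernel interface from <sys/taskq.h>, but the metaslab type and
helper names here are placeholders, not code from the webrev):

/* Hypothetical stand-in for the metaslab being preloaded. */
typedef struct ms_preload_sketch {
        int ms_loaded;
        /* ... weight, space map handle, range tree, etc. ... */
} ms_preload_sketch_t;

static void
preload_one_sketch(void *arg)
{
        ms_preload_sketch_t *msp = arg;
        /* Replay the space map into the range tree, off the allocation path. */
        msp->ms_loaded = 1;
}

/* Invoked from the tail of spa_sync(), once the txg's work is done. */
static void
preload_best_sketch(ms_preload_sketch_t **best, int n, taskq_t *tq)
{
        for (int i = 0; i < n; i++) {
                if (best[i]->ms_loaded)
                        continue;
                /* Queue the load so the next txg's allocations find it hot. */
                (void) taskq_dispatch(tq, preload_one_sketch, best[i],
                    TQ_SLEEP);
        }
}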
Post by Prakash Surya
Post by George Wilson
Post by Prakash Surya
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
leading us to continually load the same fragmented space maps over and
over again. The allocation algorithm that uses this information is
disabled by default and can be enabled via a tunable. It will become
the default allocator once George is satisfied with the amount of
performance testing it has received. We have extensively tested this
code both with the tunable enabled and disabled. It is not possible to
separate out spacemap_histogram from the rest of this refactoring
because they were done together.
This sounds very interesting. We've run into workloads where pulling
metaslabs from disk becomes a limiting factor, so much so that we've
turned on "metaslab_debug" to work around this (for now). We believe
it's due to fragmented metaslabs, but haven't confirmed that.
I've done the same thing. Part of the idea behind the histogram is
to build more intelligent heuristic to load space maps that can
provide faster allocations.
All good! I remember having a brainstorming discussion with Brian about
doing something very similar to what you've done here. Glad to see
somebody else with the same idea, and even better, an implementation!
Right now I'm working on ways that we can use the histogram data. The
biggest benefit of this wad is that we will be able to collect
information about how the free space in a given metaslab is composed,
which will hopefully give us the data we need to make smarter choices.
Post by Prakash Surya
Post by George Wilson
Post by Prakash Surya
Are you using anything more sophisticated than zdb for peeking into the
internals of the metaslab when testing? I'd love to have more
tools/knowledge for metaslab performance debugging.
Right now I've been using zdb but am working on new mdb command to
get in-core stats.
Sigh.. I wish I had that on Linux. :(
Anyways, it sounds like good work from the description. I'm eager to look
it over and potentially get it into the Linux port. I'm curious how it
will affect our server performance.
If you try this out I would love to hear some feedback.

Thanks,
George
Jim Klimov
2013-09-04 15:12:29 UTC
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?

If yes - I'd love to see this update included in distros, and also
to have more information about enabling this allocator :)

Are there estimates of the overheads (how much would the histograms
use on disk and in processing? Perhaps I am drawing a parallel to the
DDT's appetite, which renders it useless on smaller systems), as well
as whether there are any potential dangers to the data in the new
allocator - anything such that you would feel uneasy enabling it just
now on your production pool or a home NAS with the family's history of
photos?

Thanks,
//Jim
George Wilson
2013-09-04 16:33:18 UTC
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs
that you preload. This does lay the foundation to make further
improvements in this area. We've been focused on this problem for some
time and this is the first round of improvements. Note that preloading
more metaslabs comes at a cost: you will use more memory to hold this
metadata.

The goal with this initial wad was to give us more information and
improve the areas we know are problematic. There is quite a bit of code
here that is foundation work for future changes as we know that we've
not solved the problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats, so I'm not expecting any noticeable overhead as a
result of this feature. The histogram information is stored in the
bonus buffer of the space map object, which is always loaded when
you're doing allocations. I had to increase the size of the bonus
buffer, so as systems age they will upgrade existing space maps to the
new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm
working on a way to force space map upgrades to happen on demand, but
that's not in this wad.
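
A rough picture of what that looks like on disk (the bucket count and
field names below are my guesses for illustration, not the layout from
the webrev):

#include <stdint.h>
#include <stddef.h>

#define SM_HISTOGRAM_BUCKETS_SKETCH 32  /* one bucket per power-of-two size class */

/* Illustrative bonus-buffer contents for a space map object. */
typedef struct sm_phys_sketch {
        uint64_t smp_alloc;   /* net allocated space -- roughly the old summary */
        /* The larger bonus buffer adds free-segment counts by size class: */
        uint64_t smp_histogram[SM_HISTOGRAM_BUCKETS_SKETCH];
} sm_phys_sketch_t;

/*
 * Old-format space maps have a bonus buffer too small for the histogram;
 * they get rewritten with the larger bonus buffer as the system ages.
 */
static int
sm_has_histogram_sketch(size_t bonus_len)
{
        return (bonus_len >= sizeof (sm_phys_sketch_t));
}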

Thanks,
George
Post by Jim Klimov
Thanks,
//Jim
Christopher Siden
2013-09-06 22:50:10 UTC
Is anyone actively reviewing this who needs more time? If not, I plan
to RTI it in a day or so.

Thanks,
Chris
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs that
you preload. This does lay the foundation to make further improvements in
this area. We've been focused on this problem for some time and this is the
first round of improvements. Note that the cost of preloading more metaslabs
means that you will use more memory to hold this metadata.
The goal with this initial wad was to give us more information and improve
the areas we know are problematic. There is quite a bit of code here that is
foundation work for future changes as we know that we've not solved the
problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats so I'm not expecting any noticeable overhead that you would
incur as a result of this feature. The way that the histogram information
works is that it's stored in the bonus buffer for the space map object which
is always loaded when you're doing allocations. I had to increase the size of
the bonus buffer so as systems age they will upgrade existing space maps to
the new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm working on
a way to force space map upgrades to happen on demand but that's not in this
wad.
Thanks,
George
Post by Jim Klimov
Thanks,
//Jim
Boris Protopopov
2013-09-06 23:24:21 UTC
Can you wait till Tuesday next week, Chris?
If not, please go ahead,
Boris

Typos courtesy of my iPhone
Post by Christopher Siden
Is anyone actively reviewing this and needs more time? If not I plan
to RTI it in a day or so.
Thanks,
Chris
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs that
you preload. This does lay the foundation to make further improvements in
this area. We've been focused on this problem for some time and this is the
first round of improvements. Note that the cost of preloading more metaslabs
means that you will use more memory to hold this metadata.
The goal with this initial wad was to give us more information and improve
the areas we know are problematic. There is quite a bit of code here that is
foundation work for future changes as we know that we've not solved the
problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats so I'm not expecting any noticeable overhead that you would
incur as a result of this feature. The way that the histogram information
works is that it's stored in the bonus buffer for the space map object which
is always loaded when you're doing allocations. I had to increase the size of
the bonus buffer so as systems age they will upgrade existing space maps to
the new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm working on
a way to force space map upgrades to happen on demand but that's not in this
wad.
Thanks,
George
Post by Jim Klimov
Thanks,
//Jim
Richard Yao
2013-09-07 18:02:51 UTC
I would also like more time to review the disk format change.

Also, by next week, I hope you are referring to the 17th and not the
10th. The 10th only buys us an extra day.
Post by Boris Protopopov
Can you wait till Tuesday next week, Chris?
If not, please go ahead,
Boris
Typos courtesy of my iPhone
Post by Christopher Siden
Is anyone actively reviewing this and needs more time? If not I plan
to RTI it in a day or so.
Thanks,
Chris
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs that
you preload. This does lay the foundation to make further improvements in
this area. We've been focused on this problem for some time and this is the
first round of improvements. Note that the cost of preloading more metaslabs
means that you will use more memory to hold this metadata.
The goal with this initial wad was to give us more information and improve
the areas we know are problematic. There is quite a bit of code here that is
foundation work for future changes as we know that we've not solved the
problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats so I'm not expecting any noticeable overhead that you would
incur as a result of this feature. The way that the histogram information
works is that it's stored in the bonus buffer for the space map object which
is always loaded when you're doing allocations. I had to increase the size of
the bonus buffer so as systems age they will upgrade existing space maps to
the new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm working on
a way to force space map upgrades to happen on demand but that's not in this
wad.
Thanks,
George
Post by Jim Klimov
Thanks,
//Jim

j***@cos.ru
2013-09-07 10:35:15 UTC
On a side note, since this fix can help against the performance collapse: would it be possible, and would it help, to chop the pool into a larger number of smaller metaslabs? Perhaps limiting metaslab size in gigabytes rather than always carving a pool (tlvdev) into 200 pieces regardless of whether it spans gigabytes or petabytes?


Typos courtesy of my Samsung Mobile

-------- Original message --------
From: Christopher Siden <***@delphix.com>
Date: 2013.09.07 0:50 (GMT+01:00)
To: ***@lists.illumos.org
Cc: Jim Klimov <***@cos.ru>
Subject: Re: [zfs] spacemap/metaslab work

Is anyone actively reviewing this and needs more time? If not I plan
to RTI it in a day or so.

Thanks,
Chris
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs that
you preload. This does lay the foundation to make further improvements in
this area. We've been focused on this problem for some time and this is the
first round of improvements. Note that the cost of preloading more metaslabs
means that you will use more memory to hold this metadata.
The goal with this initial wad was to give us more information and improve
the areas we know are problematic. There is quite a bit of code here that is
foundation work for future changes as we know that we've not solved the
problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats so I'm not expecting any noticeable overhead that you would
incur as a result of this feature. The way that the histogram information
works is that it's stored in the bonus buffer for the space map object which
is always loaded when you're doing allocations. I had to increase the size of
the bonus buffer so as systems age they will upgrade existing space maps to
the new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm working on
a way to force space map upgrades to happen on demand but that's not in this
wad.
Thanks,
George
Post by Jim Klimov
Thanks,
//Jim
Richard Yao
2013-09-07 13:43:06 UTC
I am skeptical that having a larger number of smaller metaslabs is the
answer. Doing that in the manner that you suggest poses a few problems.

1. Limiting the size of a metaslab to something in the gigabyte range
would cause problems for pools containing less than that amount of
storage, or only slightly more. For example, if the metaslab size were
fixed at 5GB, then anything with less than 5GB of space would not get a
single metaslab (and would likely crash and burn), and a pool made on a
9GB disk would have only 5GB available for use.

2. Increasing the number of metaslabs in the manner you suggest also has
implications for memory usage. At 200 metaslabs, the amount of memory
used is likely less than a few megabytes in the worst case (high
fragmentation). If the number is increased significantly, I would expect
the worst case memory usage to begin to compete with zfs_arc_min on some
systems.

I do not have a definitive answer to this problem, but a few ideas occur
to me:

1. Use a best-fit algorithm for allocation from metaslabs, regardless of
the amount of space available.
2. Decouple the space maps from the physical metaslabs, such that it
would be possible to increase/decrease the effective number of metaslabs
at pool import, while still using 200 metaslabs on the actual disk. This
would allow experiments with more/fewer metaslabs to be carried out.
3. Expand on Chris Siden's space/metaslab work to implement a global
lock-less data structure (e.g. like Linux RCU) to track the sizes in
each metaslab so that metaslabs can be selected via a best-fit algorithm.
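
As a toy illustration of ideas (1) and (3) - selecting a metaslab by
best fit against a cached largest-free-segment value rather than by raw
free space - purely a sketch, not code from the webrev or from ZFS:

#include <stdint.h>
#include <stddef.h>

/*
 * maxfree[i] caches the largest contiguous free segment in metaslab i.
 * Return the index of the tightest fit that can still hold asize, or
 * count if nothing fits.
 */
static size_t
best_fit_metaslab_sketch(const uint64_t *maxfree, size_t count, uint64_t asize)
{
        size_t best = count;

        for (size_t i = 0; i < count; i++) {
                if (maxfree[i] < asize)
                        continue;               /* too fragmented for this request */
                if (best == count || maxfree[i] < maxfree[best])
                        best = i;               /* tighter fit found */
        }
        return (best);
}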
Post by j***@cos.ru
On a side note, since this fix can help against the performance collapse: would it be possible, and would it help, to chop the pool into a larger amount of smaller metaslabs? Perhaps, limiting size in gigabytes rather than 200 pieces on a pool (tlvdev) regardless of any size from gigabytes to petabytes?
Typos courtesy of my Samsung Mobile
-------- Original message --------
Date: 2013.09.07 0:50 (GMT+01:00)
Subject: Re: [zfs] spacemap/metaslab work
Is anyone actively reviewing this and needs more time? If not I plan
to RTI it in a day or so.
Thanks,
Chris
Post by Jim Klimov
Post by Christopher Siden
- There is a new spacemap_histogram on-disk feature flag. When it is
enabled spacemaps store more data about the amount of contiguous free
space in metaslabs. The current disk format only stores the total
amount of free space, which means that heavily fragmented metaslabs
can look appealing, causing us to read them off disk, even though they
don't have enough contiguous free space to satisfy large allocations,
In layman's terms, does this (at least partially) solve the known yet
elusive degradation of ZFS performance after some percentage of the
pool has been filled (empirically 70-90%, based on a particular pool's
previous history)?
This definitely helps. You may need to increase the number of metaslabs that
you preload. This does lay the foundation to make further improvements in
this area. We've been focused on this problem for some time and this is the
first round of improvements. Note that the cost of preloading more metaslabs
means that you will use more memory to hold this metadata.
The goal with this initial wad was to give us more information and improve
the areas we know are problematic. There is quite a bit of code here that is
foundation work for future changes as we know that we've not solved the
problem completely.
Post by Jim Klimov
If yes - I'd love to see this update included into distros, and also
to have more information about enabling this allocator :)
Are there estimates about overheads (how much would the histograms
use on-disk and in-processing, perhaps I am drawing on DDT's appetite
which renders it useless on smaller systems), as well as if any
potential dangers to the data are possible in the new allocator -
anything such that you would feel uneasy enabling it just now on your
production pool or a home-NAS with the family history of photos?
The histogram for the space map is stored in the same block as its
allocation stats so I'm not expecting any noticeable overhead that you would
incur as a result of this feature. The way that the histogram information
works is that it's stored in the bonus buffer for the space map object which
is always loaded when you're doing allocations. I had to increase the size of
the bonus buffer so as systems age they will upgrade existing space maps to
the new version and start to store histogram information. The old bonus
buffer does not have enough space to store this information. I'm working on
a way to force space map upgrades to happen on demand but that's not in this
wad.
Thanks,
George
Post by Jim Klimov
Thanks,
//Jim