Post by Richard Yao
Understanding how the space maps are represented seems to be a
prerequisite for understanding how the block allocation code works. I
started reading the fine code to answer a question from IRC about how
well ZFS' block allocator interacts with SSDs, but I am having trouble
deciphering the representation of space maps in AVL trees.
Is there any documentation on how space maps are represented in AVL trees?
I don't know of better documentation than the old ZFS on-disk spec
book, code comments and blogs, and the list archives. I can share the
concepts - which you probably know already, since the idea is simple
(and your question is about the implementation - which I don't know
very well).
The idea is that ZFS stores (on-disk and in-memory) a map of free and
used space in the pool in a dynamic structure, not the bitmap used by
legacy filesystems, which does not scale well. I believe, in fact, that
each top-level vdev is split into roughly 200 ranges that are accounted
for separately.
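
Here is a sketch (my assumption of the mechanism, not the actual
vdev.c code) of how a top-level vdev could end up split into roughly
200 such ranges: pick a power-of-two range size so that the vdev's
usable size divided by it lands near 200; because of the rounding to a
power of two, the resulting count comes out somewhere between ~100 and
~200. IIRC the real code also clamps the shift to sane limits.

#include <stdint.h>

static int
highbit64(uint64_t x)		/* position of highest set bit, 1-based */
{
	int h = 0;
	while (x != 0) {
		h++;
		x >>= 1;
	}
	return (h);
}

static uint64_t
range_count(uint64_t asize)	/* asize = usable bytes of the tlvdev */
{
	int ms_shift = highbit64(asize / 200);	/* aim for ~200 ranges */
	return (asize >> ms_shift);		/* resulting count */
}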
On disk these spacemaps are stored as a log to which simple entries
are appended at the tail - "allocated range [X-Y]", "freed range
[M-N]". In memory they are stored sorted in a tree, which makes it
possible to eliminate redundant entries when "bubbles" of the same
type (free or used space) "collide". Every once in a while this
in-memory representation is written back to disk, hopefully reducing
both the amount of space needed to store the map and the number of
iterations needed to traverse it - allocations of new blocks have to
find the optimal free-space "bubble" in the map. Over the lifetime of
the pool, the spacemaps start out trivially small (everything is one
free range), then expand to many entries, then collapse back to
trivially small (everything is one allocated range, or close to that).
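
To make the "colliding bubbles" part concrete, here is a toy
illustration of replaying the spacemap log into a sorted in-memory
structure. ZFS keeps the segments in an AVL tree; a sorted
singly-linked list is used here only to keep the sketch short, and the
names are made up - this is not the real space map / metaslab code.
Each "freed range" log entry is merged with any free segment it
touches, which is how the redundant entries disappear; an "allocated
range" entry would do the inverse and cut a piece back out.

#include <stdlib.h>
#include <stdint.h>

typedef struct seg {
	uint64_t	start, end;	/* free range [start, end) */
	struct seg	*next;
} seg_t;

/* Replay one "freed range [start, end)" entry into the sorted list. */
static seg_t *
replay_free(seg_t *head, uint64_t start, uint64_t end)
{
	seg_t **pp = &head, *s;

	/* skip segments that end strictly before the new range starts */
	while ((s = *pp) != NULL && s->end < start)
		pp = &s->next;

	/* swallow every segment overlapping or adjoining [start, end) */
	while ((s = *pp) != NULL && s->start <= end) {
		if (s->start < start)
			start = s->start;
		if (s->end > end)
			end = s->end;
		*pp = s->next;
		free(s);
	}

	/* insert the single coalesced segment back into the list */
	s = malloc(sizeof (*s));
	s->start = start;
	s->end = end;
	s->next = *pp;
	*pp = s;
	return (head);
}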
On a side note, I think these "ranges" (are they called metaslabs?)
exist for a couple of reasons - to provide favourable locations for
faster IO (metadata, etc.) versus slower bulk IO, and to avoid
pool-wide bookkeeping by keeping separate tabs on smaller ranges of
the whole addressable space. The choice of approximately 200 ranges
per tlvdev is a historical fact, and whether it is still reasonable
today is debatable. Possibly, with the larger disks and pools of
today, the accounting of these larger-than-before ranges becomes
exactly the heavy task that splitting into ranges was meant to
prevent. My private guess is that it may be responsible for the
performance degradation on "nearly full" pools which some people see
past 70% usage - and which we have not seen on a server with ~100GB
of 8TB remaining free for over a year of intensive COW IO (adding and
deleting backups several times per day).
Maybe the choice of the number of ranges should also take some
reasonable maximum range size into account, so that walking the tree
doesn't take forever... and perhaps keep some helper structure (if
this is not already done) with the total free space and the largest
available stretch of space per range, so that the allocation algorithm
only drills down into those spacemaps where it is known to have a
chance of finding a positive result - see the sketch below. ZFS can
also store a block broken into smaller pieces if there is not enough
contiguous free space for the full block size (see gang blocks), which
may be another vector for an abrupt drop in performance.
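
For illustration only, the helper structure suggested above could look
like the sketch below - per-range summary counters consulted before
walking a range's tree. The names are made up and this is not a claim
about what metaslab.c actually keeps; it just shows the idea of a
cheap pre-check before descending into a spacemap.

#include <stdint.h>
#include <stdbool.h>

typedef struct range_summary {
	uint64_t	rs_free_total;	/* sum of all free segments      */
	uint64_t	rs_free_max;	/* largest contiguous free chunk */
} range_summary_t;

/* Cheap pre-check: can this range possibly satisfy the request? */
static bool
range_can_allocate(const range_summary_t *rs, uint64_t size)
{
	return (rs->rs_free_max >= size);
}

/* Pick the first of 'count' candidate ranges worth searching. */
static int
range_pick(const range_summary_t *rs, int count, uint64_t size)
{
	for (int i = 0; i < count; i++) {
		if (range_can_allocate(&rs[i], size))
			return (i);	/* walk this range's tree next */
	}
	return (-1);			/* nothing fits: gang block time? */
}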
HTH,
//Jim Klimov