Post by Richard Yao
Understanding how the space maps are represented seems to be a
prerequisite for understanding how the block allocation code works. I
started reading the fine code to answer a question from IRC about how
well ZFS' block allocator interacts with SSDs, but I am having trouble
deciphering the representation of space maps in AVL trees.
Is there any documentation on how space maps are represented in AVL trees?
I don't know of better documentation than the old ZFS on-disk spec
book, code comments and blogs, and the list archives. I can share the
concepts - which you probably know already, since the idea is simple
(and your question is about the implementation - which I don't know
very well).
The idea is that ZFS stores (on-disk and in-memory) a map of free and
used space in the pool in a dynamic structure, not the bitmap used by
legacy filesystems, which does not scale well. I believe, in fact, that
each top-level vdev is split into roughly 200 ranges that are accounted
for separately.
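
Here is a sketch (my assumption of the mechanism, not the actual
vdev.c code) of how a top-level vdev could end up split into roughly
200 such ranges: pick a power-of-two range size so that the vdev's
usable size divided by it lands near 200; because of the rounding to a
power of two, the resulting count comes out somewhere between ~100 and
~200. IIRC the real code also clamps the shift to sane limits.

#include <stdint.h>

static int
highbit64(uint64_t x)		/* position of highest set bit, 1-based */
{
	int h = 0;
	while (x != 0) {
		h++;
		x >>= 1;
	}
	return (h);
}

static uint64_t
range_count(uint64_t asize)	/* asize = usable bytes of the tlvdev */
{
	int ms_shift = highbit64(asize / 200);	/* aim for ~200 ranges */
	return (asize >> ms_shift);		/* resulting count */
}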
On disk these spacemaps are stored as a log to which simple entries
are appended at the tail - "allocated range [X-Y]", "freed range
[M-N]". In memory they are stored sorted in a tree, which makes it
possible to eliminate redundant entries when "bubbles" of the same
type (free or used space) "collide". Every once in a while this
in-memory representation is written back to disk, hopefully reducing
both the amount of space needed to store the map and the number of
iterations needed to traverse it - allocations of new blocks have to
find the optimal free-space "bubble" in the map. Over the lifetime of
the pool, the spacemaps start out trivially small (everything is one
free range), then expand to many entries, then collapse back to
trivially small (everything is one allocated range, or close to that).
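
To make the "colliding bubbles" part concrete, here is a toy
illustration of replaying the spacemap log into a sorted in-memory
structure. ZFS keeps the segments in an AVL tree; a sorted
singly-linked list is used here only to keep the sketch short, and the
names are made up - this is not the real space map / metaslab code.
Each "freed range" log entry is merged with any free segment it
touches, which is how the redundant entries disappear; an "allocated
range" entry would do the inverse and cut a piece back out.

#include <stdlib.h>
#include <stdint.h>

typedef struct seg {
	uint64_t	start, end;	/* free range [start, end) */
	struct seg	*next;
} seg_t;

/* Replay one "freed range [start, end)" entry into the sorted list. */
static seg_t *
replay_free(seg_t *head, uint64_t start, uint64_t end)
{
	seg_t **pp = &head, *s;

	/* skip segments that end strictly before the new range starts */
	while ((s = *pp) != NULL && s->end < start)
		pp = &s->next;

	/* swallow every segment overlapping or adjoining [start, end) */
	while ((s = *pp) != NULL && s->start <= end) {
		if (s->start < start)
			start = s->start;
		if (s->end > end)
			end = s->end;
		*pp = s->next;
		free(s);
	}

	/* insert the single coalesced segment back into the list */
	s = malloc(sizeof (*s));
	s->start = start;
	s->end = end;
	s->next = *pp;
	*pp = s;
	return (head);
}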
On a side note, I think these "ranges" (are they called metaslabs?)
exist for a couple of reasons - to provide favourable locations for
faster IO (metadata, etc.) versus slower bulk IO, and to avoid
pool-wide bookkeeping by keeping separate tabs on smaller ranges of
the whole addressable space. The choice of approximately 200 ranges
per tlvdev is a historical fact, and whether it is still reasonable
today is debatable. Possibly, with the larger disks and pools of
today, the accounting of these larger-than-before ranges becomes
exactly the heavy task that splitting into ranges was meant to
prevent. My private guess is that it may be responsible for the
performance degradation on "nearly full" pools which some people see
past 70% usage - and which we have not seen on a server with ~100GB
of 8TB remaining free for over a year of intensive COW IO (adding and
deleting backups several times per day).
Maybe the choice of the number of ranges should also take some
reasonable maximum range size into account, so that walking the tree
doesn't take forever... and perhaps keep some helper structure (if
this is not already done) with the total free space and the largest
available stretch of space per range, so that the allocation algorithm
only drills down into those spacemaps where it is known to have a
chance of finding a positive result - see the sketch below. ZFS can
also store a block broken into smaller pieces if there is not enough
contiguous free space for the full block size (see gang blocks), which
may be another vector for an abrupt drop in performance.
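
For illustration only, the helper structure suggested above could look
like the sketch below - per-range summary counters consulted before
walking a range's tree. The names are made up and this is not a claim
about what metaslab.c actually keeps; it just shows the idea of a
cheap pre-check before descending into a spacemap.

#include <stdint.h>
#include <stdbool.h>

typedef struct range_summary {
	uint64_t	rs_free_total;	/* sum of all free segments      */
	uint64_t	rs_free_max;	/* largest contiguous free chunk */
} range_summary_t;

/* Cheap pre-check: can this range possibly satisfy the request? */
static bool
range_can_allocate(const range_summary_t *rs, uint64_t size)
{
	return (rs->rs_free_max >= size);
}

/* Pick the first of 'count' candidate ranges worth searching. */
static int
range_pick(const range_summary_t *rs, int count, uint64_t size)
{
	for (int i = 0; i < count; i++) {
		if (range_can_allocate(&rs[i], size))
			return (i);	/* walk this range's tree next */
	}
	return (-1);			/* nothing fits: gang block time? */
}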
HTH,
//Jim Klimov