Discussion:
[developer] speeding up zpool import
Matthew Ahrens via illumos-developer
2014-10-16 20:12:29 UTC
Permalink
(resend to cc a few more lists -- sorry)

I think the overall idea of this change is sound -- and that's a great
performance improvement.

Can you change dmu_objset_find_parallel() to work like
dmu_objset_find_dp(), and then change zil_check_log_chain(0 and zil_claim()
to work off of the dsl_dataset_t*, rather than the char*name? Then I think
the other threads won't need to grab the namespace lock.

--matt

On Thu, Oct 16, 2014 at 12:05 PM, Arne Jansen via illumos-developer <
In our setup zpool import takes a very long time (about 15 minutes) due
to a large count of filesystems (> 5000).
During this time, you can see the disks at only a few percent busy in
iostat, while the pool is at 100%, which indicates a single-threaded
import process.
What's causing this is checking of the zil chains for each filesystem.
Each filesystem is visisted twice, once to check the chain and once
to claim it. This is done in a sequential manner.
I've created a proof-of-concept patch that switches this process to a
parallel enumeration using a taskq.
In my setup this speeds up the import process by a factor of 20,
bringing all disks (30 in my case) to nearly 100% busy.
There's only one problem: locking. spa_open is called with
spa_namespace_lock held. With this, later on zil_check_log_chain and
zil_claim are called. These call in turn spa_open, which tries to claim
spa_namespace_lock again. This problem is not new. The current solution
/*
* As disgusting as this is, we need to support recursive calls to this
* function because dsl_dir_open() is called during spa_load(), and ends
* up calling spa_open() again. The real fix is to figure out how to
* avoid dsl_dir_open() calling this in the first place.
*/
if (mutex_owner(&spa_namespace_lock) != curthread) {
mutex_enter(&spa_namespace_lock);
locked = B_TRUE;
}
This doesn't work anymore when I call zil_claim/check_log_chain through
a taskq and thus from a different thread. My current hacky solution is
to pass a flag through the call chain to spa_open to signal that the
lock isn't needed. It works, but it doesn't look too nice...
My question now is how to build a cleaner solution. The snippet above
is already ugly in itself, so it would be good to get rid of it at all.
The functions involved in the call chains are all very commonly used
functions. Changing the signature of those will probably give a quite
bulky patch, so I'd prefer to find a nice small unintrusive way.
http://cr.illumos.org/~webrev/sensille/find_parallel/
Hopefully you can give me some hints on how to solve this recursive
locking riddle...
-Arne
-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/
21175174-cd73734d
Modify Your Subscription: https://www.listbox.com/
member/?&
Powered by Listbox: http://www.listbox.com
-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175072-86d49504
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175072&id_secret=21175072-abdf7b7e
Powered by Listbox: http://www.listbox.com

Loading...