Bob via illumos-zfs
2014-09-25 05:16:34 UTC
Hello All,
I have a couple of questions about zio error handling in illumos.
Recently a couple of my systems (oi151a8, admittedly rather old) have had
trouble with zio: the system stops responding to any command, even login,
or hangs while executing zpool-related commands. Some of these incidents
were caused by pulling a disk out; some were not.
I know ZFS handles I/O suspend/resume by design, but I don't know the
details well. Here is what I gathered from studying the code and testing:
1. When a zio hits a disk error, it is kept under spa_suspend_zio_root and
waits to be resumed (once the disk is back in service, zpool clear triggers
the resume).
2. If zpool clear is never executed, or the disk stays out of service, all
zpool-related operations hang (more precisely, every command that takes
spa_namespace_lock). A rough userland model of how I picture this
suspend/resume pattern is sketched below.
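
To check my understanding of that flow, here is a minimal userland sketch
(pthreads only) of the pattern as I picture it. This is not illumos code:
the names suspended_io() and resume_pool() are made up for illustration,
and the real zio_suspend()/zio_resume() machinery is of course far more
involved.

    /*
     * Toy model: failed I/Os park on a condition variable while the pool
     * is "suspended", and a later "zpool clear" (resume_pool() here)
     * wakes them all up to be reissued.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t spa_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t spa_resume_cv = PTHREAD_COND_INITIALIZER;
    static int spa_suspended = 1;           /* pool starts out suspended */

    /* An I/O that hit a disk error: wait until the pool is resumed. */
    static void *
    suspended_io(void *arg)
    {
            (void) pthread_mutex_lock(&spa_lock);
            while (spa_suspended)
                    (void) pthread_cond_wait(&spa_resume_cv, &spa_lock);
            (void) pthread_mutex_unlock(&spa_lock);
            (void) printf("io %ld reissued after resume\n",
                (long)(uintptr_t)arg);
            return (NULL);
    }

    /* Stand-in for "zpool clear": mark the pool healthy, wake all waiters. */
    static void
    resume_pool(void)
    {
            (void) pthread_mutex_lock(&spa_lock);
            spa_suspended = 0;
            (void) pthread_cond_broadcast(&spa_resume_cv);
            (void) pthread_mutex_unlock(&spa_lock);
    }

    int
    main(void)
    {
            pthread_t tid[3];
            long i;

            for (i = 0; i < 3; i++)
                    (void) pthread_create(&tid[i], NULL, suspended_io,
                        (void *)(uintptr_t)i);

            (void) sleep(1);        /* let the I/Os park */
            resume_pool();          /* the "zpool clear" */

            for (i = 0; i < 3; i++)
                    (void) pthread_join(tid[i], NULL);
            return (0);
    }

Until resume_pool() runs, the three threads just sit there, which matches
what I see on a real system: everything that joins the wait stays stuck
until zpool clear.
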
Here are my questions:
1. I have a disk with a single-disk (non-redundant, raid0-style) pool on
it. Once I import the pool on that disk, zpool commands get stuck, and mdb
shows that spa_sync is waiting for a zio in spa_history_write. This does
not look like the suspended case, so how can a zio simply never return? Is
this a bug? (For the backtrace, please see http://pastebin.com/zdsatkL3.)
2. I also tested with the disk pulled out, and spa_sync got stuck in
dsl_pool_sync. Does that mean not all zios end up waiting in
spa_suspend_zio_root? How does zpool clear fix this?
3. I know there is a failmode setting; which one should we use? I don't
think hanging all zpool commands is a good idea: with several pools on the
system, one failed pool keeps the other pools from responding to
operations. (My current reading of the three failmode values is sketched
below.)
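
For reference, here is how I currently read the three values of the pool's
failmode property (wait, the default, plus continue and panic), written out
as a toy decision function. The enum and handle_pool_failure() are just
illustration, not the in-kernel ZIO_FAILURE_MODE_* code paths, so please
correct me if the semantics below are off.

    #include <stdio.h>
    #include <stdlib.h>

    /* My understanding of "zpool set failmode=..." semantics. */
    typedef enum {
            FAILMODE_WAIT,          /* default: suspend I/O until 'zpool clear' */
            FAILMODE_CONTINUE,      /* EIO to new writes, reads from healthy vdevs still work */
            FAILMODE_PANIC          /* crash the system on catastrophic pool failure */
    } failmode_t;

    static void
    handle_pool_failure(failmode_t fm)
    {
            switch (fm) {
            case FAILMODE_WAIT:
                    (void) printf("suspend zios and block until 'zpool clear'\n");
                    break;
            case FAILMODE_CONTINUE:
                    (void) printf("return EIO to new writes, keep serving reads\n");
                    break;
            case FAILMODE_PANIC:
                    (void) printf("panic / crash dump\n");
                    abort();        /* stand-in for a kernel panic */
            }
    }

    int
    main(void)
    {
            handle_pool_failure(FAILMODE_WAIT);
            handle_pool_failure(FAILMODE_CONTINUE);
            /* handle_pool_failure(FAILMODE_PANIC) would abort the process */
            return (0);
    }

From the zpool man page, failmode is a per-pool property, but I am not sure
whether continue also avoids the spa_namespace_lock hang that blocks the
other pools; that is really the heart of question 3.
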
Can someone share more details about zio error handling in ZFS? I am not
sure whether my understanding is correct, or whether there are bugs on my
current system.
Thanks in advance.