Discussion:
Race condition between vdev_disk_io_start and vdev_disk_off_notify.
Zakharov, Pavel via illumos-zfs
2014-06-06 19:46:56 UTC
Currently, when a disk is offlined unexpectedly, there seems to be a race condition between vdev_disk_io_start and vdev_disk_off_notify that can result in a kernel panic:

panic[cpu16]/thread=ffffff0175367c40:
BAD TRAP: type=e (#pf Page fault) rp=ffffff01753677b0 addr=8 occurred in module "genunix" due to a NULL pointer dereference

ffffff0175367680 unix:die+dd ()
ffffff01753677a0 unix:trap+17db ()
ffffff01753677b0 unix:cmntrap+e6 ()
ffffff01753678c0 genunix:bdev_strategy+3a ()
ffffff01753678f0 genunix:ldi_strategy+59 ()
ffffff0175367930 zfs:vdev_disk_io_start+eb ()
ffffff0175367970 zfs:zio_vdev_io_start+1ea ()
ffffff01753679a0 zfs:zio_execute+8d ()
ffffff01753679c0 zfs:zio_nowait+42 ()
ffffff0175367a20 zfs:vdev_probe+194 ()
ffffff0175367a50 zfs:zio_vdev_io_done+e8 ()
ffffff0175367a80 zfs:zio_execute+8d ()
ffffff0175367b30 genunix:taskq_thread+285 ()
ffffff0175367b40 unix:thread_start+8 ()

vdev_disk_io_start checks at the beginning for (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)), and then calls ldi_strategy(dvd->vd_lh, bp).
However, there is no synchronization between the two paths: if a vdev_disk_off_notify event arrives in between the check and the call, it invokes vdev_disk_close, which closes and frees the LDI handle, so the subsequent ldi_strategy call is handed a stale handle.
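Roughly, the window looks like this (a simplified sketch of the I/O path, not the actual source):

    /* vdev_disk_io_start(), simplified */
    if (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)) {
            zio->io_error = SET_ERROR(ENXIO);
            /* ... bail out of the pipeline ... */
    }
    /*
     * Window: if vdev_disk_off_notify() runs here, it calls
     * vdev_disk_close(), which ldi_close()s the handle and clears
     * dvd->vd_lh, so the call below is handed a stale/NULL handle.
     */
    error = ldi_strategy(dvd->vd_lh, bp);   /* panics down in bdev_strategy() */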
Here are stack traces showing where vdev_disk_off_notify is called from:

iSCSI:
vdev_disk_off_notify:entry
genunix`ldi_invoke_notify+0x13e
genunix`e_ddi_offline_notify+0xf4
genunix`devi_detach_node+0x68
genunix`ndi_devi_offline+0x16d
genunix`i_mdi_pi_state_change+0x4a7
genunix`mdi_pi_offline+0x3e
iscsi`iscsi_lun_offline+0x1a6
iscsi`iscsi_sess_offline_luns+0x40
iscsi`iscsi_sess_state_failed+0xbd
iscsi`iscsi_sess_state_machine+0x141
iscsi`iscsi_login_end+0xd2
iscsi`iscsi_login_start+0x284
genunix`taskq_thread+0x285
unix`thread_start+0x8

SAS:
vdev_disk_off_notify:entry
genunix`ldi_invoke_notify+0x13e
genunix`e_ddi_offline_notify+0xf4
genunix`devi_detach_node+0x68
genunix`ndi_devi_offline+0x16d
genunix`i_mdi_pi_state_change+0x4a7
genunix`mdi_pi_offline+0x3e
mpt_sas`mptsas_offline_lun+0xb5
mpt_sas`mptsas_offline_target+0x155
mpt_sas`mptsas_handle_topo_change+0x4c7
mpt_sas`mptsas_handle_dr+0x1f7
genunix`taskq_thread+0x285
unix`thread_start+0x8

As a quick workaround, we were thinking of putting an RW lock into vdev_disk_io_start and vdev_disk_off_notify, but there is probably a better solution.
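Something along these lines is what we had in mind (hypothetical sketch only; vd_lh_lock would be a new field added to vdev_disk_t, it does not exist today; error handling elided):

    /* vdev_disk_io_start(): hold the new lock across every use of vd_lh */
    rw_enter(&dvd->vd_lh_lock, RW_READER);
    if (dvd->vd_lh != NULL)
            error = ldi_strategy(dvd->vd_lh, bp);   /* submission only, async completion */
    else
            error = SET_ERROR(ENXIO);
    rw_exit(&dvd->vd_lh_lock);

    /* vdev_disk_off_notify(): take it as writer so the close can't interleave */
    rw_enter(&dvd->vd_lh_lock, RW_WRITER);
    dvd->vd_ldi_offline = B_TRUE;
    vdev_disk_close(vd);            /* closes and clears dvd->vd_lh */
    rw_exit(&dvd->vd_lh_lock);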




Matthew Ahrens via illumos-zfs
2014-06-06 19:58:55 UTC
I think that vdev_disk_off_notify() should grab the spa config lock, and
then call vdev_close(), which verifies that the lock is held. It should be
safe to grab the spa config lock from those stack traces.
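
Roughly, something like this in the notify callback (untested sketch; SCL_STATE_ALL is what vdev_close() expects to be held as writer):

    /* in vdev_disk_off_notify(), instead of calling vdev_disk_close() directly */
    spa_t *spa = vd->vdev_spa;

    spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_WRITER);
    vdev_close(vd);         /* asserts the config lock, then closes the LDI handle */
    spa_config_exit(spa, SCL_STATE_ALL, FTAG);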

This code was introduced by:

commit 39cddb10a31c1c2e66aed69e6871d09caa4c8147
Author: Joshua M. Clulow <***@joyent.com>
Date: Wed Sep 11 14:47:59 2013 -0700

4128 disks in zpools never go away when pulled
Reviewed by: Keith Wesolowski <***@joyent.com>
Reviewed by: Bill Pijewski <***@joyent.com>
Reviewed by: Marcel Telka <***@nexenta.com>
Approved by: Eric Schrock <***@delphix.com>

Joshua or Rob, could you take a look at it?

--matt
George Wilson via illumos-zfs
2014-06-06 20:42:28 UTC
As a general rule, we should not call the underlying close routines
(i.e., vdev_disk_close()) directly. These are meant to be called from
vdev_close(), which enforces all the proper locking.
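
For reference, here is roughly the relevant part of vdev_close() (paraphrased, not verbatim): it asserts the config lock and then dispatches to the vdev-type-specific close routine.

    void
    vdev_close(vdev_t *vd)
    {
            spa_t *spa = vd->vdev_spa;

            ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) ==
                SCL_STATE_ALL);
            /* ... */
            vd->vdev_ops->vdev_op_close(vd);    /* vdev_disk_close() for disk vdevs */
            /* ... */
    }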

- George
