Discussion: race between l2arc code in arc_read and l2arc_evict
Andriy Gapon
2013-08-30 12:00:30 UTC
It seems that there is a race between l2arc_evict and the l2arc related code in
arc_read, at least in recent FreeBSD.
We got the following crash:

...
#8 0xffffffff8073f996 in trap (frame=0xfffffe104a076f40) at
/usr/src/sys/amd64/amd64/trap.c:463
#9 0xffffffff80729372 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:232
#10 0xffffffff80d95e21 in arc_read (pio=0xfffff801d4baf3b0,
spa=0xfffff8000fddf000, bp=0xfffffe000e5e8240, done=0x181fd9c00,
private=0xfffff80100000000, priority=0, zio_flags=744616296,
arc_flags=0xfffffe104a0770a0,
zb=0xfffff80100000000) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:3331
#11 0xffffffff80d9d331 in dbuf_read (db=0xfffff801e6e737e0, zio=0x0, flags=10)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:582
#12 0xffffffff80da3ae3 in dmu_buf_hold (os=<value optimized out>, object=<value
optimized out>, offset=<value optimized out>, tag=<value optimized out>,
dbp=0xfffffe104a0771f8, flags=<value optimized out>)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:153
#13 0xffffffff80df6bf9 in zap_lockdir (os=0xfffff8001d258400, obj=17, tx=0x0,
lti=RW_READER, fatreader=1, adding=0, zapp=<value optimized out>)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:467
#14 0xffffffff80df776d in zap_lookup_norm (os=0x0, zapobj=18446744071577351547,
name=0xfffffe104a077490 "mktemp", integer_size=8, num_integers=1,
buf=0xfffffe104a077338, mt=501, realname=0x0, rn_len=0, ncp=0x0)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:788
#15 0xffffffff80df771e in zap_lookup (os=0x0, zapobj=18446744071577351547,
name=0x1f5 <Address 0x1f5 out of bounds>, integer_size=1, num_integers=0, buf=0x0)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:773
#16 0xffffffff80e144f6 in zfs_dirent_lock (dlpp=0xfffffe104a0773b0,
dzp=0xfffff8001ddaa5c0, name=0xfffffe104a077490 "mktemp",
zpp=0xfffffe104a0773a8, flag=6, direntflags=0x0, realpnp=<value optimized out>)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:91
#17 0xffffffff80e1484d in zfs_dirlook (dzp=0xfffff8001ddaa5c0,
name=0xfffffe104a077490 "mktemp", vpp=0xfffffe104a077868, flags=<value optimized
out>, deflg=0x0, rpnp=0x0)
at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:426
#18 0xffffffff80e28b34 in zfs_lookup () at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1500
...

The arc_read code in question is this:
    if (hdr->b_l2hdr->b_compress ==
        ZIO_COMPRESS_EMPTY) {
            rzio = zio_null(pio, spa, vd,
                l2arc_read_done, cb,
                zio_flags | ZIO_FLAG_DONT_CACHE |
                ZIO_FLAG_CANFAIL |
                ZIO_FLAG_DONT_PROPAGATE |
                ZIO_FLAG_DONT_RETRY);
    } else {
            rzio = zio_read_phys(pio, vd, addr,
                hdr->b_l2hdr->b_asize,
                buf->b_data, ZIO_CHECKSUM_OFF,
                l2arc_read_done, cb, priority,
                zio_flags | ZIO_FLAG_DONT_CACHE |
                ZIO_FLAG_CANFAIL |
                ZIO_FLAG_DONT_PROPAGATE |
                ZIO_FLAG_DONT_RETRY, B_FALSE);
    }
    DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
        zio_t *, rzio);
    ARCSTAT_INCR(arcstat_l2_read_bytes,
        hdr->b_l2hdr->b_asize);

The crash is on the ARCSTAT_INCR line while dereferencing hdr->b_l2hdr.

It's apparent that hdr->b_l2hdr had to be non-NULL for this code path to be
taken. At the moment of the crash it is NULL though:

(kgdb) p *(arc_buf_hdr_t*)0xfffff8001d7cfe58
$6 = {b_dva = {dva_word = {6, 394971608}}, b_birth = 827690, b_cksum0 =
590276882470, b_freeze_lock = {lock_object = {lo_name = 0xffffffff80e84eb6
"buf->b_freeze_lock", lo_flags = 40960000, lo_data = 0, lo_witness = 0x0},
sx_lock = 1}, b_freeze_cksum = 0xfffff8001d0d7e20, b_thawed = 0x0,
b_hash_next = 0x0, b_buf = 0xfffff80e2c61f168, b_flags = 132640, b_datacnt = 1,
b_acb = 0xfffff80f89ba3680, b_cv = {
cv_description = 0xffffffff80e84eab "buf->b_cv", cv_waiters = 0}, b_type =
ARC_BUFC_METADATA, b_size = 8704, b_spa = 2586324942190472777, b_state =
0xffffffff80eaf080, b_arc_node = {list_next = 0x0, list_prev = 0x0},
b_arc_access = 138774732, b_refcnt = {rc_count = 1}, b_l2hdr = 0x0, b_l2node =
{list_next = 0x0, list_prev = 0x0}}

It seems that b_flags includes the following bits: ARC_IN_HASH_TABLE,
ARC_IO_IN_PROGRESS, ARC_L2_EVICTED.
As far as I can see, it is l2arc_evict that sets ARC_L2_EVICTED and resets
b_l2hdr to NULL, and it looks like it does that under the protection of
hash_lock and l2arc_buflist_mtx.
On the other hand, the arc_read code in question seems to run without holding
either of these locks (or any other ARC-level lock).
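
To make the window explicit, here is a trivial userland mock-up of the pattern
(this is not ZFS code; hdr_t, l2hdr_t, reader() and evictor() are made-up
stand-ins for the arc_read and l2arc_evict sides):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct l2hdr { long b_asize; } l2hdr_t;
    typedef struct hdr { l2hdr_t *b_l2hdr; } hdr_t;

    static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
    static hdr_t hdr;

    /* stand-in for the l2arc_evict side: clears b_l2hdr under hash_lock */
    static void evictor(void)
    {
            pthread_mutex_lock(&hash_lock);
            free(hdr.b_l2hdr);
            hdr.b_l2hdr = NULL;
            pthread_mutex_unlock(&hash_lock);
    }

    /* stand-in for the arc_read side */
    static void reader(void)
    {
            pthread_mutex_lock(&hash_lock);
            int l2_hit = (hdr.b_l2hdr != NULL);  /* decision made under the lock */
            pthread_mutex_unlock(&hash_lock);

            evictor();  /* eviction slips into the window after mutex_exit */

            /* a later dereference of hdr.b_l2hdr here is the panic we saw */
            if (l2_hit && hdr.b_l2hdr == NULL)
                    printf("stale L2ARC-hit decision: b_l2hdr is now NULL\n");
    }

    int main(void)
    {
            hdr.b_l2hdr = calloc(1, sizeof (l2hdr_t));
            reader();
            return 0;
    }

Run as-is it prints the message every time, because the "eviction" is forced
into the window; in the kernel the same window is just much narrower.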

What do you think?
Thank you!
--
Andriy Gapon
Boris Protopopov
2013-08-30 13:33:01 UTC
Hi Andy,

Looks like an accurate analysis to me. We saw something similar earlier, with a
use-after-free resulting from a race on the evict lock. I am working on some
locking rework for the ARC and can include that fix along with others, if
you'd like.

Boris

Typos courtesy of my iPhone
Post by Andriy Gapon
It seems that there is a race between l2arc_evict and the l2arc related code in
arc_read, at least in recent FreeBSD.
...
What do you think?
Thank you!
Saso Kiselkov
2013-08-30 13:35:46 UTC
Post by Andriy Gapon
It seems that there is a race between l2arc_evict and the l2arc related code in
arc_read, at least in recent FreeBSD.
...
What do you think?
Thank you!
Hi Andriy,

You are right. It appears this was in the L2ARC right from the start - I
have no idea how this could have persisted for so long.

As Boris said, he's already reworking some of the locking logic in the
ARC, so we should coordinate on the fix with him to make sure we're not
conflicting somewhere.

Cheers,
--
Saso
Boris Protopopov
2013-08-30 18:49:06 UTC
Hi, guys,

so, I took a closer look at this issue, and it appears that
l2arc_read_done() knows how to deal with L2ARC reads invalidated "in
flight". As to the specific locking issue, the simplest arc_read() fix is
to perform the

    (illumos-gate arc.c, around lines 3070-3073)

    if (hdr->b_l2hdr != NULL &&
        !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
        !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
            l2arc_read_callback_t *cb;
            ...

check and to save the b_l2hdr->b_asize and b_l2hdr->b_compress in local
vars while still under the lock. This will allow one to avoid
de-referencing the potentially zeroed out l2hdr pointer in this code path.
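
Roughly along these lines - just an untested sketch of the idea, not a drop-in
patch (the cb setup and the surrounding control flow are omitted, and the
l2_hit local is made up for illustration):

    boolean_t l2_hit = B_FALSE;
    uint64_t b_asize = 0;
    enum zio_compress b_compress = ZIO_COMPRESS_OFF;

    /* still under hash_lock: decide, and capture what we need */
    if (hdr->b_l2hdr != NULL &&
        !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
        !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
            l2_hit = B_TRUE;
            b_asize = hdr->b_l2hdr->b_asize;
            b_compress = hdr->b_l2hdr->b_compress;
    }
    mutex_exit(hash_lock);

    if (l2_hit) {
            if (b_compress == ZIO_COMPRESS_EMPTY) {
                    rzio = zio_null(pio, spa, vd,
                        l2arc_read_done, cb,
                        zio_flags | ZIO_FLAG_DONT_CACHE |
                        ZIO_FLAG_CANFAIL |
                        ZIO_FLAG_DONT_PROPAGATE |
                        ZIO_FLAG_DONT_RETRY);
            } else {
                    rzio = zio_read_phys(pio, vd, addr,
                        b_asize, buf->b_data, ZIO_CHECKSUM_OFF,
                        l2arc_read_done, cb, priority,
                        zio_flags | ZIO_FLAG_DONT_CACHE |
                        ZIO_FLAG_CANFAIL |
                        ZIO_FLAG_DONT_PROPAGATE |
                        ZIO_FLAG_DONT_RETRY, B_FALSE);
            }
            ARCSTAT_INCR(arcstat_l2_read_bytes, b_asize);
    }

The point being that after mutex_exit(hash_lock) nothing dereferences
hdr->b_l2hdr anymore, and l2arc_read_done() already deals with the case where
the buffer gets evicted while the read is in flight.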

What do you guys think?

Boris.
Post by Andriy Gapon
It seems that there is a race between l2arc_evict and the l2arc related code in
arc_read, at least in recent FreeBSD.
...
--
Best regards,

Boris Protopopov
Nexenta Systems

455 El Camino Real, Santa Clara, CA 95050

[d] 408.791.3366 | [c] 978.621.6901
Skype: bprotopopov



Jim Klimov
2013-08-30 22:00:51 UTC
Post by Boris Protopopov
Hi, guys,
so, I took a closer look at this issue, and it appears that
l2arc_read_done() knows how to deal with L2ARC reads invalidated "in
flight". As to the specific locking issue, the simplest arc_read() fix
is to perform the
... *if* (hdr...
check and to save the b_l2hdr->b_asize and b_l2hdr->b_compress in local
vars while still under the lock. This will allow one to avoid
de-referencing the potentially zeroed out l2hdr pointer in this code path.
What do you guys think ?
Just for a little sanity check: what happens in such an invalidation?

So we have a stashed pointer to potentially random content in RAM,
which may still contain the data, or be securely wiped, or be reused
by another program. Possibly, a plain attempt to access it and
dereference the pointer might cause another exception (like NX-bit or
whatever, though these are likely not in place for the Kernel Almighty)?

If access to the data does not cause an error, we would get some data
which may or may not pass the checksum test; if it does not, is this
treated like an L2ARC error (re-read from the main pool disks), or is it
already considered a RAM ARC access (currently without checksum
validation, IIRC)?

I believe it would be of little use to test the original pointer for
NULLness just before an attempt to dereference it? That is, if it is
NULL we can skip the dereferencing, and if it's not, we can't guarantee
that the location is not freed by the time this thread gets to the next
instruction (the dereference, perhaps of the stashed pointer), unless
we use a lock, which we are trying to avoid, right?

I'm just wrapping my head around this, so sorry if these questions seem
trivial on the verge of an insult - they are not meant to be so ;)

Thanks,
//Jim
Boris Protopopov
2013-08-30 22:19:35 UTC
I do not believe there was any suggestion to stash pointers.
Please refer to the bug description; it spells out what is being stashed.

Boris.
Post by Jim Klimov
Just for a little sanity check: what happens in such invalidation?
...
--
Best regards,

Boris Protopopov
Nexenta Systems

455 El Camino Real, Santa Clara, CA 95050

[d] 408.791.3366 | [c] 978.621.6901
Skype: bprotopopov



Jim Klimov
2013-08-30 22:43:12 UTC
Post by Boris Protopopov
I do not believe there was any suggestion to stash pointers.
Please refer to the bug description, it spells out what is being stashed.
Boris.
Huh, right... Sorry for the noise; I was confused by the phrase that
we're trying to avoid dereferencing a NULL pointer, and assumed that
we would also stash a copy of its old (non-NULL) value.

//Jim
Brendan Gregg
2013-08-30 19:35:16 UTC
Was this introduced in the L2ARC compression wad? Previously, the size was
stashed before the mutex_exit(hash_lock), which should work again (as Boris
said).

Brendan
Post by Andriy Gapon
It seems that there is a race between l2arc_evict and the l2arc related code in
arc_read, at least in recent FreeBSD.
...
--
Brendan Gregg, Joyent http://dtrace.org/blogs/brendan



Saso Kiselkov
2013-08-30 22:21:05 UTC
Post by Brendan Gregg
Was this introduced in the L2ARC compression wad? Previously size was
stashed before the mutex_exit(hash_lock). Which should work again (as
Boris said).
Yep, rereading the code, it was me who introduced this. When I wrote
that bit I assumed that we were holding the hash lock here. I understand
it was a mistake to make that assumption without checking, but I see no
reason why we drop the hash lock before issuing l2arc reads... What was the
rationale here?

Cheers,
--
Saso
Boris Protopopov
2013-08-30 23:24:34 UTC
My guess is that the authors did not want to hold locks during I/O operations, which could throttle l2arc eviction.

Typos courtesy of my iPhone
Post by Saso Kiselkov
...
What was the rationale here?
Saso Kiselkov
2013-08-30 23:32:52 UTC
Post by Boris Protopopov
My guess the authors did not want to hold locks during IO operations, which could throttle l2arc eviction
zio_read does not block. zio_wait does.
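From memory, the tail of that path looks roughly like this (paraphrasing, not
quoting arc.c verbatim) - issuing the read itself doesn't block; we only wait
when the caller asked for a synchronous read:

    if (*arc_flags & ARC_WAIT)
            return (zio_wait(rzio));

    ASSERT(*arc_flags & ARC_NOWAIT);
    zio_nowait(rzio);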
--
Saso
Boris Protopopov
2013-08-31 00:37:50 UTC
Yes, that is true.

Still, heap allocations can block, and blocking while holding locks is not advisable :)
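For instance, if I remember right, this path allocates its read callback with
a sleeping allocation, roughly:

    cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);

and a KM_SLEEP allocation may block under memory pressure.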

Typos courtesy of my iPhone
Post by Saso Kiselkov
Post by Boris Protopopov
My guess the authors did not want to hold locks during IO operations, which could throttle l2arc eviction
zio_read does not block. zio_wait does.
--
Saso
Brendan Gregg
2013-09-02 19:48:59 UTC
From memory, it was to keep the hold duration minimal and simple. Holding the hash lock while calling into zio isn't simple. There were some possible nasty deadlocks involving various locks and L2ARC device removal, but I don't remember if this location was an issue.
Brendan

Sent from my iPhone
My guess the authors did not want to hold locks during IO operations, which could throttle l2arc eviction
...