Ying Zhu
2013-07-04 12:23:42 UTC
Hello list,
I filed this bug against illumos-gate
(https://www.illumos.org/issues/3852)
but got no response, so I'm sending it to the illumos zfs mailing list
as well.
The bug was first found on the Linux port of ZFS, namely ZFS on
Linux (ZOL).
Based on the comments in arc.c, a buffer can exist in both arc and l2arc
at the same time, in which case both an arc_buf_hdr_t and an
l2arc_buf_hdr_t are allocated for it. However, the current logic only
accounts for the memory that the l2arc_buf_hdr_t takes up when the
buffer's state changes to or from arc_l2c_only. This causes a noticeable
deviation in illumos's zfs, since its sizeof(l2arc_buf_hdr_t) is larger
than ZOL's. We can implement the calculation in the following simple way:
1. When we allocate an l2arc_buf_hdr_t we add its memory consumption
immediately, and subtract it when we free or evict the l2arc buf.
2. According to the code in l2arc_hdr_stat_add and l2arc_hdr_stat_remove,
if the buffer stays only in l2arc we should also count the memory its
arc_buf_hdr_t consumes. So we only need to add HDR_SIZE to
arcstat_l2_hdr_size, since L2HDR_SIZE was already accounted for in
step 1. The same applies, with the sign reversed, when an arc buf leaves
the l2arc-only state.
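The two steps above can be sketched as a small user-space model. This is
only an illustration and not part of the patch: hdr_stats_t and the
model_* functions are hypothetical names, the two size constants are
placeholder values, and the two counters merely stand in for
arcstat_hdr_size and arcstat_l2_hdr_size.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder sizes for illustration; the real code uses sizeof(). */
#define MODEL_HDR_SIZE   ((int64_t)200)  /* assumed, not the real HDR_SIZE */
#define MODEL_L2HDR_SIZE ((int64_t)16)   /* the ZOL value quoted in the post */

/* Stand-ins for the arcstat_hdr_size and arcstat_l2_hdr_size counters. */
typedef struct hdr_stats {
	int64_t hdr_size;
	int64_t l2_hdr_size;
} hdr_stats_t;

/* An arc_buf_hdr_t is allocated: charged to hdr_size. */
static inline void
model_hdr_alloc(hdr_stats_t *s)
{
	s->hdr_size += MODEL_HDR_SIZE;
}

/*
 * Step 1: charge the l2arc header as soon as it is allocated.
 * Assumption: this mirrors arc_space_consume(L2HDR_SIZE, ARC_SPACE_L2HDRS).
 */
static inline void
model_l2hdr_alloc(hdr_stats_t *s)
{
	s->l2_hdr_size += MODEL_L2HDR_SIZE;
}

/* Step 1: return the charge when the l2arc header is freed or evicted. */
static inline void
model_l2hdr_free(hdr_stats_t *s)
{
	s->l2_hdr_size -= MODEL_L2HDR_SIZE;
	assert(s->l2_hdr_size >= 0);
}

/* Step 2: entering arc_l2c_only moves only the arc header's size. */
static inline void
model_enter_l2c_only(hdr_stats_t *s)
{
	s->l2_hdr_size += MODEL_HDR_SIZE;
	s->hdr_size -= MODEL_HDR_SIZE;
	assert(s->hdr_size >= 0);
}

/* Step 2: leaving arc_l2c_only moves it back. */
static inline void
model_leave_l2c_only(hdr_stats_t *s)
{
	s->l2_hdr_size -= MODEL_HDR_SIZE;
	s->hdr_size += MODEL_HDR_SIZE;
	assert(s->l2_hdr_size >= 0);
}
```

In this model a buffer that is cached in l2arc but never enters
arc_l2c_only still contributes MODEL_L2HDR_SIZE to l2_hdr_size, which is
the case the current accounting misses.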
The test box has two 4-core Intel Xeon CPUs (2.13GHz) and 16GB of memory,
and runs Linux. The tests were set up in the following way:
1. Used fdisk to split a SATA disk into two partitions, one for zpool
storage and the other as the cache device.
2. Generated files occupying 14GB altogether in the zpool prepared in
step 1, using iozone.
3. Read them all using md5sum and watched the l2arc related statistics in
/proc/spl/kstat/zfs/arcstats. After the reads finished, l2_size and
l2_hdr_size were shown like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying the attached patch and rerunning steps 1-3, the
results were as follows:
l2_size 4 4306443264
l2_hdr_size 4 535600
These numbers make sense. On 64-bit systems sizeof(l2arc_buf_hdr_t) on
ZOL is 16 bytes. Assuming all blocks cached by l2arc are 128KB,
535600/16*128*1024=4387635200. Since not all blocks are a full 128KB,
this theoretical result is a little bigger than the reported l2_size, as
we can see.
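The estimate above can be reproduced with a few lines of C. This is just
a sketch: l2_size_estimate is a hypothetical helper, and the uniform
block size is the same simplifying assumption made in the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of cached data implied by l2_hdr_size, assuming one l2arc header
 * of l2hdr_size bytes per cached block and a uniform block_size.
 */
static inline uint64_t
l2_size_estimate(uint64_t l2_hdr_size, uint64_t l2hdr_size,
    uint64_t block_size)
{
	assert(l2hdr_size != 0);
	return ((l2_hdr_size / l2hdr_size) * block_size);
}
```

With the values from the patched run, l2_size_estimate(535600, 16,
128 * 1024) gives 4387635200, against a reported l2_size of 4306443264.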
Since I'm familiar with the systemtap instrumentation tool (dtrace for
Linux), I used it to examine what had happened. The script looked like
this:
probe module("zfs").function("arc_change_state")
{
    if ($new_state == $arc_l2c_only)
        printf("change arc buf to arc_l2c_only\n")
}
It prints a message each time arc_change_state is called with the
argument new_state equal to arc_l2c_only.
I gathered the trace logs and found that none of the arc bufs entered
the arc_l2c_only state while the tests were running, which is why
l2_hdr_size in step 3 was 0. The arc bufs only fell into arc_l2c_only
when the pool or the filesystem was offlined.
For your convenience I've put the patch's content inline below. Could
anyone spend some time looking into it?
diff --git a/usr/src/uts/common/fs/zfs/arc.c b/usr/src/uts/common/fs/zfs/arc.c
index 7072a17..8a46a2b 100644
--- a/usr/src/uts/common/fs/zfs/arc.c
+++ b/usr/src/uts/common/fs/zfs/arc.c
@@ -1579,6 +1579,7 @@ arc_hdr_destroy(arc_buf_hdr_t *hdr)
ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
ARCSTAT_INCR(arcstat_l2_asize, -l2hdr->b_asize);
kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
+ arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
if (hdr->b_state == arc_l2c_only)
l2arc_hdr_stat_remove();
hdr->b_l2hdr = NULL;
@@ -3360,6 +3361,7 @@ arc_release(arc_buf_t *buf, void *tag)
ARCSTAT_INCR(arcstat_l2_asize, -l2hdr->b_asize);
list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
+ arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
ARCSTAT_INCR(arcstat_l2_size, -buf_size);
mutex_exit(&l2arc_buflist_mtx);
}
@@ -4041,14 +4043,14 @@ l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
static void
l2arc_hdr_stat_add(void)
{
- ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
+ ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE);
ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
}
static void
l2arc_hdr_stat_remove(void)
{
- ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
+ ARCSTAT_INCR(arcstat_l2_hdr_size, -HDR_SIZE);
ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
}
@@ -4200,6 +4202,7 @@ l2arc_write_done(zio_t *zio)
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
+ arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
@@ -4454,6 +4457,7 @@ top:
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
+ arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
list_remove(buflist, ab);
@@ -4600,6 +4604,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
l2hdr = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
l2hdr->b_dev = dev;
ab->b_flags |= ARC_L2_WRITING;
+ arc_space_consume(L2HDR_SIZE, ARC_SPACE_L2HDRS);
/*
* Temporarily stash the data buffer in b_tmp_cdata.
--
Thanks,
Ying Zhu
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com