Andriy Gapon
2013-07-22 16:16:47 UTC
First, a note that this report is about ZFS on FreeBSD.
Second, a note that the following is based on a post-mortem analysis of a system
that was forcefully panicked after it apparently hung.
I found many threads blocked in txg_wait_open, while the txg_sync thread appeared
to be in a good state (not blocked on anything and not sleeping for too long).
There was apparently a spike in memory usage before the hang. The system has 64GB
of RAM, but the ARC size was at 12GB (its size was not capped). There was "only"
2GB of available system memory.
I also noticed that arcstat_memory_throttle_count was more than 10^9.
I see that the amount of available memory was large enough not to trigger either
the page daemon or the ARC reclaim thread.
On the other hand, looking at the values that are used to calculate anon_size here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#3625
and the values used in the calculations here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#3576
I see that the adjusted available_memory adds up to ~5GB while inflight_data
(equal to arc_anon->arcs_size) is ~1.5GB, and so
inflight_data > available_memory / 4
So it appears that the ARC was stuck in a steady state where it would deny all
arc_tempreserve_space requests because arc_memory_throttle kept returning
ERESTART, while at the same time there was no pressure to change the ARC size or
composition, and no overall kernel memory pressure either.
Some raw numbers:
(kgdb) p dp->dp_throughput
$27 = 1979096
(kgdb) p dp->dp_write_limit
$28 = 33554432
(kgdb) p zfs_write_limit_min
$29 = 33554432
(kgdb) p zfs_txg_synctime_ms
$30 = 1000
(kgdb) p dp->dp_throughput * zfs_txg_synctime_ms
$31 = 1979096000
(kgdb) p zfs_write_limit_inflated
$36 = 205978398720
We can see that dp_write_limit was at its minimum possible value.
(kgdb) p physmem
$37 = 16762565 # in pages
(kgdb) p cnt.v_free_reserved + cnt.v_cache_min
$55 = 452368
(kgdb) p cnt.v_free_count + cnt.v_cache_count
$56 = 475444 # this is available memory in pages, analogous to 'freemem'
v_free_count + v_cache_count is greater than v_free_reserved + v_cache_min, so
there was no paging and arc_reclaim_needed() returned false. (This is FreeBSD VM
logic, which the FreeBSD port of the ARC code borrows as well.)
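In other words, the gate that would normally force the ARC to shrink reduces to
the following comparison (a paraphrase using the FreeBSD names of that era as I
recall them, not verbatim code):

/*
 * Paging (and hence ARC reclaim) is considered necessary only when
 * free + cache pages drop below the reserve.
 */
static int
paging_needed(void)
{
	return (cnt.v_free_count + cnt.v_cache_count <
	    cnt.v_free_reserved + cnt.v_cache_min);
}

Here 475444 >= 452368, so neither the page daemon nor the ARC reclaim thread is
woken up.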
(kgdb) p arc_stats.arcstat_c.value.ui64
$42 = 12689616774
(kgdb) p arc_stats.arcstat_c_max.value.ui64
$43 = 65504890880
(kgdb) p arc_stats.arcstat_size.value.ui64
$44 = 12688714872
(kgdb) p arc_stats.arcstat_c_min.value.ui64
$45 = 8188111360
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
(kgdb) p arc_loaned_bytes
$47 = 0
(kgdb) p arc_stats.arcstat_memory_throttle_count.value.ui64
$48 = 1110265877
(kgdb) p arc_tempreserve
$49 = 0
(kgdb) p arc_mru->arcs_lsize[ARC_BUFC_DATA] +
arc_mru->arcs_lsize[ARC_BUFC_METADATA] + arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
arc_mfu->arcs_lsize[ARC_BUFC_METADATA]
$58 = 3202388992
So, available_memory is 4096 * 475444 + 3202388992 == 5149807616.
This number divided by 4 is 1287451904, which is less than inflight_data;
inflight_data equals arc_anon->arcs_size (because arc_loaned_bytes is zero),
i.e. 1541423104.
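For completeness, here are the same numbers plugged into the throttle condition
(just an arithmetic sanity check, not real kernel code):

uint64_t freemem_bytes    = 4096ULL * 475444;           /* 1947418624 bytes */
uint64_t evictable        = 3202388992ULL;              /* mru+mfu lsize sum */
uint64_t available_memory = freemem_bytes + evictable;  /* 5149807616 */
uint64_t inflight_data    = 1541423104ULL;              /* arc_anon->arcs_size */
/* 1541423104 > 5149807616 / 4 == 1287451904, so ERESTART every time. */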
I would appreciate any suggestions on how to fix this problem, as well as on how
to work around it. Perhaps there should be some nudge to the reclaim code whenever
arc_tempreserve_space does not succeed...
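To make that suggestion concrete, I mean something along these lines (purely
illustrative and untested; arc_reclaim_thr_cv is the condition variable that the
reclaim thread sleeps on, and whether a simple wake-up would actually be enough,
given that arc_reclaim_needed() is false in this state, is exactly the open
question):

	if (inflight_data > available_memory / 4) {
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		/*
		 * Hypothetical nudge: wake the reclaim thread so that the
		 * ARC gets a chance to shrink instead of sitting in the
		 * steady state described above.
		 */
		cv_signal(&arc_reclaim_thr_cv);
		return (ERESTART);
	}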
Also, maybe there are additional things that I should examine to either confirm
or refute my conclusions. After all, I am not 100% sure of them.
P.S.
I think that this issue is regularly reproduced within a few days by our
stress-testing that models particular workloads.
--
Andriy Gapon