Discussion:
curious / edge case in arc_memory_throttle
Andriy Gapon
2013-07-22 16:16:47 UTC
Permalink
First, a note that this report is about ZFS / FreeBSD.
Second, a note that the following is based on a post-mortem analysis of a system
that was forcibly panicked while it was apparently hung.

I found many threads in txg_wait_open while txg_sync thread appeared to be in a
good state (not blocked on anything and not sleeping for too long).
There was apparently some spike in memory usage before the hang. The system had
64GB of RAM, but the ARC size was at 12GB (it was not capped), and there was
"only" 2GB of available system memory.
I also noticed that arcstat_memory_throttle_count was more than 10^9.

I see that the amount of available memory was still large enough not to trigger
either the page daemon or the ARC reclaim thread.

On the other hand, looking at the values that are used to calculate anon_size here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#3625
and the values used in calculations here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#3576
I see that the adjusted available_memory adds up to ~5GB, while inflight_data
(equal to arc_anon->arcs_size) is ~1.5GB, so that
inflight_data > available_memory / 4

So it appears that the ARC was stuck in some sort of steady state where it would
deny all arc_tempreserve_space requests, because arc_memory_throttle would return
ERESTART. On the other hand, there was no pressure to change the ARC size or
composition, nor any overall kernel memory pressure.
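
For reference, the check in question boils down to roughly the following (a
simplified paraphrase of the arc.c code linked above, not a verbatim copy; on
FreeBSD 'freemem' effectively maps to v_free_count + v_cache_count):

static int
arc_memory_throttle(uint64_t reserve, uint64_t inflight_data, uint64_t txg)
{
	uint64_t available_memory = ptob(freemem);

	/*
	 * Evictable MRU/MFU data is counted as "available" as long as the
	 * ARC is above its minimum size.
	 */
	if (arc_size > arc_c_min) {
		uint64_t evictable_memory =
		    arc_mru->arcs_lsize[ARC_BUFC_DATA] +
		    arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
		    arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
		    arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
		available_memory += MIN(evictable_memory, arc_size - arc_c_min);
	}

	/*
	 * If the dirty (anonymous) data already in flight exceeds a quarter
	 * of what is considered available, refuse the reservation.
	 */
	if (inflight_data > available_memory / 4) {
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		return (ERESTART);
	}
	return (0);
}

Note that nothing in this path actually evicts anything; the evictable data is
merely counted, which seems to be exactly why the state can persist.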

Some raw numbers:
(kgdb) p dp->dp_throughput
$27 = 1979096
(kgdb) p dp->dp_write_limit
$28 = 33554432
(kgdb) p zfs_write_limit_min
$29 = 33554432
(kgdb) p zfs_txg_synctime_ms
$30 = 1000
(kgdb) p dp->dp_throughput * zfs_txg_synctime_ms
$31 = 1979096000
(kgdb) p zfs_write_limit_inflated
$36 = 205978398720

We can see that dp_write_limit was at its minimum possible value.

(kgdb) p physmem
$37 = 16762565 # in pages

(kgdb) p cnt.v_free_reserved + cnt.v_cache_min
$55 = 452368
(kgdb) p cnt.v_free_count + cnt.v_cache_count
$56 = 475444 # this is available memory in pages, analogous to 'freemem'

v_free_count + v_cache_count is greater than v_free_reserved + v_cache_min, so
there was no paging and arc_reclaim_needed was false. (This is FreeBSD VM logic,
which the FreeBSD port of the ARC code borrows as well.)
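
In other words, the pressure condition I am referring to is roughly this (a
paraphrase of the check described above, not the verbatim FreeBSD code):

	/*
	 * Paging (and hence ARC reclaim on FreeBSD) only kicks in when the
	 * free + cache pages drop below the reserved + cache-min thresholds.
	 */
	if (cnt.v_free_count + cnt.v_cache_count <
	    cnt.v_free_reserved + cnt.v_cache_min) {
		/* memory pressure: page daemon runs, arc_reclaim_needed() is true */
	} else {
		/* no pressure: the ARC is left alone -- the case in this core */
	}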

(kgdb) p arc_stats.arcstat_c.value.ui64
$42 = 12689616774
(kgdb) p arc_stats.arcstat_c_max.value.ui64
$43 = 65504890880
(kgdb) p arc_stats.arcstat_size.value.ui64
$44 = 12688714872
(kgdb) p arc_stats.arcstat_c_min.value.ui64
$45 = 8188111360
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
(kgdb) p arc_loaned_bytes
$47 = 0
(kgdb) p arc_stats.arcstat_memory_throttle_count.value.ui64
$48 = 1110265877
(kgdb) p arc_tempreserve
$49 = 0

(kgdb) p arc_mru->arcs_lsize[ARC_BUFC_DATA] +
arc_mru->arcs_lsize[ARC_BUFC_METADATA] + arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
arc_mfu->arcs_lsize[ARC_BUFC_METADATA]
$58 = 3202388992

So, available_memory is 4096 * 475444 + 3202388992 == 5149807616.
That number divided by 4 is 1287451904, which is less than inflight_data, which
equals arc_anon->arcs_size (because arc_loaned_bytes is zero), i.e. 1541423104.
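
Plugging the dumped values into the check directly (a standalone arithmetic
sketch, not kernel code):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t pagesize = 4096;
	uint64_t freemem_pages = 475444;	/* v_free_count + v_cache_count */
	uint64_t evictable = 3202388992ULL;	/* MRU + MFU evictable lsize */
	uint64_t inflight = 1541423104ULL;	/* arc_anon->arcs_size */
	uint64_t available_memory = pagesize * freemem_pages + evictable;

	printf("available_memory     = %ju\n", (uintmax_t)available_memory);
	printf("available_memory / 4 = %ju\n", (uintmax_t)(available_memory / 4));
	printf("throttle: %s\n",
	    inflight > available_memory / 4 ? "yes (ERESTART)" : "no");
	return (0);
}

This prints 5149807616, 1287451904 and "yes (ERESTART)", matching the numbers
above.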

I would appreciate any suggestions on how to fix this problem, as well as on how
to work around it. Perhaps there should be some nudge to the reclaim code whenever
arc_tempreserve_space does not succeed...
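
Something along these lines is what I have in mind (a purely hypothetical
sketch: arc_kick_reclaim() does not exist, and the reclaim thread's cv/lock
names may differ between illumos and the FreeBSD port):

/*
 * Hypothetical: when arc_memory_throttle() is about to fail the reservation,
 * wake the reclaim thread so that the evictable MRU/MFU data actually gets
 * freed instead of merely being counted as "available".
 */
static void
arc_kick_reclaim(void)
{
	mutex_enter(&arc_reclaim_thr_lock);
	cv_signal(&arc_reclaim_thr_cv);
	mutex_exit(&arc_reclaim_thr_lock);
}

	/* ... in arc_memory_throttle(), before returning ERESTART ... */
	if (inflight_data > available_memory / 4) {
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		arc_kick_reclaim();	/* hypothetical nudge */
		return (ERESTART);
	}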

Also, maybe there are additional things that I should examine to either confirm
or refute my conclusions. After all, I am not 100% sure of them.

P.S.
I think that this issue is regularly reproduced within a few days by our
stress-testing, which models particular workloads.
--
Andriy Gapon
Andriy Gapon
2013-07-25 18:30:48 UTC
Permalink
Post by Andriy Gapon
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
I still wonder what could lead to this large proportion of anonymous ARC.
--
Andriy Gapon
Dan McDonald
2013-07-25 18:51:10 UTC
Permalink
Post by Andriy Gapon
Post by Andriy Gapon
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
I still wonder what could lead to this large proportion of anonymous ARC.
Does this recent fix to illumos-gate:

commit 6e6d5868f52089b9026785bd90257a3d3f6e5ee2
Author: Matthew Ahrens <***@delphix.com>
Date: Fri Jun 7 20:29:06 2013 -0800

3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <***@delphix.com>
Reviewed by: Christopher Siden <***@delphix.com>
Reviewed by: Richard Elling <***@dey-sys.com>
Reviewed by: Will Andrews <***@firepipe.net>
Approved by: Dan McDonald <***@nexenta.com>

have any relevance?

Dan
Andriy Gapon
2013-07-25 20:33:00 UTC
Permalink
Post by Dan McDonald
Post by Andriy Gapon
Post by Andriy Gapon
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
I still wonder what could lead to this large proportion of anonymous ARC.
commit 6e6d5868f52089b9026785bd90257a3d3f6e5ee2
Date: Fri Jun 7 20:29:06 2013 -0800
3805 arc shouldn't cache freed blocks
have any relevance?
This commit is in our tree. So I am not sure if it has relevance or not, given
the context of your question.
--
Andriy Gapon
Dan McDonald
2013-07-25 20:40:21 UTC
Permalink
<SNIP!>
Post by Andriy Gapon
Post by Dan McDonald
commit 6e6d5868f52089b9026785bd90257a3d3f6e5ee2
Date: Fri Jun 7 20:29:06 2013 -0800
3805 arc shouldn't cache freed blocks
have any relevance?
This commit is in our tree. So I am not sure if it has relevance or not,
given the context of your question.
Okay, so you're seeing the problem even with that fix in place. Sorry for
the distraction.

Dan
Will Andrews
2013-07-25 20:50:02 UTC
Permalink
That commit involved blocks in the MRU/MFU that had been freed later.
My understanding is that anonymous blocks are only supposed to be
in-flight (i.e. being modified by a user) blocks.

arcs_size should be the total size of blocks in the given state,
right? ~1.5GB seems like a lot in that context, for most systems.

Andriy: Is that arcs_size figure steady, or steadily increasing? If it is
increasing, perhaps you're seeing ARC buffer leaks -- is your build running
with DEBUG turned on during your stress tests? If not, I suggest
trying that to see what happens.

--Will.
Post by Dan McDonald
Post by Andriy Gapon
Post by Andriy Gapon
(kgdb) p arc_anon->arcs_size
$46 = 1541423104
I still wonder what could lead to this large proportion of anonymous ARC.
commit 6e6d5868f52089b9026785bd90257a3d3f6e5ee2
Date: Fri Jun 7 20:29:06 2013 -0800
3805 arc shouldn't cache freed blocks
have any relevance?
Dan
Andriy Gapon
2013-07-26 06:36:31 UTC
Permalink
Post by Will Andrews
Andriy: Is that arcs_size figure steady or steadily increasing? If
so, perhaps you're seeing ARC buffer leaks -- is your build running
with DEBUG turned on during your stress tests? If not, I suggest
trying that to see what happens.
Will,

unfortunately, the systems in question were not monitored for these things, and
all I have now is a core dump. Also, the builds were not DEBUG builds.
I will try to reproduce the problem in a more controlled fashion. It seems that
I cannot learn much more from the cores alone.

Thank you.
--
Andriy Gapon
Matthew Ahrens
2013-08-01 23:00:24 UTC
Permalink
Post by Will Andrews
That commit involved blocks in the MRU/MFU that had been freed later.
My understanding is that anonymous blocks are only supposed to involve
in-flight (i.e. being modified by an user) blocks.
That's right.

--matt



Matthew Ahrens
2013-08-01 23:10:43 UTC
Permalink
We've also seen problems with these metrics. We have changed the way that
dirty data (anonymous / "inflight" buffers) provides pressure on memory in
DelphixOS and plan to put it up for review to integrate into illumos soon.
This is the most relevant commit:

https://github.com/delphix/delphix-os/commit/e80fef6bbc55f585a7cc106340ad2476ea759ade

--matt
Post by Andriy Gapon
<SNIP!>