Discussion:
ZFS Write Throttle Dirty Data Limit Interaction with Free Memory
Steven Hartland via illumos-zfs
2014-09-14 17:00:57 UTC
Permalink
We've been investigating a problem with stalls on FreeBSD when
using ZFS, and one of the current theories which is producing some
promising results involves the new IO scheduler, specifically
the fact that the dirty data limit is a static limit.

The stalls occur when memory is close to the low water mark around
where paging will be triggered. At this time if there is a burst of
write IO, such as a copy from a remote location, ZFS can rapidly
allocate memory until the dirty data limit is hit.

This rapid memory consumption exacerbates the low memory situation,
resulting in increased swapping and more stalls, to the point where
the machine can essentially become unusable for a good period
of time.

I will say it's not clear whether this only affects FreeBSD, due to
the variations in how the VM interacts with ZFS.

Karl, one of the FreeBSD community members who has been suffering
from this issue in his production environments, has been experimenting
with recalculating zfs_dirty_data_max at the start of
dmu_tx_assign(..) to take free memory into account.

While this has produced good results in his environment, eliminating
the stalls entirely while keeping IO usage high, it's not clear whether
varying zfs_dirty_data_max could have undesired side effects.
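
To make the idea concrete, here's a minimal userland sketch of the kind of
calculation being described. This is not Karl's patch: the 3/4-of-free-memory
clamp, the function names and the example figures are assumptions for
illustration only.

#include <stdint.h>
#include <stdio.h>

static uint64_t
dirty_data_limit(uint64_t physmem, uint64_t freemem)
{
        uint64_t dirty_max = physmem / 10;      /* the usual ~10%-of-RAM default */
        uint64_t free_clamp = freemem / 4 * 3;  /* assumed: leave the VM some headroom */

        return (dirty_max < free_clamp ? dirty_max : free_clamp);
}

int
main(void)
{
        uint64_t gb = 1ULL << 30;

        /* Plenty of free memory: the static limit applies (prints 6553 MB). */
        printf("%llu MB\n",
            (unsigned long long)(dirty_data_limit(64 * gb, 32 * gb) >> 20));
        /* Near the low-memory watermark: the limit tracks free memory (prints 768 MB). */
        printf("%llu MB\n",
            (unsigned long long)(dirty_data_limit(64 * gb, 1 * gb) >> 20));
        return (0);
}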

Given that both Adam and Matt read these lists, I thought this would be an
ideal place to raise the issue and get expert feedback on the
problem and potential ways of addressing it.

So the questions:
1. Is this a FreeBSD-only issue, or could other implementations
suffer from a similar memory starvation situation due to rapid
consumption until the dirty data max is hit?
2. Should the dirty data max or its consumers be made memory-availability
aware to ensure that swapping due to IO bursts is avoided?

Regards
Steve
Matthew Ahrens via illumos-zfs
2014-09-14 17:31:54 UTC
Permalink
On Sun, Sep 14, 2014 at 10:00 AM, Steven Hartland via illumos-zfs <
Post by Steven Hartland via illumos-zfs
We've been investigating a problem with stalls on FreeBSD when
using ZFS, and one of the current theories which is producing some
promising results involves the new IO scheduler, specifically
the fact that the dirty data limit is a static limit.
The stalls occur when memory is close to the low water mark around
where paging will be triggered. At this time if there is a burst of
write IO, such as a copy from a remote location, ZFS can rapidly
allocate memory until the dirty data limit is hit.
This rapid memory consumption exacerbates the low memory situation,
resulting in increased swapping and more stalls, to the point where
the machine can essentially become unusable for a good period
of time.
I will say it's not clear whether this only affects FreeBSD, due to
the variations in how the VM interacts with ZFS.
Karl, one of the FreeBSD community members who has been suffering
from this issue in his production environments, has been experimenting
with recalculating zfs_dirty_data_max at the start of
dmu_tx_assign(..) to take free memory into account.
While this has produced good results in his environment, eliminating
the stalls entirely while keeping IO usage high, it's not clear whether
varying zfs_dirty_data_max could have undesired side effects.
Given that both Adam and Matt read these lists, I thought this would be an
ideal place to raise the issue and get expert feedback on the
problem and potential ways of addressing it.
1. Is this a FreeBSD-only issue, or could other implementations
suffer from a similar memory starvation situation due to rapid
consumption until the dirty data max is hit?
2. Should the dirty data max or its consumers be made memory-availability
aware to ensure that swapping due to IO bursts is avoided?
This is probably the wrong solution.

Are you sure that this only happens when writing, and not when reading?
All arc buffer allocation (including for writing) should go through
arc_get_data_buf(), which will evict from the ARC to make room for the new
buffer if necessary, based on arc_evict_needed().
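
In other words, the expected behaviour is roughly the following. This is a
simplified userland model of the policy described above, not the actual
arc.c code; all names and thresholds here are made up for illustration.

#include <stdbool.h>
#include <stdlib.h>

/*
 * Simplified model, not the real ARC: before handing out a new data
 * buffer, shrink the cache when eviction is deemed necessary, so a
 * burst of writes recycles cached memory instead of growing the
 * total footprint.
 */
struct cache {
        size_t size;    /* bytes currently held */
        size_t target;  /* desired steady-state size (like arc_c) */
};

static bool
evict_needed(const struct cache *c, size_t freemem, size_t lowmem)
{
        /* Evict when over target or when the system is short of memory. */
        return (c->size >= c->target || freemem < lowmem);
}

static void *
get_data_buf(struct cache *c, size_t len, size_t freemem, size_t lowmem)
{
        if (evict_needed(c, freemem, lowmem)) {
                /* Drop roughly len bytes of cached data first so the new
                 * buffer reuses memory rather than demanding more. */
                c->size -= (len < c->size ? len : c->size);
        }
        c->size += len;
        return (malloc(len));
}

int
main(void)
{
        struct cache c = { .size = 900, .target = 1000 };
        void *buf = get_data_buf(&c, 200, 50, 100);     /* low on free memory */

        free(buf);
        return (c.size == 900 ? 0 : 1);                 /* footprint did not grow */
}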

--matt
Post by Steven Hartland via illumos-zfs
Regards
Steve
Steven Hartland via illumos-zfs
2014-09-14 19:30:05 UTC
Permalink
----- Original Message -----
From: "Matthew Ahrens" <***@delphix.com>
To: "illumos-zfs" <***@lists.illumos.org>; "Steven Hartland" <***@multiplay.co.uk>
Cc: "developer" <***@open-zfs.org>
Sent: Sunday, September 14, 2014 6:31 PM
Subject: Re: [zfs] ZFS Write Throttle Dirty Data Limit Interaction with Free Memory
Post by Matthew Ahrens via illumos-zfs
On Sun, Sep 14, 2014 at 10:00 AM, Steven Hartland via illumos-zfs <
Post by Steven Hartland via illumos-zfs
We've been investigating a problem with stalls on FreeBSD when
using ZFS, and one of the current theories which is producing some
promising results involves the new IO scheduler, specifically
the fact that the dirty data limit is a static limit.
The stalls occur when memory is close to the low water mark around
where paging will be triggered. At this time if there is a burst of
write IO, such as a copy from a remote location, ZFS can rapidly
allocate memory until the dirty data limit is hit.
This rapid memory consumption exacerbates the low memory situation,
resulting in increased swapping and more stalls, to the point where
the machine can essentially become unusable for a good period
of time.
I will say it's not clear whether this only affects FreeBSD, due to
the variations in how the VM interacts with ZFS.
Karl, one of the FreeBSD community members who has been suffering
from this issue in his production environments, has been experimenting
with recalculating zfs_dirty_data_max at the start of
dmu_tx_assign(..) to take free memory into account.
While this has produced good results in his environment, eliminating
the stalls entirely while keeping IO usage high, it's not clear whether
varying zfs_dirty_data_max could have undesired side effects.
Given that both Adam and Matt read these lists, I thought this would be an
ideal place to raise the issue and get expert feedback on the
problem and potential ways of addressing it.
1. Is this a FreeBSD-only issue, or could other implementations
suffer from a similar memory starvation situation due to rapid
consumption until the dirty data max is hit?
2. Should the dirty data max or its consumers be made memory-availability
aware to ensure that swapping due to IO bursts is avoided?
This is probably the wrong solution.
Are you sure that this only happens when writing, and not when reading?
All arc buffer allocation (including for writing) should go through
arc_get_data_buf(), which will evict from the ARC to make room for the new
buffer if necessary, based on arc_evict_needed().
The load is a mixture of reads and writes, with the trigger in this test being
a large amount of writes over Samba by a backup process, so that doesn't
mean that reads can't ever be a trigger for this.

We've been investigating ARC allocation quite a bit and the ARC does indeed
get pushed back. Adjusting the ARC's target for free memory has helped, but any
significant adjustment there has been demonstrated to cause other
issues, such as the ARC being pushed back to its minimum for a considerable
amount of time, if not indefinitely, as the VM never sees any pressure and
hence doesn't scan INACT entries.

With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.

If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.

Regards
Steve
Matthew Ahrens via illumos-zfs
2014-09-14 20:32:11 UTC
Permalink
Post by Steven Hartland via illumos-zfs
Sent: Sunday, September 14, 2014 6:31 PM
Subject: Re: [zfs] ZFS Write Throttle Dirty Data Limit Interaction with Free Memory
On Sun, Sep 14, 2014 at 10:00 AM, Steven Hartland via illumos-zfs <
Post by Steven Hartland via illumos-zfs
We've been investigating a problem with stalls on FreeBSD when
Post by Steven Hartland via illumos-zfs
using ZFS, and one of the current theories which is producing some
promising results involves the new IO scheduler, specifically
the fact that the dirty data limit is a static limit.
The stalls occur when memory is close to the low water mark around
where paging will be triggered. At this time if there is a burst of
write IO, such as a copy from a remote location, ZFS can rapidly
allocate memory until the dirty data limit is hit.
This rapid memory consumption exacerbates the low memory situation,
resulting in increased swapping and more stalls, to the point where
the machine can essentially become unusable for a good period
of time.
I will say it's not clear whether this only affects FreeBSD, due to
the variations in how the VM interacts with ZFS.
Karl, one of the FreeBSD community members who has been suffering
from this issue in his production environments, has been experimenting
with recalculating zfs_dirty_data_max at the start of
dmu_tx_assign(..) to take free memory into account.
While this has produced good results in his environment, eliminating
the stalls entirely while keeping IO usage high, it's not clear whether
varying zfs_dirty_data_max could have undesired side effects.
Given that both Adam and Matt read these lists, I thought this would be an
ideal place to raise the issue and get expert feedback on the
problem and potential ways of addressing it.
1. Is this a FreeBSD-only issue, or could other implementations
suffer from a similar memory starvation situation due to rapid
consumption until the dirty data max is hit?
2. Should the dirty data max or its consumers be made memory-availability
aware to ensure that swapping due to IO bursts is avoided?
This is probably the wrong solution.
Are you sure that this only happens when writing, and not when reading?
All arc buffer allocation (including for writing) should go through
arc_get_data_buf(), which will evict from the ARC to make room for the new
buffer if necessary, based on arc_evict_needed().
The load is a mixture of reads and writes, with the trigger in this test being
a large amount of writes over Samba by a backup process, so that doesn't
mean that reads can't ever be a trigger for this.
We've been investigating ARC allocation quite a bit and the ARC does indeed
get pushed back. Adjusting the ARC's target for free memory has helped, but any
significant adjustment there has been demonstrated to cause other
issues, such as the ARC being pushed back to its minimum for a considerable
amount of time, if not indefinitely, as the VM never sees any pressure and
hence doesn't scan INACT entries.
With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
Post by Steven Hartland via illumos-zfs
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction. Is your
ARC below the minimum size?

--matt



Karl Denninger via illumos-zfs
2014-09-14 21:36:02 UTC
Permalink
Berend de Boer via illumos-zfs
2014-09-14 21:42:43 UTC
Permalink
Karl> What appears to be happening (after much dtrace'ing and work
Karl> with Steve) is this:

This is a brilliant analysis. Would it be helpful to others if you
just dumped the dtrace scripts somewhere? No need to cleanup, it might
help folk to get ideas and have a look at what can be done.


--
All the best,

Berend de Boer




Karl Denninger via illumos-zfs
2014-09-14 22:03:51 UTC
Permalink
Post by Berend de Boer via illumos-zfs
Karl> What appears to be happening (after much dtrace'ing and work
This is a brilliant analysis. Would it be helpful to others if you
just dumped the dtrace scripts somewhere? No need to cleanup, it might
help folk to get ideas and have a look at what can be done.
Steve can chime in with his stuff; the one I used for the dirty pool
data status was the txg-syncing one off Adam's blog here:
http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/

It was that script in particular along with the commentary in that blog
post that provided the illumination.
--
Karl Denninger
***@denninger.net
/The Market Ticker/



Steven Hartland via illumos-zfs
2014-09-14 22:28:21 UTC
Permalink
----- Original Message -----
Post by Berend de Boer via illumos-zfs
Karl> What appears to be happening (after much dtrace'ing and work
This is a brilliant analysis. Would it be helpful to others if you
just dumped the dtrace scripts somewhere? No need to cleanup, it might
help folk to get ideas and have a look at what can be done.
The main ones I created are on the bug report here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

Under FreeBSD currently, if we hit the equivalent of minfree it triggers
ARC reclaim in aggressive mode. With the ARC reclaim trigger point at
vm.pageout_wakeup_thresh I can easily peg the ARC back to its minimum.

I found that after increasing this to (vm.pageout_wakeup_thresh / 2) * 3
I could no longer reproduce this behaviour, as it gives the ARC a chance
to react in normal mode, but there is concern that this will prevent
the VM from reclaiming INACT memory when it should.
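
For reference, (vm.pageout_wakeup_thresh / 2) * 3 amounts to roughly a 1.5x
scaling of the wakeup threshold, so the ARC's reclaim check trips while there
are still more free pages available, before the pageout daemon wakes up. A toy
illustration with hypothetical page counts (not the actual FreeBSD code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
arc_reclaim_triggered(uint64_t free_pages, uint64_t free_target)
{
        return (free_pages < free_target);
}

int
main(void)
{
        uint64_t wakeup_thresh = 12000; /* hypothetical page counts */
        uint64_t free_pages = 15000;

        printf("at wakeup_thresh:      %s\n",
            arc_reclaim_triggered(free_pages, wakeup_thresh) ? "yes" : "no");
        printf("at 1.5x wakeup_thresh: %s\n",
            arc_reclaim_triggered(free_pages, wakeup_thresh / 2 * 3) ? "yes" : "no");
        return (0);
}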

Peter has demonstrated this on the package cluster machines with
the ARC reclaim set to vm.v_free_target, a value we found to help
significantly with the main stall issue.

Another resource of information about this and some work in progress
can be found here: https://reviews.freebsd.org/D702

Be warned: in the two links there's a lot of history, most of which is
quite old due to additional changes / fixes in the VM and ZFS
code, so the older stuff should either be ignored or taken with
a large pinch of salt.

Regards
Steve
Steven Hartland via illumos-zfs
2014-09-18 00:13:49 UTC
Permalink
----- Original Message -----
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?

If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction. Is your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?

If the ARC size is at min or above, I'm wondering if we're simply not successfully
evicting cached data in preference for write data when we're expecting to,
possibly due to hash lock misses?

Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
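
If it helps, those counters can also be polled programmatically; a minimal
FreeBSD sketch, assuming the OIDs above export 64-bit values:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t
read_kstat(const char *oid)
{
        uint64_t val = 0;
        size_t len = sizeof (val);

        if (sysctlbyname(oid, &val, &len, NULL, 0) != 0)
                perror(oid);
        return (val);
}

int
main(void)
{
        /* Print the two counters named above once a second. */
        for (;;) {
                printf("mutex_miss=%ju recycle_miss=%ju\n",
                    (uintmax_t)read_kstat("kstat.zfs.misc.arcstats.mutex_miss"),
                    (uintmax_t)read_kstat("kstat.zfs.misc.arcstats.recycle_miss"));
                sleep(1);
        }
}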

If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?

Regards
Steve
Karl Denninger via illumos-zfs
2014-09-18 04:11:17 UTC
Permalink
----- Original Message ----- From: "Matthew Ahrens via illumos-zfs"
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?
If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction.
Is your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?
It is at the adaptive target size when the issue triggers.
If the ARC size is at min or above, I'm wondering if we're simply not successfully
evicting cached data in preference for write data when we're expecting to,
possibly due to hash lock misses?
Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
I will look at this tomorrow.
If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?
Regards
Steve
--
Karl Denninger
***@denninger.net
/The Market Ticker/



Karl Denninger via illumos-zfs
2014-09-18 17:18:18 UTC
Permalink
Post by Karl Denninger via illumos-zfs
----- Original Message ----- From: "Matthew Ahrens via illumos-zfs"
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?
If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction.
Is your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?
It is at the adaptive target size when the issue triggers.
If the ARC size is at min or above, I'm wondering if we're simply not successfully
evicting cached data in preference for write data when we're expecting to,
possibly due to hash lock misses?
Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
I will look at this tomorrow.
If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?
Regards
Steve
I am, on a new checkout, getting some extremely interesting behavior
that is completely inconsistent with what has been seen before.

The system, instead of being driven into hard paging and stalls, is
being *very* aggressive about evicting ARC when the txg dirty
pool is slammed at maximum. In fact, on a sum-of-the-allocations basis
it's dramatically too aggressive, leaving roughly a quarter of system
RAM unused and driving the ARC down to about half of the max computed size.
It's aggressive enough that, were my dynamic dirty_max resizing code in
(as it's a new pull from the repository, the patch is not there), it would
never trigger: the probe I inserted to monitor free memory under
loaded conditions (which I expected to show a negative margin over
the paging threshold) never gets anywhere near the paging
threshold where this would come into play.

That's infinitely preferable to taking system stalls -- but what caused
that change is a mystery at this point, given that I'm unaware of
anything going into the source tree that would have produced this sort
of behavioral change. Further, it's so aggressive that I suspect
there's a decent cache-hit performance penalty operating here
instead.

The only change I'm aware of related to ZFS is an MFC for TRIM not being
available and causing the resilver restarts. I don't *think* that comes
into play in regard to this behavior, but.....

I'm looking very closely into trying to account for the difference in
system behavior as at present I have no explanation for it.
--
Karl Denninger
***@denninger.net
/The Market Ticker/



Karl Denninger via illumos-zfs
2014-09-18 17:45:07 UTC
Permalink
Please disregard -- Steve had a patch he wanted on the base code so
we're doing apples-to-apples, and I missed it.
Post by Karl Denninger via illumos-zfs
Post by Karl Denninger via illumos-zfs
----- Original Message ----- From: "Matthew Ahrens via illumos-zfs"
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
With regard to buffers being allocated by arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being allocated, even
when arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?
If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction.
Is your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?
It is at the adaptive target size when the issue triggers.
If the ARC size is at min or above, I'm wondering if we're simply not
successfully evicting cached data in preference for write data when
we're expecting to, possibly due to hash lock misses?
Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
I will look at this tomorrow.
If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?
Regards
Steve
I am, on a new checkout, getting some extremely interesting behavior
that is completely inconsistent with what has been seen before.
The system, instead of being driven into hard paging and stalls, is
instead being *very* aggressive about evicting ARC when the txg dirty
pool is slammed at maximum. In fact on a sum-of-the-allocation basis
it's dramatically too aggressive, leaving roughly a quarter of system
RAM unused and driving ARC down to about half of the max computed
size. It's aggressive enough that were my dynamic dirty_max resizing
code in (as it's a new pull from the repository the patch is not
there) it would never trigger as the probe I inserted to monitor free
memory under loaded conditions (and which I expected to show a
negative margin over the paging threshold) instead never gets anywhere
near the paging threshold where this would come into play.
That's infinitely preferable to taking system stalls -- but what
caused that change is a mystery at this point given that I'm unaware
of anything going into the source tree that would have produced this
sort of behavioral change, and further it's so aggressive that I
suspect there's a decent cache hit performance penalty that's
operative here instead.
The only change I'm aware of related to ZFS is an MFC for TRIM not
being available and causing the resilver restarts. I don't *think*
that comes into play in regard to this behavior, but.....
I'm looking very closely into trying to account for the difference in
system behavior as at present I have no explanation for it.
--
Karl Denninger
/The Market Ticker/
--
Karl Denninger
***@denninger.net
/The Market Ticker/



Matthew Ahrens
2014-09-18 18:10:17 UTC
Permalink
----- Original Message ----- From: "Matthew Ahrens via illumos-zfs" <
With regard to buffers being allocated by arc_get_data_buf(), I can't see
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
a path by which the ARC will prevent a new buffer from being allocated, even when
arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?
Yes.
If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Yes.
If that's the case, can't we hit the ARC minimum yet still claim new buffers? If
Post by Matthew Ahrens via illumos-zfs
Post by Steven Hartland via illumos-zfs
so we can suddenly demand up to 10% of the system memory, all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without restriction. Is
your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?
If the ARC size is at min or above, I'm wondering if we're simply not successfully
evicting cached data in preference for write data when we're expecting to,
possibly due to hash lock misses?
Hash lock misses shouldn't be able to prevent us from evicting enough data.
If one buffer's hash lock is held, we will move on to the next one. It
would be extremely surprising if a large percent of all buffers are locked
at the same time.
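
To make that concrete, here's a small self-contained sketch of such a
skip-and-count walk. It's a model of the behaviour described, not the actual
arc_evict() code; the counter name just mirrors the kstat for clarity.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct buf {
        struct buf      *next;
        pthread_mutex_t  hash_lock;
        size_t           size;
};

static uint64_t mutex_miss;     /* analogous to arcstats.mutex_miss */

static size_t
evict_bytes(struct buf *list, size_t want)
{
        size_t evicted = 0;

        for (struct buf *b = list; b != NULL && evicted < want; b = b->next) {
                if (pthread_mutex_trylock(&b->hash_lock) != 0) {
                        mutex_miss++;   /* lock held elsewhere: skip this buffer */
                        continue;
                }
                evicted += b->size;     /* the real code would free the data here */
                pthread_mutex_unlock(&b->hash_lock);
        }
        return (evicted);
}

int
main(void)
{
        struct buf b2 = { NULL, PTHREAD_MUTEX_INITIALIZER, 8192 };
        struct buf b1 = { &b2, PTHREAD_MUTEX_INITIALIZER, 8192 };

        /* Hold one hash lock to simulate contention, then try to evict 16 KB. */
        pthread_mutex_lock(&b1.hash_lock);
        size_t got = evict_bytes(&b1, 16384);
        pthread_mutex_unlock(&b1.hash_lock);

        /* Only b2 could be evicted; b1 was skipped and counted as a mutex miss. */
        return (got == 8192 && mutex_miss == 1 ? 0 : 1);
}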

--matt
Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?
Regards
Steve
Karl Denninger via illumos-zfs
2014-09-19 01:27:47 UTC
Permalink
On Wed, Sep 17, 2014 at 5:13 PM, Steven Hartland
----- Original Message ----- From: "Matthew Ahrens via
With regard to buffers being allocated by
arc_get_data_buf(), I can't see
a path by which the ARC will prevent a new buffer from being
allocated even when
arc_evict_needed() indicates eviction is required.
It won't, but it will evict an existing buffer, thus freeing up memory for
the new one.
So just to confirm, the expected behaviour given a sudden burst of writes
when we're already tight on memory is that arc_get_data_buf calls
arc_evict(...) which removes cached data, which then gets reused
as a dirty data buffer?
Yes.
If so we should see a large number of hits to the arc_evict:entry probe
with recycle, followed by arc__evict hits.
Yes.
If that's the case, can't we hit the ARC minimum yet still claim
new buffers? If
so we can suddenly demand up to 10% of the system memory,
all of which
may require the VM to page before it can provide said memory.
Sure, the ARC can grow up to the minimum size without
restriction. Is your
ARC below the minimum size?
Not sure. Karl, could you confirm the size of the ARC when the issue triggers?
If the ARC size is at min or above, I'm wondering if we're simply not successfully
evicting cached data in preference for write data when we're expecting to,
possibly due to hash lock misses?
Hash lock misses shouldn't be able to prevent us from evicting enough
data. If one buffer's hash lock is held, we will move on to the next
one. It would be extremely surprising if a large percent of all
buffers are locked at the same time.
--matt
Karl, when you see stalls are you seeing increasing mutex and
recycle misses?
sysctl kstat.zfs.misc.arcstats.mutex_miss
sysctl kstat.zfs.misc.arcstats.recycle_miss
If this is the case, could it be that we're hitting contention between writes
triggering arc_evict directly and arc_reclaim_thread doing a cleanup?
Regards
Steve
The root of the issue appears to be severe breakage within the UMA
allocator code; I was able to repeatedly trap it grabbing 2GB RAM chunks
when the system was low on memory (more than enough to provoke
pathological behavior) and also "sitting" on large blocks of wired RAM
that had been allegedly released.
--
Karl Denninger
***@denninger.net
/The Market Ticker/


