Discussion:
Too much RAM for ARC and fragmentation
aurfalien
2013-11-17 18:43:19 UTC
Permalink
Hi,

I thought I'd combine two subjects, as at first glance I sort of find them odd.

RAM and ARC;

I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and not to go over that.

I've also read the more RAM for ARC the better.

If one does go over 128GB, slowdowns could occur. Nothing definitive in my readings other than that.

Is there any merit to this?

What tools could one use to monitor this phenomenon? I assume arcstat.py and zpool iostat?


Fragmentation;

I've read that if one goes over 70% storage utilization, that fragmentation will occur and be noticeable.

Do tools exist to measure ZFS fragmentation?


- aurf
Richard Elling
2013-11-17 18:56:00 UTC
Permalink
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slowdowns could occur. Nothing definitive in my readings other than that.
Is there any merit to this?
Not really. There are behaviours that vary across OSes in how they deal with
large memory. ZFS uses the kernel memory allocator, so if the kernel can scale well,
then ZFS can scale well. If the kernel doesn't scale well... ZFS can't do it better than the
kernel.
Post by aurfalien
What tools could one use to monitor this phenomenon? I assume arcstat.py and zpool iostat?
ARC kstats and the arcstat tool will show the current ARC size and targets.
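For anyone unfamiliar with those, here is a minimal sketch of reading the numbers. The `kstat -p zfs:0:arcstats` output format and the `size`/`c`/`c_max` statistic names are the standard illumos arcstats; the sample text below simply stands in for the live command:

```python
# Sketch: parse `kstat -p zfs:0:arcstats`-style output and compare the
# current ARC size against its target (c) and hard limit (c_max).
SAMPLE = """\
zfs:0:arcstats:size\t103079215104
zfs:0:arcstats:c\t120259084288
zfs:0:arcstats:c_max\t137438953472
"""

def parse_arcstats(text):
    """Return {statistic: value} from `kstat -p` lines (name<TAB>value)."""
    stats = {}
    for line in text.splitlines():
        name, value = line.rsplit("\t", 1)
        stats[name.rsplit(":", 1)[1]] = int(value)
    return stats

stats = parse_arcstats(SAMPLE)
gib = 1 << 30
print("ARC size  : %6.1f GiB" % (stats["size"] / gib))
print("ARC target: %6.1f GiB (c)" % (stats["c"] / gib))
print("ARC max   : %6.1f GiB (c_max)" % (stats["c_max"] / gib))
```

In practice one would feed this the output of the real `kstat` command; arcstat itself prints the same counters sampled over time.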
Post by aurfalien
Fragmentation;
I've read that if one goes over 70% storage utilization, that fragmentation will occur and be noticeable.
This is true of all file systems. There is a point at which the allocation algorithms must make
hard decisions. However, there is nothing magic about 70% for ZFS. There is some magic
that occurs at 96% and a well-managed datacenter often has a policy about going over
80% for capacity planning purposes.
Post by aurfalien
Do tools exist to measure ZFS fragmentation?
Can you define "fragmentation" in this context? It is an overloaded term.
-- richard

--

***@RichardElling.com
+1-760-896-4422
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Andrew Galloway
2013-11-17 21:04:59 UTC
Permalink
Inline.

- Andrew

On Sun, Nov 17, 2013 at 10:56 AM, Richard Elling
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and
not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Actually, at the moment, I'd stand by that advice. There are a number of
problems identified on 'large memory systems' (> 128-192 GB or so) that
have culminated in Nexenta forcing the ARC max to 128 GB on many builds
today. Identification and resolution of the bugs is ongoing. AFAIK, at
least some of the identified issues will affect any illumos-based OS, not
just Nexenta [which is a bit older], but I can't speak for Linux or FreeBSD
ZFS. The common wisdom in the field is 128-192, and also that it is enough
to limit ARC. We have systems in production with 256, 512, and more, that
are fine with ARC limited to 128-192 GB, and the hope is once the 'bugs'
are resolved, they could remove the artificial limit. We also have systems
in production with 512+ that are fine /without/ any limit -- the issues
have to do with more than /just/ the amount of RAM in the system (in fact,
AFAIK, one could argue they have nothing to do with the amount of RAM in
the system, but with other things.. however, the amount of RAM in the
system makes the symptoms of the 'bugs' go from manageable to unmanageable,
and it can be hard to impossible beforehand to tell you if you will hit the
problems, thus the field rule of thumb to limit to 128 GB of RAM).
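For reference, on an illumos-based system that cap is usually applied with the `zfs_arc_max` tunable in `/etc/system` (the value here is simply the 128 GB from the rule of thumb, expressed in bytes):

```
* Cap the ZFS ARC at 128 GB (137438953472 bytes); takes effect at next boot
set zfs:zfs_arc_max = 137438953472
```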
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive on
my readings other than that.
Is there any merit to this?
Not really. There are behaviours that can occur on various OSes and how they deal with
large memory. ZFS uses the kernel memory allocation, so if the kernel can scale well,
then ZFS can scale well. If the kernel doesn't scale well... ZFS can't do
it better than the
kernel.
True and untrue. ZFS IS beholden to the kernel when it comes to memory
stuff, to some degree. However, that it is all about kernel would assume
there are zero inefficiencies in ZFS and how it handles its own in-memory
structures and that all perceived latency or issues regarding memory (ARC)
are a result of kernel code. That's just patently false. Random example:
ZFS decides it needs to 'free' 100 GB of RAM from ARC this very moment, on
a box where average block size on the pool is 4-8 KB. AFAIK, the act of
'freeing' that RAM isn't really so much a kernel task (it doesn't just go
to the kernel and say, hey dude, this huge block of RAM, I don't need it
anymore) as it is ZFS running through its own in-memory structures, freeing
up tiny pieces of RAM after it identifies where they are, a process that
takes considerable time (and in some circumstances seems to completely
freeze the box until it is done). But I'm ill-equipped to explain this
better than that and (thankfully for us all!) am not one of the developers
working on identifying and fixing these issues. I know enough to know that
the above statement might be technically accurate in some lights, but it
falls short of field-usable truth.
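A toy model (plain Python, not ZFS code) makes the shape of that cost visible: an eviction pass has to visit a header per cached block, so freeing the same number of bytes held as small blocks means walking vastly more entries:

```python
def evict(buffers, bytes_to_free):
    """Walk a buffer list, dropping entries until the byte goal is met,
    the way an eviction pass must visit each small header individually.
    Returns the number of headers visited."""
    freed = visited = 0
    while buffers and freed < bytes_to_free:
        freed += buffers.pop()
        visited += 1
    return visited

GOAL = 1 << 30                        # free 1 GiB of cached data
small = [4096] * (GOAL // 4096)       # 4 KiB records: ~262k headers
large = [131072] * (GOAL // 131072)   # 128 KiB records: ~8k headers

print("4 KiB records  :", evict(small, GOAL), "headers visited")
print("128 KiB records:", evict(large, GOAL), "headers visited")
```

The real ARC eviction path also takes locks and updates lists per header, so the per-entry cost is higher than this sketch suggests, but the scaling with block count is the point.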
Post by aurfalien
What tools could one use to monitor this phenomenon? I assume arcstat.py and zpool iostat?
ARC kstats and the arcstat tool will show the current ARC size and targets.
Fragmentation;
I've read that if one goes over 70% storage utilization, that
fragmentation will occur and be noticeable.
This is true of all file systems. There is a point at which the allocation
algorithms must make
hard decisions. However, there is nothing magic about 70% for ZFS. There is some magic
that occurs at 96% and a well-managed datacenter often has a policy about going over
80% for capacity planning purposes.
Do tools exist to measure ZFS fragmentation?
Can you define "fragmentation" in this context? It is an overloaded term.
While the above is all technically true, it again seems to presume there's
no problem with going over 70, 80%, that 96% is the point of problem. This
is also factually untrue. Field experience tells us that 70-80% as a hard
cap on your utilization is 'good enough' to prevent significant performance
slowdown on a large percentage of deployed systems (of which we have
1000's). WHY this is true has a couple of answers depending on your environment
and workload, and it ISN'T a hard & fast safety number (it's possible to
cause significant "fragmentation"-related slowdown with well under 50% of
the pool ever used, especially if you design a test load specifically to
cause issues like this), and some get bit even obeying a 70% rule, but the
majority do not, and so it makes a good field rule of thumb.

There's a big difference between what is academically true, and what is
best advice for field. Academically, there should be no problem with a ZFS
solution on a 1+ PB RAM box with quad-proc, hex-core processors with 7 x8
PCI-e slots each with a dual-port SAS HBA each in turn plugged into
separate SAS switches each in turn plugged into a total of let's say 60 or
so JBOD's, each containing 24 hard disks, for a total of 5,760 TB (w/4 TB
drives) raw space. However, field experience tells us you would be an
astronomical idiot to actually do this, and the resulting system would be a
disaster in the making. One that its owner would never be happy or even
satisfied with, and would definitely regret buying. We know this without
ever having seen such a system, because we've seen a number of builds
anywhere from 1/5 to 1/3 this size, and they've all been problems.
Post by aurfalien
-- richard
--
+1-760-896-4422
Richard Elling
2013-11-17 23:07:32 UTC
Permalink
Post by Andrew Galloway
Inline.
- Andrew
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Actually, at the moment, I'd stand by that advice. There are a number of problems identified on 'large memory systems' (> 128-192 GB or so) that have culminated in Nexenta forcing the ARC max to 128 GB on many builds today.
Yes, NexentaStor 3.x suffers as an old OS that does not scale well wrt large memory.
By contrast, the Oracle Z3 has rather impressive performance results with 1TB of memory and
we know that Solaris 11.1 has a very different use of ARC memory than current illumos. In
other words, you reiterate my point, but please don't blame ZFS for the OS shortcomings.
Post by Andrew Galloway
Identification and resolution of the bugs is ongoing. AFAIK, at least some of the identified issues will effect any illumos-based OS, not just Nexenta [which is a bit older], but I can't speak for Linux or FreeBSD ZFS. The common wisdom in the field is 128-192, and also that it is enough to limit ARC. We have systems in production with 256, 512, and more, that are fine with ARC limited to 128-192 GB, and the hope is once the 'bugs' are resolved, they could remove the artificial limit. We also have systems in production with 512+ that are fine /without/ any limit -- the issues have to do with more than /just/ the amount of RAM in the system (in fact, AFAIK, one could argue they have nothing to do with the amount of RAM in the system, but with other things.. however, the amount of RAM in the system makes the symptoms of the 'bugs' go from manageable to unmanageable, and it can be hard to impossible beforehand to tell you if you will hit the problems, thus the field rule of thumb to limit to 128 GB of RAM).
It is unfortunate that Nexenta cannot offer better advice to their customers.
Post by Andrew Galloway
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slowdowns could occur. Nothing definitive in my readings other than that.
Is there any merit to this?
Not really. There are behaviours that can occur on various OSes and how they deal with
large memory. ZFS uses the kernel memory allocation, so if the kernel can scale well,
then ZFS can scale well. If the kernel doesn't scale well... ZFS can't do it better than the
kernel.
True and untrue. ZFS IS beholden to the kernel when it comes to memory stuff, to some degree. However, that it is all about kernel would assume there are zero inefficiencies in ZFS and how it handles its own in-memory structures and that all perceived latency or issues regarding memory (ARC) are a result of kernel code. That's just patently false. Random example: ZFS decides it needs to 'free' 100 GB of RAM from ARC this very moment, on a box where average block sized on the pool is 4-8 KB. AFAIK, act of 'freeing' that RAM isn't really so much a kernel task (it doesn't just go to the kernel and say, hey dude, this huge block of RAM, I don't need it anymore) as it is ZFS running through its own in-memory structures, freeing up tiny pieces of RAM after it identifies where they are, a process that takes considerable time (and in some circumstances seems to completely freeze the box until it is done). But I'm ill-equipped to explain this better than that and (thankfully for us all!) am not one of the developers working on identifying and fixing these issues. I know enough to know that the above statement might be technically accurate in some lights, but it falls short of field-usable truth.
Yes, this is true for NexentaStor. Please don't project it on other OSes.
Post by Andrew Galloway
Post by aurfalien
What tools could one use to monitor this phenomenon? I assume arcstat.py and zpool iostat?
ARC kstats and the arcstat tool will show the current ARC size and targets.
Post by aurfalien
Fragmentation;
I've read that if one goes over 70% storage utilization, that fragmentation will occur and be noticeable.
This is true of all file systems. There is a point at which the allocation algorithms must make
hard decisions. However, there is nothing magic about 70% for ZFS. There is some magic
that occurs at 96% and a well-managed datacenter often has a policy about going over
80% for capacity planning purposes.
Post by aurfalien
Do tools exist to measure ZFS fragmentation?
Can you define "fragmentation" in this context? It is an overloaded term.
While the above is all technically true, it again seems to presume there's no problem with going over 70, 80%, that 96% is the point of problem. This is also factually untrue. Field experience tells us that 70-80% as a hard cap on your utilization is 'good enough' to prevent significant performance slowdown on a large percentage of deployed systems (of which we have 1000's). WHY this is has a couple of answers depending on your environment and workload, and it ISN'T a hard & fast safety number (it's possible to cause significant "fragmentation"-related slowdown with well under 50% of the pool ever used, especially if you design a test load specifically to cause issues like this), and some get bit even obeying a 70% rule, but the majority do not, and so it makes a good field rule of thumb.
The rule of thumb for capacity planning is 80%. This has been true for decades. But that
has nothing to do with "fragmentation." My request is for the OP to specify what they are
asking for.

As I said previously, the default allocation algorithm changes from first fit to best fit at 96%.
For your workload, it might work better with a change at a different percentage. The source
is open, you can use one or more of the existing allocators and adjust the change percentage.
Or, write your own allocator... the possibilities are endless :-)
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/metaslab.c

Note: some distros use the new dynamic fit (ndf) allocator rather than the dynamic fit (df) allocator.
Current illumos-gate code uses the df allocator and George Wilson has been doing a lot of work
in this area for illumos. But do not assume all ZFS implementations use the same allocator, they
do not.
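To make the first-fit vs. best-fit distinction concrete, here is a simplified sketch of the policy switch. The real code operates on metaslab space maps and is far more involved (see the metaslab.c link above); the 96% switch point and the two policies are taken from the description in this thread:

```python
def first_fit(free_segments, size):
    """First fit: take the first free segment big enough. Fast, but can
    fragment space by always chewing on the front of the free list."""
    for offset, length in free_segments:
        if length >= size:
            return offset
    return None

def best_fit(free_segments, size):
    """Best fit: take the smallest segment that still fits. Slower (full
    scan), but wastes less space when the pool is nearly full."""
    candidates = [(length, offset)
                  for offset, length in free_segments if length >= size]
    return min(candidates)[1] if candidates else None

def allocate(free_segments, size, pct_used, switch_pct=96):
    # Mirror the described policy: first fit below the switch point,
    # best fit at or above it.
    pick = best_fit if pct_used >= switch_pct else first_fit
    return pick(free_segments, size)

segs = [(0, 64), (100, 16), (200, 8)]   # (offset, length) free segments
print(allocate(segs, 8, pct_used=50))   # first fit takes the big front segment
print(allocate(segs, 8, pct_used=97))   # best fit takes the exact-size one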
Post by Andrew Galloway
There's a big difference between what is academically true, and what is best advice for field. Academically, there should be no problem with a ZFS solution on a 1+ PB RAM box with quad-proc, hex-core processors with 7 x8 PCI-e slots each with a dual-port SAS HBA each in turn plugged into separate SAS switches each in turn plugged into a total of let's say 60 or so JBOD's, each containing 24 hard disks, for a total of 5,760 TB (w/4 TB drives) raw space. However, field experience tells us you would be an astronomical idiot to actually do this, and the resulting system would be a disaster in the making. One that the owner of would never be happy or even satisfied with, and would definitely regret buying. We know this without ever having seen such a system, because we've seen a number of builds anywhere from 1/5 to 1/3 this size, and they've all been problems.
Once again, you are projecting random customer requirements and referencing nebulous
"problems" on the situation as you have experienced at Nexenta. If you have a specific bug
in mind or use case, please reference. Otherwise, IMHO, you are simply doing a disservice
to the ZFS community by propagating rules with no basis.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Andrew Galloway
2013-11-17 23:57:57 UTC
Permalink
<snip>
Once again, you are projecting random customer requirements and referencing
nebulous
"problems" on the situation as you have experienced at Nexenta. If you have a specific bug
in mind or use case, please reference. Otherwise, IMHO, you are simply doing a disservice
to the ZFS community by propagating rules with no basis.
-- richard
Unfortunately, the 'problem' with 'field advice' is often that it is
designed to cover lots of use-cases and get around potentially lots of bugs
(or simple inefficiencies, and/or simple learned behavior). It is not
always specific. Everyone loves it when it is, but sometimes it just isn't.
It is 'best rule of thumb', designed to avoid the majority of corner cases
and so on. Often it is simply advice based on what did or did not hurt in
the past, with no specifics available.

The basis for my 'rules' (which are really just suggestions, that anyone is
free to disregard if they wish) is really simple: does not following this
suggestion routinely cause me more pain than following this suggestion?
Yes? Then follow the suggestion. Much of this applies regardless of the
version of ZFS or OS you're using, as it's more architectural than anything.

IMHO, you have and continue to do a disservice to the ZFS community by
routinely making comments and suggestions that while academically true go
against my field experience and even system administration common sense.
Pushing the envelope, especially in production, is not something you
suggest without stating as such and explaining caveats - certainly not to
someone who hasn't specifically said they're OK with being a guinea pig.

"Nebulous" (in that I don't cite specific line numbers of code) is simply
because I /am/ a field guy, not a developer. I may not have nearly the
understanding a doctor does as to why it hurts to smash my face walking
into a wall, but I do quickly learn not to do it. If it /doesn't/ hurt to
walk through something, and I need to, I assure you I will - all of my
advice is based on real-world experience of actual pain, not just
fear-mongering. Some of those painful experiences were and are on
systems /you/ designed/approved, and continue to cause pain to this day.
Ones where what academically sounded fine proved a problem in the field. I
can list off 5 off the top of my head. But now I'm getting into ad hominem,
so I'm going to simply say that we're going to have to agree to disagree on
this. Field experience tells me one thing, and I'm sorry if that's at odds
with what the code would suggest should work. "Should" and "do" are,
unfortunately, not always in sync.

It is fair to say all of my advice and field experience is on Nexenta. It
is fair to say NexentaStor 3.1.x is 'old'. Your comments seem a little
harsh, since I know factually that any number of problems identified
internally are the same or worse in latest code (such as ARC evict of L2ARC
entries on pool export, for example), as people who do code have shown me
the comparable bits, or asserted to me there's no difference between us and
illumos. NexentaStor has done quite a few back-ports, as well. There is not
so large a difference between an all-ZFS use-case NexentaStor and illumos
that you can just wave your hands and claim everything is fixed in illumos.
Nor can you claim that there's such a large delta between the two that
common sense rules on production system limitations are radically
different. Well, I suppose you can, since you basically just did - but
we'll have to agree to disagree on it.

It's not like I'm suggesting you not use ZFS. I love ZFS. I'd continue to
encourage its adoption and continued development even if it wasn't paying
my salary, yet it also is doing that, so I'm super motivated to love ZFS.
However, as with basically everything, there are often field limitations to
what is possible, feasible, and safe. My last car could, on paper and in
the lab (closed track), go about 180 MPH - common sense kept me from
attempting it in 5 PM rush hour traffic on the freeway, and field
experience suggested I'd likely never actually get 180 MPH out of it in the
city (not in the lab), ever. I don't need to understand every last nuance
of the engine, and where every last screw goes, to have learned through
experience that veering into a wall and crashing causes physical pain, and
I feel pretty confident I'm not doing anyone a disservice passing that
advice on to others, but to each their own. :)

Richard Elling
2013-11-18 02:20:59 UTC
Permalink
I apologize for distracting this thread. Just one comment below...
<snip>
Once again, you are projecting random customer requirements and referencing nebulous
"problems" on the situation as you have experienced at Nexenta. If you have a specific bug
in mind or use case, please reference. Otherwise, IMHO, you are simply doing a disservice
to the ZFS community by propagating rules with no basis.
-- richard
Unfortunately, the 'problem' with 'field advice' is often that it is designed to cover lots of use-cases and get around potentially lots of bugs (or simple inefficiencies, and/or simple learned behavior). It is not always specific. Everyone loves it when it is, but sometimes it just isn't. It is 'best rule of thumb', designed to avoid the majority of corner cases and so on. Often it is simply advice based on what did or did not hurt in the past, with no specifics available.
The basis for my 'rules' (which are really just suggestions, that anyone is free to disregard if they wish) is really simple: does not following this suggestion routinely cause me more pain than following this suggestion? Yes? Then follow the suggestion. Much of this applies regardless of the version of ZFS or OS you're using, as it's more architectural than anything.
IMHO, you have and continue to do a disservice to the ZFS community by routinely making comments and suggestions that while academically true go against my field experience and even system administration common sense. Pushing the envelope, especially in production, is not something you suggest without stating as such and explaining caveats - certainly not to someone who hasn't specifically said they're OK with being a guinea pig.
If the scope of your experience is NexentaStor 3, then you should say so in the rules. Other
distros do have other experiences.
-- richard
"Nebulous" (in that I don't cite specific line numbers of code) is simply because I /am/ a field guy, not a developer. I may not have nearly the understanding a doctor does as to why it hurts to smash my face walking into a wall, but I do quickly learn not to do it. If it /doesn't/ hurt to walk through something, and I need to, I assure you I will - all of my advice is based on real-world experience of actual pain, not just fear-mongering. Some of those painful experiences were and are on on systems /you/ designed/approved, and continue to cause pain to this day. Ones where what academically sounded fine proved a problem in the field. I can list off 5 off the top of my head. But now I'm getting into ad hominem, so I'm going to simply say that we're going to have to agree to disagree on this. Field experience tells me one thing, and I'm sorry if that's at odds with what the code would suggest should work. "Should" and "do" are, unfortunately, not always in sync.
It is fair to say all of my advice and field experience is on Nexenta. It is fair to say NexentaStor 3.1.x is 'old'. Your comments seem a little harsh, since I know factually that any number of problems identified internally are the same or worse in latest code (such as ARC evict of L2ARC entries on pool export, for example), as people who do code have shown me the comparable bits, or asserted to me there's no difference between us and illumos. NexentaStor has done quite a few back-ports, as well. There is not so large a difference between an all-ZFS use-case NexentaStor and illumos that you can just wave your hands and claim everything is fixed in illumos. Nor can you claim that there's such a large delta between the two that common sense rules on production system limitations are radically different. Well, I suppose you can, since you basically just did - but we'll have to agree to disagree on it.
It's not like I'm suggesting you not use ZFS. I love ZFS. I'd continue to encourage its adoption and continued development even if it wasn't paying my salary, yet it also is doing that, so I'm super motivated to love ZFS. However, as with basically everything, there are often field limitations to what is possible, feasible, and safe. My last car could, on paper and in the lab (closed track), go about 180 MPH - common sense kept me from attempting it in 5 PM rush hour traffic on the freeway, and field experience suggested I'd likely never actually get 180 MPH out of it in the city (not in the lab), ever. I don't need to understand every last nuance of the engine, and where every last screw goes, to have learned through experience that veering into a wall and crashing causing physical pain, and I feel pretty confident I'm not doing anyone a disservice passing that advice on to others, but to each their own. :)
--
+1-760-896-4422
--

***@RichardElling.com
+1-760-896-4422
Andrew Galloway
2013-11-18 03:08:07 UTC
Permalink
Really? This is getting out of hand. I'm not going to argue with you that
my experience is primarily on a specific distribution. I'll also admit it
/might/ not be applicable on every combination of distro/ZFS version/kernel
version. But I'm definitely not going to argue it with you when you're
making sweeping generalizations of your own with statements like:

"This is poor advice. Where did you read it? The authors need to be
enlightened."

That would lead a reader to infer that any amount of RAM should be fine on
any system of any OS at any version level, since you didn't bother to limit
your response in any way by stating 'on latest ZFS' or 'on illumos' or 'on
Oracle Solaris 11.1', or even to state what a sane rule of thumb limit
should be in your vaunted opinion. Which implies you don't think any limit
is necessary.

OP hadn't yet clarified what OS he had in mind, and had he never replied
and taken your initial response as gospel and been planning to use Nexenta,
or an older distro of any number of Solaris derivatives, or even an older
version of Solaris itself, or I suspect ZFS On Linux (not to hate on ZoL,
but my understanding of memory management & ARC there today is it would
probably be unwise to throw a TB of RAM at an ARC on ZoL today; certainly
it shouldn't be suggested as totally OK and not without risk, which you
just did by way of omission), he'd be ill-served.

So yes, I'm guilty of making some generalizations, but mine were made to
protect against the lowest common denominator - take my advice and you
won't get burned. You, on the other hand, are making generalizations that
they're somehow using the exact magical mix of versions and distributions
that is immune to any and all problems wrt large memory. Take your advice
as originally stated before I responded and started this thread derailment,
and 9 out of 10 were going to get burned.

- Andrew
Post by Richard Elling
I apologize for distracting this thread. Just one comment below...
<snip>
Once again, you are projecting random customer requirements and
referencing nebulous
"problems" on the situation as you have experienced at Nexenta. If you
have a specific bug
in mind or use case, please reference. Otherwise, IMHO, you are simply doing a disservice
to the ZFS community by propagating rules with no basis.
-- richard
Unfortunately, the 'problem' with 'field advice' is often that it is
designed to cover lots of use-cases and get around potentially lots of bugs
(or simple inefficiencies, and/or simple learned behavior). It is not
always specific. Everyone loves it when it is, but sometimes it just isn't.
It is 'best rule of thumb', designed to avoid the majority of corner cases
and so on. Often it is simply advice based on what did or did not hurt in
the past, with no specifics available.
The basis for my 'rules' (which are really just suggestions, that anyone
is free to disregard if they wish) is really simple: does not following
this suggestion routinely cause me more pain than following this
suggestion? Yes? Then follow the suggestion. Much of this applies
regardless of the version of ZFS or OS you're using, as it's more
architectural than anything.
IMHO, you have and continue to do a disservice to the ZFS community by
routinely making comments and suggestions that while academically true go
against my field experience and even system administration common sense.
Pushing the envelope, especially in production, is not something you
suggest without stating as such and explaining caveats - certainly not to
someone who hasn't specifically said they're OK with being a guinea pig.
If the scope of your experience is NexentaStor 3, then you should say so
in the rules. Other
distros do have other experiences.
-- richard
"Nebulous" (in that I don't cite specific line numbers of code) is simply
because I /am/ a field guy, not a developer. I may not have nearly the
understanding a doctor does as to why it hurts to smash my face walking
into a wall, but I do quickly learn not to do it. If it /doesn't/ hurt to
walk through something, and I need to, I assure you I will - all of my
advice is based on real-world experience of actual pain, not just
fear-mongering. Some of those painful experiences were and are on on
systems /you/ designed/approved, and continue to cause pain to this day.
Ones where what academically sounded fine proved a problem in the field. I
can list off 5 off the top of my head. But now I'm getting into ad hominem,
so I'm going to simply say that we're going to have to agree to disagree on
this. Field experience tells me one thing, and I'm sorry if that's at odds
with what the code would suggest should work. "Should" and "do" are,
unfortunately, not always in sync.
It is fair to say all of my advice and field experience is on Nexenta. It
is fair to say NexentaStor 3.1.x is 'old'. Your comments seem a little
harsh, since I know factually that any number of problems identified
internally are the same or worse in the latest code (such as ARC evict of L2ARC
entries on pool export, for example), as people who do code have shown me
the comparable bits, or asserted to me there's no difference between us and
illumos. NexentaStor has done quite a few back-ports, as well. There is not
so large a difference between an all-ZFS use-case NexentaStor and illumos
that you can just wave your hands and claim everything is fixed in illumos.
Nor can you claim that there's such a large delta between the two that
common sense rules on production system limitations are radically
different. Well, I suppose you can, since you basically just did - but
we'll have to agree to disagree on it.
It's not like I'm suggesting you not use ZFS. I love ZFS. I'd continue to
encourage its adoption and continued development even if it wasn't paying
my salary, yet it also is doing that, so I'm super motivated to love ZFS.
However, as with basically everything, there are often field limitations to
what is possible, feasible, and safe. My last car could, on paper and in
the lab (closed track), go about 180 MPH - common sense kept me from
attempting it in 5 PM rush hour traffic on the freeway, and field
experience suggested I'd likely never actually get 180 MPH out of it in the
city (not in the lab), ever. I don't need to understand every last nuance
of the engine, and where every last screw goes, to have learned through
experience that veering into a wall and crashing causing physical pain, and
I feel pretty confident I'm not doing anyone a disservice passing that
advice on to others, but to each their own. :)
--
+1-760-896-4422
*illumos-zfs* | Archives<https://www.listbox.com/member/archive/182191/=now>
<https://www.listbox.com/member/archive/rss/182191/24484421-62d25f20> |
Modify <https://www.listbox.com/member/?&> Your Subscription
<http://www.listbox.com/>
jason matthews
2013-11-18 08:17:30 UTC
Permalink
That would lead a reader to infer that any amount of RAM should be fine on any system, of any OS, at any version level, since you didn't bother to limit your response in any way by stating 'on latest ZFS' or 'on illumos' or 'on Oracle Solaris 11.1', or even to state what a sane rule-of-thumb limit should be in your vaunted opinion. Which implies you don't think any limit is necessary.
I am sitting at 576GB of RAM on 151a8 in an eval system that has been burning a production load for about four weeks. So far so good. Until you came along I thought RAM scaled well. The larger problem for me has been processor affinity on 64 vcores. I nail our database instances down to processor sets in the same lgroup. This effectively blocks the databases from experiencing a lot of unneeded context switches, which has proven to have a large impact on latency. However, scaling the RAM to 576GB has been a no-brainer.

Andrew, you have problems with > 128GB of RAM?

I am ordering some systems for the data sci guys that have 1.5TB of RAM; we'll see how it holds up.

Here are my stats tonight…
***@db233:~# echo ::memstat | mdb -k
Page Summary                Pages                MB   %Tot
------------     ----------------  ----------------   ----
Kernel                   22283236             87043    15%
ZFS File Data           104136736            406784    69%
Anon                      4353388             17005     3%
Exec and libs               10347                40     0%
Page cache                  57350               224     0%
Free (cachelist)           564877              2206     0%
Free (freelist)          19575325             76466    13%

Total                   150981259            589770
Physical                150981258            589770

***@db233:/home/jason# ./arcstat.pl 10 10000
    time    read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz      c
00:04:56       0     0      0     0    0     0    0     0    0   432G   437G
00:05:06   17.9K    68      0    66    1     2    0     0    0   426G   437G
00:05:16   13.8K    13      0    13    0     0    0     0    0   426G   437G
00:05:26   13.4K    10      0    10    0     0    0     0    0   426G   437G
00:05:36   13.2K     8      0     8    0     0    0     0    0   426G   437G
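As a sanity check on output like the above, the MB and %Tot columns of ::memstat follow directly from the page counts. A minimal sketch, assuming the usual 4 KiB x86 page size (the figures are jason's; the helper names are mine):

```python
PAGE_SIZE = 4096  # bytes per page on x86 illumos/Solaris

def pages_to_mb(pages):
    # MB column: pages * page size, floored to whole mebibytes
    return pages * PAGE_SIZE // (1024 * 1024)

def pct_of_total(pages, total_pages):
    # %Tot column: share of all physical pages, rounded to a whole percent
    return round(100 * pages / total_pages)

total = 150981259                      # Total pages from ::memstat
zfs_pages = 104136736                  # ZFS File Data pages

print(pages_to_mb(zfs_pages))          # 406784, matching the MB column
print(pct_of_total(zfs_pages, total))  # 69, i.e. ARC data dominates this box
```

The interesting number for this thread is that 69%: on a 576GB machine, roughly 400GB is ZFS file data, and the freelist is still healthy.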
aurfalien
2013-11-18 00:29:17 UTC
Permalink
Post by Richard Elling
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Was hoping you'd chime in. Well, I don't want to cause some kind of flame war, however a simple Google search will turn up 2 particular articles, one that cites 128G as a number to stick with.

I'm ramping up from 128 to 256 this week, sticking with the notion that my env will determine when I see benefits, not so much if.
Post by Richard Elling
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive on my readings other then that.
Is there any merit to this?
Not really. There are behaviours that can occur on various OSes and how they deal with
large memory. ZFS uses the kernel memory allocation, so if the kernel can scale well,
then ZFS can scale well. If the kernel doesn't scale well... ZFS can't do it better than the
kernel.
Well, FreeBSD+ZFS has seemed to be a very good marriage, so I expect the underlying mechanism to be sound. In fact most OSes that I've touched with more than 64GB RAM seem to manage well, including Windows 7.
Post by Richard Elling
Post by aurfalien
What tools could one use to monitor this phenomena, I assume arcstat.py and zpool iostat?
ARC kstats and the arcstat tool will show the current ARC size and targets.
Post by aurfalien
Fragmentation;
I've read that if one goes over 70% storage utilization, that fragmentation will occur and be noticeable.
This is true of all file systems. There is a point at which the allocation algorithms must make
hard decisions. However, there is nothing magic about 70% for ZFS. There is some magic
that occurs at 96% and a well-managed datacenter often has a policy about going over
80% for capacity planning purposes.
Yes, my own general rule is not to exceed 80% regardless of file system or NAS technology.

I thought 70% was a bit extreme.
Post by Richard Elling
Post by aurfalien
Do tools exist to measure ZFS fragmentation?
Can you define "fragmentation" in this context? It is an overloaded term.
Well, unsure how specific I need to be, however articles I've read mention something about ZFS needing block pointer rewrite...

Googling this too will yield a few articles, one where someone claimed 30% utilization was all it took for degradation of performance to occur.

This all seems convoluted and vague, which is why I posted in the first place and was hoping for some traction.

I believe that in the end their envs were to blame, as there are many ways to implement technology. Some good, while others not so good.

- aurf
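For what it's worth, the ARC kstats Richard mentions can be read without arcstat.py: `kstat -p zfs:0:arcstats` emits tab-separated name/value pairs that are trivial to parse. A rough sketch (the sample output below is invented; `size`, `c` and `c_max` are the usual illumos arcstats statistic names):

```python
# Hypothetical `kstat -p zfs:0:arcstats` output; the byte values are made up.
sample = """\
zfs:0:arcstats:size\t464930672640
zfs:0:arcstats:c\t469188440064
zfs:0:arcstats:c_max\t541165879296"""

def parse_arcstats(text):
    # Each line is "module:instance:name:statistic<TAB>value"
    stats = {}
    for line in text.splitlines():
        key, value = line.split("\t")
        stats[key.rsplit(":", 1)[-1]] = int(value)
    return stats

arc = parse_arcstats(sample)
# Current ARC size vs. target (c) and ceiling (c_max), in GiB
print(arc["size"] / 2**30, arc["c"] / 2**30, arc["c_max"] / 2**30)
```

Watching `size` track `c` over time tells you whether the ARC is actually growing into the RAM you gave it.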
Richard Elling
2013-11-18 03:43:55 UTC
Permalink
Post by aurfalien
Post by Richard Elling
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Was hope you'd chime in. Well, I don't want to cause some kind of flame war however a simple google will result in 2 particular articles, one that sites 128G as a number to stick with.
Yes, there are such articles written by people with very specific cases that might not apply
to the general case. You don't need more than 640k of RAM. 'I think there is a world
market for about five computers'. And so on.
Post by aurfalien
I'm ramping up from 128 to 256 this week sticking with the notion that my env will determine when I see benefits, not so much if.
Yes, you'll want to test how effective it is for your workload.
Post by aurfalien
Post by Richard Elling
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive on my readings other then that.
Is there any merit to this?
Not really. There are behaviours that can occur on various OSes and how they deal with
large memory. ZFS uses the kernel memory allocation, so if the kernel can scale well,
then ZFS can scale well. If the kernel doesn't scale well... ZFS can't do it better than the
kernel.
Well FreeBSD+ZFS has seemed to be a very good marriage so I expect the underlying mechanism to be sound. In fact most OS's that I've touched with more then 64GB RAM seem to manage well including Windows 7.
Let us know how it works out :-)
Post by aurfalien
Post by Richard Elling
Post by aurfalien
What tools could one use to monitor this phenomena, I assume arcstat.py and zpool iostat?
ARC kstats and the arcstat tool will show the current ARC size and targets.
Post by aurfalien
Fragmentation;
I've read that if one goes over 70% storage utilization, that fragmentation will occur and be noticeable.
This is true of all file systems. There is a point at which the allocation algorithms must make
hard decisions. However, there is nothing magic about 70% for ZFS. There is some magic
that occurs at 96% and a well-managed datacenter often has a policy about going over
80% for capacity planning purposes.
Yes, my own general rule is not to exceed 80% regardless of file system or NAS technology.
I thought 70% was a bit extreme.
I agree with 70% being extreme. 80% is more commonly used as a rule of thumb for systems.
Post by aurfalien
Post by Richard Elling
Post by aurfalien
Do tools exist to measure ZFS fragmentation?
Can you define "fragmentation" in this context? It is an overloaded term.
Well unsure how specific I need to be however articles I've read mention something about ZFS needing block pointer rewrite...
BP rewrite will likely have no effect on ARC sizing. It is presumed to be the answer for on-disk
fragmentation.
Post by aurfalien
Googling this too will yield a few articles, one were some one claimed 30% utilization was all it took for degradation of performance to occur.
You can get 30% performance degradation just by using the inner cylinders of an HDD with
zero fragmentation :-P. There really is no substitute for testing your workload.
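Since the 80% rule of thumb keeps coming up: checking it is a one-liner against `zpool list` output. A minimal sketch (the sample text and pool names below are hypothetical; `zpool list -H -o name,capacity` emits tab-separated name/capacity pairs):

```python
# Hypothetical `zpool list -H -o name,capacity` output (tab-separated)
sample = "tank\t85%\nbackup\t42%\n"

def pools_over(text, threshold=80):
    # Return the pools whose allocated capacity exceeds the threshold
    flagged = []
    for line in text.strip().splitlines():
        name, cap = line.split("\t")
        if int(cap.rstrip("%")) > threshold:
            flagged.append(name)
    return flagged

print(pools_over(sample))  # ['tank'] - time to plan an expansion
```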
-- richard
--
***@RichardElling.com
+1-760-896-4422
Phil Harman
2013-11-18 09:48:42 UTC
Permalink
Don't discount the effect of memory size on the ARC's MFU:MRU split.
Your mileage will vary, independent of OS (which is just another factor).

A larger ARC gives more time for buffers to be promoted from an
initially unconstrained MRU into the MFU. And so the MFU grows. And as it
does so, buffers have more time to be promoted back to the front of the
MFU. And so the MFU grows. And as it does so, the MRU becomes constrained.

I'm currently playing with a non-NAS mixed online/batch workload. One of
the overnight batch tasks has multiple phases, spanning several hours,
and works best with a large MRU.

A recent hardware upgrade took the ARC from 70GB to 256GB, and with this
particular workload (i.e. YMWV), the MFU:MRU tipped from a 65GB:15GB
split to 11GB:245GB split (i.e. the MRU was reduced by 83%), with
catastrophic consequences for the one MRU-dependent task.

My problem is that the majority of this workload favours MFU, so the one
MRU-dependent task gets out-voted. Many of the MFU hogs do so because
they are poorly written (e.g. they lazily re-read data that they could
have easily cached for themselves). However, it would take many
engineer-years of effort to fix them all.

In an ideal world, adding memory would only ever make things better. So
for the real world, I'd quite like to be able to set arc_mru_min for
such cases (I've tried setting arc_p, but to no avail on my S11.1 system).
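Phil's promotion effect shows up even in a toy two-list cache: with a fixed re-use distance, a small cache evicts buffers before their second access (so nothing ever reaches the MFU), while a larger one promotes them and the MFU steadily crowds the MRU. A deliberately crude sketch, nothing like the real ARC's adaptive sizing:

```python
from collections import OrderedDict

class ToyCache:
    """Two-list cache: a second access promotes MRU -> MFU; evicts MRU first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.mru = OrderedDict()   # seen once, oldest first
        self.mfu = OrderedDict()   # seen more than once

    def access(self, key):
        if key in self.mfu:
            self.mfu.move_to_end(key)
        elif key in self.mru:
            del self.mru[key]      # promote on second access
            self.mfu[key] = True
        else:
            self.mru[key] = True
            if len(self.mru) + len(self.mfu) > self.capacity:
                victim = self.mru if self.mru else self.mfu
                victim.popitem(last=False)   # drop the oldest entry

def run(capacity, n=2000, reuse_distance=200):
    # Every key is touched twice, 200 accesses apart
    cache = ToyCache(capacity)
    for i in range(n):
        cache.access(i)
        if i >= reuse_distance:
            cache.access(i - reuse_distance)
    return len(cache.mru), len(cache.mfu)

print(run(100))  # (100, 0): too small, every re-access misses, MFU stays empty
print(run(400))  # large cache: the MFU has taken over half the space
```

Same access pattern, different cache size, completely different MRU:MFU balance; which is roughly what the hardware upgrade did to Phil's batch job.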
Phil Harman
2013-11-18 10:06:46 UTC
Permalink
Oops... too many late nights ... for the past couple of weeks I've been
writing "MRU:MFU", which is what I meant, not "MFU:MRU" :)
Don't discount the effect of memory size on the ARC's MRU:MFU
[CORRECTED] split. Your mileage will vary, independent of OS (which is
just another factor).
A larger ARC gives more time for buffers to be promoted from an
initially unconstrained MRU into the MFU. And so the MFU grows. And as
it does so, buffers have more time to be promoted back to the front of
the MFU. And so the MFU grows. And as it does so, the MRU becomes
constrained.
I'm currently playing with a non-NAS mixed online/batch workload. One
of the overnight batch tasks has multiple phases, spanning several
hours, and works best with a large MRU.
A recent hardware upgrade took the ARC from 70GB to 256GB, and with
this particular workload (i.e. YMWV), the MRU:MFU [CORRECTED] tipped
from a 65GB:15GB split to 11GB:245GB split (i.e. the MRU was reduced
by 83%), with catastrophic consequences for the one MRU-dependent task.
My problem is that the majority of this workload favours MFU, so the
one MRU-dependent task gets out-voted. Many of the MFU hogs do so
because they are poorly written (e.g. they lazily re-read data that
they could have easily cached for themselves). However, it would take
many engineer-years of effort to fix them all.
In an ideal world, adding memory would only ever make things better.
So for the real world, I'd quite like to be able to set arc_mru_min
for such cases (I've tried setting arc_p, but to no avail on my S11.1
system).
Evaldas Auryla
2013-11-18 09:20:24 UTC
Permalink
Post by aurfalien
Post by Richard Elling
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for
ARC and not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Was hope you'd chime in. Well, I don't want to cause some kind of
flame war however a simple google will result in 2 particular
articles, one that sites 128G as a number to stick with.
I'm ramping up from 128 to 256 this week sticking with the notion that
my env will determine when I see benefits, not so much if.
Post by Richard Elling
Post by aurfalien
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing
definitive on my readings other then that.
Hi all, FWIW, just to report field experience: we're running OpenIndiana
151a7 with 256GB RAM (Dell PER620 head). The ZFS data pool has 19 3-way
mirrors (DataON 1660 JBOD), serving VMware NFS datasets for 120 VMs.

***@fensalir:~$ uname -a && uptime && zpool list
SunOS fensalir 5.11 oi_151a7 i86pc i386 i86pc Solaris
10:03am  up 296 days 18:20,  1 user,  load average: 0.97, 0.91, 0.77
NAME    SIZE  ALLOC   FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
cuve   34.4T  8.74T  25.7T         -  25%  1.00x  ONLINE  -
rpool    68G  19.0G  49.0G         -  27%  1.00x  ONLINE  -

Best regards,
Evaldas
Felix Nielsen
2013-11-18 11:22:49 UTC
Permalink
Hi All,

My field experience is a tough one. My system is based on NexentaStor
3.1.x: Supermicro, 2xE5-2620, 256GB RAM, 2xJBODs, two pools (10xNL-SAS
1-way mirror, 12x10K-SAS 1-way mirror), two ZeusRAM ZIL/SLOGs, and
4xTALOS L2ARC.

4xESXi hosts, 10GbE NFS, roughly 40xVMs, no real hardcore IO systems.

I was migrating from another storage system and needed to storage migrate
all VMs to the NexentaStor - and reuse the disks as the second pool.

When migration had been pumping for a good while, I got latency alarms and
sometimes an unresponsive system for many seconds. While running dtrace
performance-gathering scripts in the migration phase, I even saw healthy
disks "failing" and beginning to re-silver :(

After lots and lots of analyzing from Nexenta, there was no real solution -
besides getting lots of striped ZeusRAMs and maybe more vdevs :(

So now I have removed half the memory and "hope" that things are better
now. I don't have the courage to test the performance again - I am careful
when migrating things.

Evaldas, are you not using a ZIL/SLOG device, and do you have issues when
storage migrating VMs?

Thanks
Felix
Post by aurfalien
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC and
not to go over that.
This is poor advice. Where did you read it? The authors need to be enlightened.
Was hope you'd chime in. Well, I don't want to cause some kind of flame
war however a simple google will result in 2 particular articles, one that
sites 128G as a number to stick with.
I'm ramping up from 128 to 256 this week sticking with the notion that my
env will determine when I see benefits, not so much if.
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive on
my readings other then that.
Hi all, FWIW, just to report field experience, we're running OpenIndiana
151a7 with 256GB ram (Dell PER620 head). ZFS data pool has 19 3xway mirrors
(DataON 1660 JBOD), serving vmware NFS datasets for 120 VMs.
SunOS fensalir 5.11 oi_151a7 i86pc i386 i86pc Solaris
10:03am up 296 days 18:20, 1 user, load average: 0.97, 0.91, 0.77
NAME SIZE ALLOC FREE EXPANDSZ CAP DEDUP HEALTH ALTROOT
cuve 34.4T 8.74T 25.7T - 25% 1.00x ONLINE -
rpool 68G 19.0G 49.0G - 27% 1.00x ONLINE -
Best regards,
Evaldas
Evaldas Auryla
2013-11-18 14:24:34 UTC
Permalink
Post by Felix Nielsen
Hi All,
My field experience is a tough one, my system is based on NexentaStor
10xNL-SAS 1way mirror, 12x10K-SAS 1way mirror, two ZeusRAM ZIL/SLOGs,
and 4xTALOS L2ARC
4xESXi hosts, 10ge NFS, roughly 40xVMs, no real hardcore IO systems.
I was migrating from another storage system and needed to storage
migrate all VMs to the NexentaStor - and reuse the disks as the second
pool
When migration had been pumping for a good while, I got latency alarms
and sometimes an unresponsive system for many seconds. While running
dtrace performance-gathering scripts in the migration phase, I even saw
healthy disks "failing" and beginning to re-silver :(
After lots and lots of analyzing from Nexenta, there was no real
solution - besides getting lots of striped ZeusRAMs and maybe more
vdevs :(
So now I have removed half the memory and "hope" that things are
better now. I don't have the courage to test the performance again - I
am careful when migrating things
Evaldas are u not using a ZIL/SLOG device and do you have issues when
storage migrating VMs?
Thanks
Felix
Felix, yes, we use a ZIL; it's a ZeusRAM. No issues with VM storage
migration; datastores are on an NFS 10 GbE network (Arista switches), 12
ESXi 5.1 hosts.

The only instability we had was a few years ago when we enabled
snapdir=visible on NFS-exported ZFS datasets; lesson learned since, we
leave the default "hidden" now.

Best regards,
Evaldas
Robert Milkowski
2013-11-19 14:25:58 UTC
Permalink
-----Original Message-----
Sent: 17 November 2013 18:43
Subject: [zfs] Too much RAM for ARC and fragmentation
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC
and not to go over that.
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive
on my readings other then that.
Is there any merit to this?
I've been deploying servers with ZFS with 256GB (and more) for quite some
time now, with none of the issues you are referring to. Obviously YMMV.
--
Robert Milkowski
http://milek.blogspot.com
Ilya Usvyatsky
2013-11-19 16:54:35 UTC
Permalink
One of the issues that has been addressed in the latest 3.x Nexenta
releases (3.1.4.2 and newer) was a hang we observed in the ARC on systems
with large (> 128GB) memory. This hang was caused by a day-one bug in ZFS
that we have identified and fixed.
Customers running newer Nexenta versions therefore should be able to
utilize larger memory configurations.
Post by Robert Milkowski
-----Original Message-----
Sent: 17 November 2013 18:43
Subject: [zfs] Too much RAM for ARC and fragmentation
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC
and not to go over that.
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive
on my readings other then that.
Is there any merit to this?
I've been deploying servers with ZFS with 256GB (and more) for quite some
time now, with no issues you are referring to. Obviously YMMV.
--
Robert Milkowski
http://milek.blogspot.com
Steven Hartland
2013-11-19 17:19:23 UTC
Permalink
I'd be interested to know more details on this.

Is this something that's been committed upstream to illumos?

Regards
Steve
----- Original Message -----
From: "Ilya Usvyatsky" <***@nexenta.com>
To: <***@lists.illumos.org>
Sent: Tuesday, November 19, 2013 4:54 PM
Subject: Re: [zfs] Too much RAM for ARC and fragmentation
Post by Ilya Usvyatsky
One of the issues that has been addressed in the latest 3.x Nexenta
releases (3.1.4.2 and newer) was a hang we observed in ARC with large (>
128GB) of memory. This hang has been caused by a day-one bug in ZFS that we
have identified and fixed.
Customers that are running newer Nexenta versions therefore should be able
to utilize larger memory configurations.
Post by Robert Milkowski
-----Original Message-----
Sent: 17 November 2013 18:43
Subject: [zfs] Too much RAM for ARC and fragmentation
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC
and not to go over that.
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive
on my readings other then that.
Is there any merit to this?
I've been deploying servers with ZFS with 256GB (and more) for quite some
time now, with no issues you are referring to. Obviously YMMV.
--
Robert Milkowski
http://milek.blogspot.com
================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it.

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to ***@multiplay.co.uk.
Ilya Usvyatsky
2013-11-19 17:40:33 UTC
Permalink
The issue and the fix have been discussed on this list.
We are planning to upstream most of our fixes as part of our ongoing 4.x
work.
There is a chance, though, that this particular fix might have been
obsoleted by the latest commits to the ARC code in upstream illumos that
came in as part of Matt's massive ZFS rework.

On Tue, Nov 19, 2013 at 12:19 PM, Steven Hartland
Post by Steven Hartland
Be interested to know more details on this?
Is this something thats been commited upstream to illumos?
Regards
Steve
----- Original Message ----- From: "Ilya Usvyatsky" <
Sent: Tuesday, November 19, 2013 4:54 PM
Subject: Re: [zfs] Too much RAM for ARC and fragmentation
One of the issues that has been addressed in the latest 3.x Nexenta
Post by Ilya Usvyatsky
releases (3.1.4.2 and newer) was a hang we observed in ARC with large (>
128GB) of memory. This hang has been caused by a day-one bug in ZFS that we
have identified and fixed.
Customers that are running newer Nexenta versions therefore should be able
to utilize larger memory configurations.
Post by Robert Milkowski
-----Original Message-----
Sent: 17 November 2013 18:43
Subject: [zfs] Too much RAM for ARC and fragmentation
Hi,
I thought to combine 2 subjects as at first glance I sort of find them odd.
RAM and ARC;
I've read a bit on how 128GB on avg seems to be the sweet spot for ARC
and not to go over that.
I've also read the more RAM for ARC the better.
If one does go over 128GB, slow downs could occur. Nothing definitive
on my readings other then that.
Is there any merit to this?
I've been deploying servers with ZFS with 256GB (and more) for quite some
time now, with no issues you are referring to. Obviously YMMV.
--
Robert Milkowski
http://milek.blogspot.com