Ray Van Dolson
2013-10-18 05:14:33 UTC
Seeing periodic sustained increases in both CPU load and
utilization.
We're running Nexenta 3.1.3.5 serving out files off a 300TB pool
containing 20 or so file systems (each of which has a number of
snapshots and clones).
Typically the CPU utilization on the system is around 2-5% and load is
around 2. We're generally not pushing more than a gigabit of CIFS
traffic.
However, we will occasionally see periods where CPU utilization jumps
up to around 40-50% (system time, not user time) and system load spikes
up to 15. During this time the system still seems responsive from the
console, but users report delays in CIFS reads (oddly, not for all
content, it seems).
In the past, we were able to "resolve" this by initiating a cluster
failover (RSF-1 w/ Nexenta). Our theory is that this solved the issue
by cycling smbd, which we speculate might be buggy...
The issue has occurred again (or at least it looks very similar), and
this time I grabbed a FlameGraph:
https://esri.box.com/shared/static/5inwy8bxxpv1c2sui3zz.svg
Seems that much of the time is being spent in "wait/idle" ?? Very
strange.
Yes -- we are working with Nexenta on this (I think their sense is that
this is an smbd issue, and they are recommending we upgrade, which of
course we plan to do). With that said, I'm curious whether anyone knows
what might be going on and has suggestions for short-term workarounds
while we plan the upgrade...
Thanks,
Ray