Odd high CPU/load
Ray Van Dolson
2013-10-18 05:14:33 UTC
Seeing periodic sustained increases in both CPU load as well as
utilization.

We're running Nexenta 3.1.3.5 serving out files off a 300TB pool
containing 20 or so file systems (each of which have a number of
snapshots and clones).

Typically the CPU utilization on the system is around 2-5% and load is
around 2. We're generally not pushing more than a gigabit of CIFS
traffic.

However, we will occasionally see periods where CPU utilization jumps
up to around 40-50% (system time not user time) and system load spikes
up to 15. During this time the system still seems responsive from the
console, but users report delays in CIFS reads (oddly, not to
all content it seems).
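
For anyone wanting to reproduce the observation, this is roughly how
we'd watch it from the console (options approximate):

    mpstat 5          # per-CPU usr/sys/idle breakdown every 5 seconds
    prstat -mL 5      # per-thread microstates, to see who is burning sys time
    uptime            # 1/5/15 minute load averages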

In the past, we were able to "resolve" this by initiating a cluster
failover (RSF-1 w/ Nexenta). Our theory is that this solved the issue
by cycling the smbd daemon which we speculate might be buggy...
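
(In SMF terms, the failover effectively bounces the kernel CIFS
service on the node. If we wanted to try that by hand instead of a
full failover, I believe it would be roughly the following, assuming
the stock illumos/Nexenta service name -- though we haven't tested
whether that alone clears the condition:

    svcs -p network/smb/server         # confirm the smbd process under the service
    svcadm restart network/smb/server  # restart the CIFS server / smbd
)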

The issue has occurred again (or at least it looks very similar), and
this time we grabbed a FlameGraph:

https://esri.box.com/shared/static/5inwy8bxxpv1c2sui3zz.svg

Seems that much of the time is being spent in "wait/idle" ?? Very
strange.
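
(For reference, the typical illumos recipe for a CPU flame graph like
this one -- sampling rate and flags approximate, using Brendan Gregg's
FlameGraph scripts -- is along these lines:

    # sample on-CPU kernel stacks at 997 Hz for 60 seconds
    dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.kstacks
    # fold and render with the FlameGraph tools
    stackcollapse.pl out.kstacks > out.folded
    flamegraph.pl out.folded > cpu-kernel.svg
)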

Yes -- we are working with Nexenta on this (I think their sense is that
this is an smbd issue, and they are recommending we upgrade, which of
course we plan to do). With that said, I'm curious whether anyone knows
what might be going on and has suggestions for short-term workarounds
while we plan the upgrade...

Thanks,
Ray
Matthew Ahrens
2013-10-18 05:26:06 UTC
Post by Ray Van Dolson
Seeing periodic sustained increases in both CPU load as well as
utilization.
We're running Nexenta 3.1.3.5 serving out files off a 300TB pool
containing 20 or so file systems (each of which have a number of
snapshots and clones).
Typically the CPU utilization on the system is around 2-5% and load is
around 2. We're generally not pushing more than a gigabit of CIFS
traffic.
However, we will occasionally see periods where CPU utilization jumps
up to around 40-50% (system time not user time) and system load spikes
up to 15. During this time the system still seems responsive from the
console, but users report delays in CIFS reads (oddly, not to
all content it seems).
How long is the spike in CPU usage? (how many seconds, approximately)

Do you see an increase in network or disk traffic at the same time?
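
Something like the following during a spike would show it (pool name
is just a placeholder):

    zpool iostat tank 5    # pool-level read/write bandwidth and ops
    iostat -xn 5           # per-device utilization and service times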
Post by Ray Van Dolson
In the past, we were able to "resolve" this by initiating a cluster
failover (RSF-1 w/ Nexenta). Our theory is that this solved the issue
by cycling the smbd daemon which we speculate might be buggy...
The issue has occurred again (or at least it looks very similar), and
https://esri.box.com/shared/static/5inwy8bxxpv1c2sui3zz.svg
Seems that much of the time is being spent in "wait/idle" ?? Very
strange.
This is consistent with your observation that the system is 50-60% idle
(40-50% system time).

Remaining uses of CPU are:

1. compressing data and generating parity for writes. Is the machine
servicing more writes during this time of heavy CPU usage?

2. a zfs send, which looks like it is writing its output to a local file.
Could this be generating an increased write load, thus causing use #1?
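
A quick, approximate way to check for #2 is to find the send and see
where its output is actually going:

    ps -ef | grep 'zfs send'      # is a send running right now?
    pfiles <pid_of_zfs_send>      # fd 1 shows the file/mount it is writing to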

--matt



Ray Van Dolson
2013-10-18 06:59:07 UTC
Post by Ray Van Dolson
Seeing periodic sustained increases in both CPU load as well as
utilization.
We're running Nexenta 3.1.3.5 serving out files off a 300TB pool
containing 20 or so file systems (each of which have a number of
snapshots and clones).
Typically the CPU utilization on the system is around 2-5% and load is
around 2. We're generally not pushing more than a gigabit of CIFS
traffic.
However, we will occasionally see periods where CPU utilization jumps
up to around 40-50% (system time not user time) and system load spikes
up to 15. During this time the system still seems responsive from the
console, but users report delays in CIFS reads (oddly, not to
all content it seems).
How long is the spike in CPU usage? (how many seconds, approximately)
Do you see an increase in network or disk traffic at the same time?
It can actually be fairly sustained. Our last event lasted for several
days until we initiated the failover.

No really noticeable change in the network traffic patterns -- at
least not that we can see from the network graphs. Total throughput
stays the same as when things are "working fine". We'd need to dig in
further to see whether packets per second or something else changes...
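
If it happens again, I'll probably watch packet rates directly with
something along these lines (interface name is a placeholder; nicstat
only if it's installed):

    netstat -i -I e1000g0 5    # input/output packet counts per 5-second interval
    nicstat 5                  # per-NIC throughput and packet rates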
Post by Ray Van Dolson
In the past, we were able to "resolve" this by initiating a cluster
failover (RSF-1 w/ Nexenta). Our theory is that this solved the issue
by cycling the smbd daemon which we speculate might be buggy...
The issue has occurred again (or at least it looks very similar), and
https://esri.box.com/shared/static/5inwy8bxxpv1c2sui3zz.svg
Seems that much of the time is being spent in "wait/idle" ?? Very
strange.
This is consistent with your observation that the system is 50-60% idle (40-50%
system time).
1. compressing data and generating parity for writes. Is the machine servicing
more writes during this time of heavy CPU usage?
2. a zfs send, which looks like it is writing its output to a local file.
Could this be generating an increased write load, thus causing use #1?
D'oh. I'd missed the ZFS send going on. This is likely a planned
event and very well could be the culprit here. We're currently
generating a snapshot delta and writing it off to another device for
sneakernet transport to a failover site.
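
(The send itself is just an incremental stream redirected to a file --
something of this shape, with placeholder dataset, snapshot, and path
names:

    zfs send -i tank/fs@weekly-1 tank/fs@weekly-2 > /mnt/transport/fs_weekly1-2.zsend
)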

There does seem to be a lot of write activity going on currently --
maybe related to that zfs send (though that seems a bit strange to me
since it's piping its output to an NFS mount).
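
One way to separate the two while the send runs would be to compare
local pool writes against NFS client traffic -- roughly (pool name is
a placeholder):

    zpool iostat tank 5    # writes landing in the local pool (e.g. CIFS clients)
    nfsstat -c 5           # NFS client-side call counts, i.e. the send's output over NFS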

I'm still somewhat surprised that the load is so high, but this all
could very well tie back to the zfs send. Sorry for the noise if so...

Ray
