Discussion:
OpenZFS write throttle tuning
Adam Leventhal via illumos-zfs
2014-09-03 05:18:13 UTC
Permalink
Hey folks,

I finished up a long-overdue post on tuning the OpenZFS write
throttle. I hope this is a useful guide for those of you optimizing
OpenZFS-based systems.

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/

Adam
--
Adam Leventhal
CTO, Delphix
http://blog.delphix.com/ahl
Steven Hartland via illumos-zfs
2014-09-03 08:30:43 UTC
Permalink
----- Original Message -----
Post by Adam Leventhal via illumos-zfs
Hey folks,
I finished up a long-overdue post on tuning the OpenZFS write
throttle. I hope this is a useful guide for those of you optimizing
OpenZFS-based systems.
http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
Nice Adam, thanks for this, very informative :)

For those interested, FreeBSD has one extension to this in that
it also has min_active and max_active tunables for TRIM specifically:
vfs.zfs.vdev.trim_min_active: 1
vfs.zfs.vdev.trim_max_active: 64

The value of trim_max_active may seem high compared to the
others, but this is because up to 64 individual TRIM requests
can be combined at the CAM layer into a single device request,
so the threshold at which it starts to impact latency is higher
than that for reads and writes.
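
For anyone who wants to experiment, these look like ordinary sysctl knobs
(a sketch; the value 32 below is only an example, not a recommendation, and
depending on your FreeBSD revision they may be boot-time tunables rather
than runtime-writable):

sysctl vfs.zfs.vdev.trim_max_active        # inspect the current value
sysctl vfs.zfs.vdev.trim_max_active=32     # lower it if TRIM bursts hurt read/write latency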

Regards
Steve
Brian Menges via illumos-zfs
2014-09-03 15:51:41 UTC
Permalink
Are these DTrace tools available on FreeBSD? I'm trying to solve/improve some throughput performance on zvols across iSCSI. I have observed that the log device isn't in use at all across iSCSI... very sad indeed.

Does anyone have other tuning suggestions and settings for a high-RAM FreeBSD system? We're running both FreeBSD 9 and FreeBSD 10. I'm trying to tune for random write performance.

Thanks for this article. It's a good read.

- Brian Menges

Adam Leventhal via illumos-zfs
2014-09-03 16:51:54 UTC
Permalink
Hey Brian

The log device is not used and you're seeing high latency from iSCSI? I've actually seen log devices become a choke point for throughput-oriented workloads if, for example, your disks in aggregate can stream data faster than your small number of log devices. This is why we created the logbias property.
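
For reference, logbias is a per-dataset property (a sketch; 'tank/db' is a placeholder dataset name), so throughput-oriented datasets can be told to bypass the slog:

zfs get logbias tank/db
zfs set logbias=throughput tank/db   # stream large synchronous writes to the main pool disks
zfs set logbias=latency tank/db      # the default: use the slog to minimize sync write latency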

Adam

--
Adam Leventhal
CTO, Delphix
Sent from my mobile
Steven Hartland via illumos-zfs
2014-09-03 17:00:00 UTC
Permalink
DTrace is available on FreeBSD, yes.

If you're on FreeBSD 10, make sure you're running stable/10 rather than
10.0-RELEASE, as there are many improvements there.

Regards
Steve

Brian Menges via illumos-zfs
2014-09-03 18:18:41 UTC
Permalink
We are running 10.0-RELEASE-p7, which is a production system. That said, STABLE is a non-option with management. I should check into 10.1-RELEASE and whether or not it will include what you say is available in STABLE.

What changes/differences between STABLE and 10.0-RELEASE-p7 with regard to write performance are we talking about? Are these also related to write throttling improvements/tuning?

- Brian Menges

Steven Hartland via illumos-zfs
2014-09-03 22:57:28 UTC
Permalink
stable/10 will be frozen soon and 10.1-RELEASE created from it, so
now is a good time to check that all is well with your workload.

I have one outstanding ZFS change which I'd like to get into 10.1
but there's some debate about the final revision of that at this
time.

Regards
Steve

jason matthews via illumos-zfs
2014-09-04 06:25:10 UTC
Permalink
I have a misbehaving system with 6-32 second spa_sync() times, while all of its similarly loaded sister systems run with spa_sync() times of 5-180ms; all are running oi151a9. At the application layer, the database isn’t keeping up by a long shot. Looking at iostat, it looks like a storage brownout: nearly all goose eggs for several seconds, then large bursts of throughput, only to go back to goose eggs. Conversely, the healthy systems write 80-400MB every second.

How can I track down the root cause? There is no indication from iostat that we are dealing with a bad disk.
I looked at fsflush.d and noticed that on a healthy system there are periods of time with no releases, but on the sick system there are always 1.5k-5k releases. Not sure what this could mean, however :)

thoughts?

j.
jason matthews via illumos-zfs
2014-09-04 15:27:01 UTC
Permalink
Make sure your pools are no more than 80% full.
Thanks, but the pool is at about 50% of capacity - I should have mentioned that.

This morning the performance problem has magically gone away. I had a similar problem with this installation on Monday. At that time, all the disks were transferred to a new chassis as a hail-mary attempt at fixing the problem. Initially, the transfer made no impact. Some time later the system straightened out and started flying level. I have no idea what is going on with it.


j.

PS - speaking of the “80% rule”: I could understand why this might be a problem for spinning rust, where you have to wait for the disk to spin around to find some empty blocks, but why do we see the same behavior on SSDs?
Simon Casady via illumos-zfs
2014-09-04 16:10:13 UTC
Permalink
I feel in need of abuse today, so I will answer this even though it is
just a guess and to be taken for what you paid for it.
For pools that are not full, only one space map is needed and it is in
memory. When the pool is close to full, a single space map has limited
free space, so to meet the needs of the system space maps are paged in
and out of memory, and that is slow.


jason matthews via illumos-zfs
2014-09-04 17:11:47 UTC
Permalink
I feel in the need of abuse today so I will answer this even though it is just a guess and to be taken for what you paid for it.
Don’t sweat it. I appreciate the attempt.
For not full pools only one spacemap is needed and is in memory.
I believe there is actually one space map per metaslab, and each vdev is sliced into several hundred metaslabs. If memory serves, all space maps are held in an in-memory AVL tree, so I don’t think space maps are paged in and out of memory. In any case, there are hundreds of gigabytes in ARC and free lists to hold them.

thanks!
Jason Matthews via illumos-zfs
2014-09-05 00:41:27 UTC
Permalink
I am not sure CPU utilization was particularly high.

How can we apply the scientific method to this hypothesis?

This sounds like something to be measured to gauge the overall health of the pool.

How do you gather fragmentation data at the metaslab level? Are there existing tools that give some insight?

J.

Sent from my iPhone
The issue is not really in the fact that the space map is in memory or not - it’s in the algorithm used for finding the best allocation for a write.
Suppose we need to write 100MB to a completely empty metaslab - then it’s simple, we write 100MB sequentially to it. One pass needed. Now, suppose the metaslab is fragmented into 60,000 free space segments, the largest of which is 128K (I’ve seen similar amounts of fragmentation on production pools) - in that case, the algorithm will have to traverse the not-so-small space map at least 800 times in order to allocate the 100MB. Multiply that by # of vdevs, rinse, repeat with each txg commit. It’s a CPU-intensive process.
Best regards,
Kirill Davydychev
Enterprise Architect
Nexenta Systems, Inc.
Kirill Davydychev via illumos-zfs
2014-09-05 22:53:38 UTC
Permalink
Forgot to mention - it is also important to look at “maxsize” reported by zdb -mm - that is the maximum contiguous free space in the metaslab. If it approaches your dataset recordsize/volblocksize, you’re probably in trouble.

Best regards,
Kirill Davydychev
Enterprise Architect
Nexenta Systems, Inc.
Kirill Davydychev via illumos-zfs
2014-09-05 22:35:29 UTC
Permalink
Post by Jason Matthews via illumos-zfs
I am not sure CPU utilization was particularly high.
It doesn’t have to be high, just high enough. I usually gauge pool fragmentation by looking at a “hotkernel” (part of the DTrace toolkit) trace over ~60 seconds. Top ZFS functions sampled on affected systems are: metaslab_segsize_compare(), space_map_seg_compare(), space_map_remove(), and a much lower percentage of space_map_load() and space_map_add(). You should also see a high percentage of AVL-specific functions in genunix: avl_find(), avl_destroy_nodes(), avl_insert(), avl_walk() and avl_rotation(). If multiple of those are in the top10 of sampled functions, the system is almost certainly suffering from fragmentation. If they’re in Top20, you’re dangerously close.
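
For example (a sketch of the workflow; the exact path depends on where the DTrace toolkit is installed):

./hotkernel   # sample on-CPU kernel functions; let it run ~60 seconds, then Ctrl-C
              # then look for metaslab_*, space_map_* and avl_* entries near the top of the output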

I suspect that there’s a threading bottleneck there somewhere, making the CPU utilization less noticeable on multicore systems, but haven’t done much further research into this. George Wilson may perhaps chime in here.
Post by Jason Matthews via illumos-zfs
How can we apply the scientific method to this hypothesis?
Not too scientific, but see explanation above.
Post by Jason Matthews via illumos-zfs
This sounds like something to be measured to gauge over all health of the pool.
How do you gather fragmentation data at the meta slab level? Are there existing tools that give some insight?
zdb -mm <poolname> should give # of segments for every spacemap. Note that it doesn’t always work on busy pools, and is unreliable. Sometimes you have to run it with -e (zdb -mm -e <poolname>) even on pools that are imported.
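
For example (a sketch; 'tank' is a placeholder pool name):

zdb -mm tank | egrep 'metaslab|segments'      # per-metaslab free space, segment count and maxsize
zdb -mm -e tank | egrep 'metaslab|segments'   # fall back to -e if the first form fails on a busy pool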

Best regards,
Kirill Davydychev
Enterprise Architect
Nexenta Systems, Inc.
Richard Elling via illumos-zfs
2014-09-04 17:05:01 UTC
Permalink
Post by jason matthews via illumos-zfs
I have a misbehaving system with 6-32 second spa_sync() times, while all of its similarly loaded sister systems run with spa_sync() times of 5-180ms; all are running oi151a9. At the application layer, the database isn’t keeping up by a long shot. Looking at iostat, it looks like a storage brownout: nearly all goose eggs for several seconds, then large bursts of throughput, only to go back to goose eggs. Conversely, the healthy systems write 80-400MB every second.
Sounds like bad cables or other unhappiness in the data path. Could also be a wounded disk.
Any errors logged against disks?
-- richard
jason matthews via illumos-zfs
2014-09-04 17:11:55 UTC
Permalink
Post by Richard Elling via illumos-zfs
Sounds like bad cables or other unhappiness in the data path. Could also be a wounded disk.
Any errors logged against disks?
Thanks for chiming in Richard.

No, the disks have no errors logged. In this hardware class all the drives (DC S3700) are internal and directly attached. I moved the drives to a new chassis on Monday because I thought the same thing.

This morning, spa_sync() times are back to normal. The last thing I did before I gave up last night was delete the time-slider snapshots, which reduced the allocation from 54% to 46%; that shouldn’t impact the system in any significant way.

thanks!

j.


Richard Elling via illumos-zfs
2014-09-05 01:29:14 UTC
Permalink
Post by jason matthews via illumos-zfs
Post by Richard Elling via illumos-zfs
Sounds like bad cables or other unhappiness in the data path. Could also be a wounded disk.
Any errors logged against disks?
Thanks for chiming in Richard.
No, the disks have no errors logged. In this hardware class all the drives (DC S3700) are internal & directly attached. I have moved the drives to a new chassis on Monday because I thought the same thing.
Amongst the logs to check are the logs on the disk. Alas, those are SATA disks
and I'm not sure what sort of log retrieval success you might have.

For a SCSI disk, one can use "sg_logs --page=0x18" to see the port log details.

On the HBA side, mptsas devices have link error logs you can observe via "sasinfo hba-port -vl"

Symptoms of poor port connections include high latency with no FMA error logs or syslog
messages associated with the condition.
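
As a rough consolidated checklist (the device path below is only an example):

sg_logs --page=0x18 /dev/rdsk/c0t5000CCA01234ABCDd0s0   # SCSI protocol-specific port log page (SAS disks)
sasinfo hba-port -vl                                    # per-phy link error counters on mptsas HBAs
iostat -En                                              # cumulative soft/hard/transport error counts per device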
-- richard
Richard Elling via illumos-zfs
2014-09-03 18:24:58 UTC
Permalink
Post by Brian Menges via illumos-zfs
Are these dtrace tools available to FreeBSD? I'm trying to solve/improve some throughput performance on zvol across iscsi. I have observed that the log device isn't in use at all across iscsi... very sad indeed.
[i]SCSI is a non-blocking (dare we say asynchronous?) protocol, so we don't expect slogs to be
used except for when the client issues a cache flush or the client disables the write cache. This
is a good thing, not a sad thing :-)
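
If you want to confirm that on your system (a sketch; 'tank' is a placeholder pool name), watching the log vdev rows is usually enough:

zpool iostat -v tank 5   # the 'logs' section should stay near zero unless clients issue
                         # cache flushes or sync writes force the slog into play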
-- richard
jason matthews via illumos-zfs
2014-09-03 20:13:38 UTC
Permalink
This is great stuff.

I find that I have systems that are pushing up against the 60% threshold. The systems have 576GB of RAM, most of which goes to ARC, so I am comfortable increasing the buffer.

I want to change zfs_dirty_data_max from 4GB to 6GB; how do I do it?

I tried echo zfs_dirty_data_max/W{2000,200000000} and both reported a value of 0 when I went to verify the size on OI151a9. As a workaround, I increased the min threshold from 60% to 75%, which gave the system some relief, but I see this as just a workaround to measure performance deltas.


thanks in advance,
j.
Matthew Ahrens via illumos-zfs
2014-09-03 22:05:13 UTC
Permalink
On Wed, Sep 3, 2014 at 1:13 PM, jason matthews via illumos-zfs <
Post by jason matthews via illumos-zfs
This is great stuff.
I find that I have systems that are pushing up against the 60% threshold.
The systems have 576gb of ram most of which goes to ARC so I am comfortable
increasing the buffer.
I wanted to change zfs_dirty_data_max to 6gb from 4gb, how do i do it?
i tried echo zfs_dirty_data_max/W{2000,200000000} and both reported a
value of 0 when I went to verify the size on OI151a9. As a work around, I
increased the min threshold from 60% to 75% which gave the system some
relief but I see this as just a work around to measure performance deltas.
In mdb, you need to use /Z to modify a 64-bit value.
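
For example (a sketch, assuming you want 6 GiB; mdb needs -kw to write to the live kernel):

echo 'zfs_dirty_data_max/Z 0x180000000' | mdb -kw   # set to 6 GiB (0x180000000 bytes)
echo 'zfs_dirty_data_max/J' | mdb -k                # verify; /J prints the 64-bit value in hex

To persist across reboots, the usual /etc/system form would be something like:

set zfs:zfs_dirty_data_max = 6442450944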

--matt
jason matthews via illumos-zfs
2014-09-08 09:13:57 UTC
Permalink
Forgot to mention - it is also important to look at “maxsize” reported by zdb -mm - that is the maximum contiguous free space in the metaslab. If it approaches your dataset recordsize/volblocksize, you’re probably in trouble.
Thanks Kirill. That was super helpful but I have some additional questions.

This looks typical at the moment on the troubled zpool, but it doesn’t generally meet your criterion of maxsize approaching the record size, which is 8k.

segments 193486 maxsize 468K freepct 58%
metaslab 81 offset 5100000000 spacemap 2161 free 2.40G
segments 195824 maxsize 512K freepct 59%
metaslab 82 offset 5200000000 spacemap 2168 free 2.37G
segments 173325 maxsize 531K freepct 59%
metaslab 83 offset 5300000000 spacemap 2272 free 2.40G
segments 160853 maxsize 472K freepct 60%
metaslab 84 offset 5400000000 spacemap 2453 free 2.31G



That said, I migrated one of the postgres shards, using zfs send/recv, onto a new mirror, single vdev. Instead of following conventional wisdom and setting the record size to 8k, I left it at 128k to try to avoid the fragmentation problem down the road. The output below looks much more tightly compacted, consisting of slabs that are highly utilized (3-4% free versus 50-60% free) and other slabs that are empty (not surprising). The maxsize is fairly large at 130-170MB free versus 200-500k free.

How sure are you that maxsize needs to approach recordsize before we suffer from fragmentation? Is 50x recordsize still conceivably a problem? The original pool got down to a point where 2GB of space was available on all the slabs and the largest contiguous space was less than 600k in all instances. That seems fragmented to me. What do you think? I don’t feel like I have a good bead on this yet, but it is plain to see the new pool is much healthier than the old one.

Here is the output from the new mirror:

metaslab 95 offset 5f00000000 spacemap 149 free 133M
segments 137 maxsize 133M freepct 3%
metaslab 96 offset 6000000000 spacemap 0 free 4G
segments 1 maxsize 4G freepct 100%
metaslab 97 offset 6100000000 spacemap 185 free 164M
segments 386 maxsize 133M freepct 4%
metaslab 98 offset 6200000000 spacemap 0 free 4G
segments 1 maxsize 4G freepct 100%
metaslab 99 offset 6300000000 spacemap 0 free 4G
segments 1 maxsize 4G freepct 100%
metaslab 100 offset 6400000000 spacemap 0 free 4G







