Discussion:
New write throttles and fragmentation
Jim Klimov
2013-12-05 07:50:03 UTC
Hello all,

I am pouring lots of legacy data onto a new storage box from older
computers, and this data will stay here for quite a while. I want it
to be stored as sequentially as possible to reduce the random seeks
during subsequent scrubs and other reads. The link between this new
storage and old hosts is pretty slow (*up* to 1Mbyte/sec), and I am
concerned that writes happen all the time, even with sync=disabled.
Due to compression=gzip-9 enabled on the dataset for legacy data
and a rather weak processor, local writes (copying of these files
around) are not fundamentally faster, but can reach 15-20Mbyte/sec
when larger files are processed.

My concern is that ZFS can place parts of large files that arrive in
TXG flushes from different time ranges into substantially different
locations on disk, causing fragmentation that would be harmful for
later reads (I am not sure whether that happens in practice). In fact,
I do see read speeds of files from the pool hovering around
60-120 MByte/sec, while the pool was tested to be capable of delivering
an aggregate of at least 300 (maybe up to 500) MByte/sec in sequential
reads at the hardware level (4 HDDs in raidz1 at about 150±20 MByte/sec
each).

I tried to tune the old tunables - zfs_write_limit_override
(to flush TXG when the buffer is this full, 384MB in my test)
and zfs_txg_synctime_ms (to flush on timeout, 300 sec in my test)
but this had no noticeable effect - reads and writes still happen
concurrently, and I am still worried that writes might land onto
the pool "wherever" instead of sequentially. I also know that
these tunables may be obsolete in favor of new queuing mechanisms.
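(For reference, I set them roughly like this - an /etc/system sketch
with the values from my test, 0x18000000 being the 384MB and 300000
the milliseconds; the equivalent pokes can also be done at runtime
with mdb -kw:

  set zfs:zfs_write_limit_override = 0x18000000
  set zfs:zfs_txg_synctime_ms = 300000
)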

So... the questions are:
1) Should I worry in the first place? Or does ZFS try its best to
append new blocks of the same file to follow its previous blocks
stored in a different TXG?

2) What are the tunables now (as distributed in oi_151a8) and is
it possible to influence the writing queue the way it was possible
before? For example, given the availability of cache here, I would
be content to have the system queue up several hundred MBytes in
RAM first and then flush them to disk as one TXG with as sequential
storage as possible (DVAs are determined at the time of flush, right?)

Thanks,
//Jim
Adam Leventhal
2013-12-05 11:20:48 UTC
Hey Jim,
Post by Jim Klimov
1) Should I worry in the first place? Or does ZFS try its best to
append new blocks of the same file to follow its previous blocks
stored in a different TXG?
I don't believe there's a strong rationale for keeping data in a
single txg, but I don't have data one way or the other.
Post by Jim Klimov
2) What are the tunables now (as distributed in oi_151a8) and is
it possible to influence the writing queue the way it was possible
before? For example, given the availability of cache here, I would
be content to have the system queue up several hundred MBytes in
RAM first and then flush them to disk as one TXG with as sequential
storage as possible (DVAs are determined at the time of flush, right?)
Take a look here
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dsl_pool.c#100
-- you can set zfs_dirty_data_sync, the amount of data that will
accumulate before we sync out a txg.

You should also take a look at this block comment for tuning the IO
scheduler: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_queue.c#37
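For example, on a build that carries the new write throttle, raising
zfs_dirty_data_sync to 384MB would look something like this (a rough
sketch; pick your own value):

  # at runtime, via mdb -kw:
  echo zfs_dirty_data_sync/Z 0x18000000 | mdb -kw
  # or persistently, via /etc/system:
  set zfs:zfs_dirty_data_sync = 0x18000000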
Post by Jim Klimov
I tried to tune the old tunables - zfs_write_limit_override
(to flush TXG when the buffer is this full, 384MB in my test)
and zfs_txg_synctime_ms (to flush on timeout, 300 sec in my test)
but this had no noticeable effect - reads and writes still happen
concurrently, and I am still worried that writes might land onto
the pool "wherever" instead of sequentially. I also know that
these tunables may be obsolete in favor of new queuing mechanisms.
Yes; those tunables no longer exist.

Adam
--
Adam Leventhal
CTO, Delphix
http://blog.delphix.com/ahl
Jim Klimov
2013-12-05 12:15:16 UTC
Hello, and thanks for the links!
Post by Adam Leventhal
Hey Jim,
Post by Jim Klimov
1) Should I worry in the first place? Or does ZFS try its best to
append new blocks of the same file to follow its previous blocks
stored in a different TXG?
I don't believe there's a strong rationale for keeping data in a
single txg, but I don't have data one way or the other.
I may not have been clear: I do not require a large (i.e.
gigabyte-scale) file to be stored in one TXG. But I do want its data
blocks to be stored consecutively, at monotonically increasing offsets -
preferably as one large fragment in the DVA allocations, or at least
as a series of fragments big enough that the disk spends more time
reading than seeking from piece to piece.

If the ZFS-writing code does try to ensure this sort of unfragmented
allocation, when bits of the file come in over many minutes - good.
I just don't know if it does (or doesn't bother to) allocate with
minimal fragmentation of individual files when possible.
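(I suppose I could check the resulting layout empirically with zdb,
something like the following - paths and names made up - and eyeball
whether the DVA offsets of consecutive L0 blocks are roughly monotonic:

  # find the object number of a file, then dump its block pointer tree
  ls -i /pool/legacy/somefile.bin
  zdb -ddddd pool/legacy <object-number>
)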

Also, I try to upload from only one old source at a time, so that
allocations from different streams don't interleave with each other
on-disk. Is this a reasonable precaution, or overkill? It would be
administratively convenient to run several uploads in parallel, if
only I knew this wouldn't tank resilvers and other reads later on... ;)
Post by Adam Leventhal
Post by Jim Klimov
I tried to tune the old tunables - zfs_write_limit_override
(to flush TXG when the buffer is this full, 384MB in my test)
and zfs_txg_synctime_ms (to flush on timeout, 300 sec in my test)
but this had no noticeable effect - reads and writes still happen
concurrently, and I am still worried that writes might land onto
the pool "wherever" instead of sequentially. I also know that
these tunables may be obsolete in favor of new queuing mechanisms.
Yes; those tunables no longer exist.
Hmmm... not in illumos-gate, but they do still exist in the oi_151a8
release. And zfs_dirty_data_sync does not exist there, according to mdb.

Then it seems the old options no longer do what I expected of them;
perhaps something else interferes...

My last tuning in this area was with SXCE, back when TXG sync was
30 sec by default, and they did make a noticeable difference.


Thanks,
//Jim
Andrew Galloway
2013-12-05 12:44:58 UTC
I do not believe there is any expectation that if you're writing files over
the course of many minutes, with multiple files being simultaneously
ingested, that ZFS is going to intentionally stick the pieces of the
individual files in a sequential order on the disk, but I may be wrong.

However:

" My concern is that ZFS can place parts of large files that come
with TXG flushes from different time ranges into substantially different
locations on disk, causing the fragmentation as would
be harmful for later reads (I am not sure if that does happen in
practice). In fact, I do see read speeds of files from the pool
hovering around 60-120Mbyte/sec, while it was tested to be capable
of delivering at least over 300 (maybe up to 500) aggregate speed
in sequential reads on the hardware level (4 HDDs in raidz1 with
about 150+-20Mbyte/sec each)."

I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9. Not on raw read (no cache hit). Even on
sequential read operations. With gzip-9, there's just no way, especially
not on what you state is a lower-end CPU. The 60-120 MB/s you're presently
getting sounds about par, about middle of the road, IMHO. Even sequential
(and again no ARC hit), I'd not expect 300-500 MB/s unless either all the
stars were in alignment, or you had a minimum of 3 of those same size
vdevs, preferably more like 5 or 6.

300 - 500 MB/s @ 128K (max record size) = 2,400 - 4,000 IOPS needing to
be serviced (more than this, thanks to the need to also do some metadata
lookups as well). Yes, I know you're looking and hoping for truly
sequential access, but I wouldn't expect a single 4-disk raidz vdev would
ever actually hit this figure, not on a sequential read workload, not raw
(where ARC hit was nil).

- Andrew
Saso Kiselkov
2013-12-05 13:32:12 UTC
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9.
Actually these numbers are pretty doable. I've got a 4-disk SATA raidz
with a tiny CPU on the machine and I can easily hit ~250MB/s on a
single-threaded read or write. I suspect it's the decompression that's
the pain point here. On my small 1.3GHz Athlon II ZFS with gzip maxes
out at around ~120 MB/s on read (and that's from a ramdisk, not the
physical drives).

Cheers,
--
Saso
Andrew Galloway
2013-12-05 14:39:05 UTC
Post by Saso Kiselkov
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9.
Actually these numbers are pretty doable. I've got a 4-disk SATA raidz
with a tiny CPU on the machine and I can easily hit ~250MB/s on a
single-threaded read or write.
Me, too, which is not 300 MB/s, and nowhere near 500 MB/s. :)
Post by Saso Kiselkov
I suspect it's the decompression that's
the pain point here. On my small 1.3GHz Athlon II ZFS with gzip maxes
out at around ~120 MB/s on read (and that's from a ramdisk, not the
physical drives).
I'm only managing 90 MB/s on an ancient Opteron. :(
--
*Andrew Galloway*
Twitter: @nexseven
Skype: andrew.w.galloway
Blog: www.nex7.com
Saso Kiselkov
2013-12-05 15:04:46 UTC
Post by Saso Kiselkov
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9.
Actually these numbers are pretty doable. I've got a 4-disk SATA raidz
with a tiny CPU on the machine and I can easily hit ~250MB/s on a
single-threaded read or write.
Me, too, which is not 300 MB/s, and nowhere near 500 MB/s. :)
That's single-threaded, you may note. Multiple reads at the same time
can give higher aggregates, plus he didn't mention the drive type. I'm
testing on pretty slow consumer SATA here. High-end SATA (especially 10k
rpm) and SAS drives could easily break through 300 MB/s and maybe even
500 MB/s.

My point is: don't underestimate modern drives in brute linear
performance. There's still a lot of headroom left for optimizing the
raidz read algorithms (kind of a pet peeve of mine, but I'll leave that
aside).
Post by Saso Kiselkov
I suspect it's the decompression that's
the pain point here. On my small 1.3GHz Athlon II ZFS with gzip maxes
out at around ~120 MB/s on read (and that's from a ramdisk, not the
physical drives).
I'm only managing 90 MB/s on an ancient Opteron. :(
Bummer. That Athlon of mine is a 13W TDP part (and is itself already
quite old).

Cheers,
--
Saso
Saso Kiselkov
2013-12-05 15:12:41 UTC
Post by Saso Kiselkov
Post by Saso Kiselkov
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9.
Actually these numbers are pretty doable. I've got a 4-disk SATA raidz
with a tiny CPU on the machine and I can easily hit ~250MB/s on a
single-threaded read or write.
Me, too, which is not 300 MB/s, and nowhere near 500 MB/s. :)
That's single-threaded, you may note. Multiple reads at the same time
can give higher aggregates, plus he didn't mention the drive type. I'm
testing on pretty slow consumer SATA here. High-end SATA (especially 10k
rpm) and SAS drives could easily break through 300 MB/s and maybe even
500 MB/s.
As a quick addendum, I just ran a fully ARC-cached read test and it
maxes out at 500 MB/s, so it appears my CPU might also be a bottleneck
here. Looking at device utilization when reading at ~250MB/s I'm seeing
only around 50-70% busy on the drives, so my guess is that with a faster
CPU they would be able to do 300 MB/s or more.

Cheers,
--
Saso
Jim Klimov
2013-12-05 14:13:41 UTC
Post by Andrew Galloway
I do not believe there is any expectation that if you're writing files
over the course of many minutes, with multiple files being
simultaneously ingested, that ZFS is going to intentionally stick the
pieces of the individual files in a sequential order on the disk, but I
may be wrong.
What about single-file ingestion? ;)

Since much of the copying goes via rsync, I opted to use a fixed
partial-dir pointing to an uncompressed dataset, so that once a file
is fully received, it gets re-read and re-written to final storage
as fast as the machine can manage.
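Roughly like this (paths made up):

  # partial data accumulates in an uncompressed staging dataset at the
  # slow link speed; on completion rsync copies the file into the
  # gzip-9 destination locally, at full machine speed
  rsync -a --partial --partial-dir=/pool/staging/.rsync-partial \
      olduser@oldhost:/export/legacy/ /pool/legacy/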
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9. Not on raw read (no cache hit). Even
on sequential read operations. With gzip-9, there's just no way,
especially not on what you state is a lower-end CPU. The 60-120 MB/s
you're presently getting sounds about par, about middle of the road,
Well, at least, this is not a very bad figure... But during early tests,
when the pool was populated with various data (lots of large photos
stored without dataset compression, for example), I think I saw bursts
of higher speeds (like 300 MB/s) on reads of data that should not have
been cached in advance, or during scrubs.
Post by Andrew Galloway
300 - 500 MB/s @ 128K (max record size) = 2,400 - 4,000 IOPS needing to
be serviced (more than this, thanks to the need to also do some metadata
lookups as well).
If these are random IO's - I'm screwed (with one disk doing 150-200
of them per second, intermixed with seeks). If data is stored
sequentially, these IO's get coalesced into one large reading stroke,
so it does not matter much how many formal IO's there are.

As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)

So far the dataset in test has primarycache=all secondarycache=metadata
and hopefully much of the recent metadata does end up on SSD L2ARC
(as is relevant for reads of files at least).
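(i.e., set along the lines of - dataset name made up:

  zfs set primarycache=all pool/legacy
  zfs set secondarycache=metadata pool/legacy
  zfs get primarycache,secondarycache pool/legacy
)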

//Jim
Andrew Galloway
2013-12-05 14:45:20 UTC
Post by Jim Klimov
Post by Andrew Galloway
I do not believe there is any expectation that if you're writing files
over the course of many minutes, with multiple files being
simultaneously ingested, that ZFS is going to intentionally stick the
pieces of the individual files in a sequential order on the disk, but I
may be wrong.
What about single-file ingestion? ;)
Better!
Post by Jim Klimov
Since much of the copying goes via rsync, I opted to use a fixed
partial-dir pointing to an uncompressed dataset, so that once a file
is fully received, it gets re-read and re-written to final storage
as fast as the machine can manage.
Post by Andrew Galloway
I wouldn't expect to get 300-500 MB/s out of a single 4-disk raidz vdev
even if you /weren't/ doing gzip-9. Not on raw read (no cache hit). Even
on sequential read operations. With gzip-9, there's just no way,
especially not on what you state is a lower-end CPU. The 60-120 MB/s
you're presently getting sounds about par, about middle of the road,
Well, at least, this is not a very bad figure... But during early tests,
when the pool was populated with various data (lots of large photos
stored without dataset compression, for example), I think I saw bursts
of higher speeds (like 300 MB/s) on reads of data that should not have
been cached in advance, or during scrubs.
Some of it was likely cached "in advance" (via prefetch). And that was
uncompressed, and you still only saw 300 MB/s, which is not 500 MB/s. :)
Post by Jim Klimov
Post by Andrew Galloway
300 - 500 MB/s @ 128K (max record size) = 2,400 - 4,000 IOPS needing to
be serviced (more than this, thanks to the need to also do some metadata
lookups as well).
If these are random IO's - I'm screwed (with one disk doing 150-200
of them per second, intermixed with seeks). If data is stored
sequentially, these IO's get coalesced into one large reading stroke,
so it does not matter much how many formal IO's there are.
It still matters how many formal IO's there are. The coalescing that
happens isn't magic. It can't turn 4,000 IOPS into 4. There's a limit on
what drives can do, even on completely sequential patterns -- the highest
I've seen off a 7,200 RPM drive (as shown via iostat, anyway) is about 900.
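Easy enough to watch per-drive with something like:

  # extended per-device stats, named devices, skip idle ones, 1s samples
  iostat -xnz 1

and looking at r/s and %b for each disk.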
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
So far the dataset in test has primarycache=all secondarycache=metadata
and hopefully much of the recent metadata does end up on SSD L2ARC
(as is relevant for reads of files at least).
Maybe I'm weird, but I'd rather do the opposite -- primarycache=metadata
(and be sure to up the max metadata the ARC can handle up near the max
ARC), and secondarycache=data or all. That way the metadata is hopefully
always a quick RAM hit, and the data comes from L2ARC SSD instead of
spinning disks. But that's just me, and not based on anything but a gut
reaction.
--
*Andrew Galloway*
Twitter: @nexseven
Skype: andrew.w.galloway
Blog: www.nex7.com
Timothy Coalson
2013-12-05 20:15:17 UTC
On Thu, Dec 5, 2013 at 8:45 AM, Andrew Galloway wrote:
Post by Jim Klimov
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
I thought each metadata block that contained block pointers of file data
blocks contained as many as the file has, up to the limit of the metadata
block size, such that fetching them ahead of time would not perceptibly
speed up full-file reading. Specifically, I thought this was handled by
the indirect block mechanism, where a 128k indirect block contains 1024
block pointers, which would mean for 128k blocksize, it would need to read
only one new block of blockpointers every 128MB (so, 3 iops for metadata
for 384MB/s single file reading). However, I don't claim to fully
understand the on-disk format, so I could be on the wrong track here.
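That is, if I have the sizes right, the back-of-the-envelope math is:

  128 KB indirect block / 128 bytes per block pointer = 1024 pointers
  1024 pointers * 128 KB of data each = 128 MB of file data per indirect block
  384 MB/s / 128 MB per indirect block = ~3 metadata block reads per second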
Post by Jim Klimov
So far the dataset in test has primarycache=all secondarycache=metadata
and hopefully much of the recent metadata does end up on SSD L2ARC
(as is relevant for reads of files at least).
Maybe I'm weird, but I'd rather do the opposite -- primarycache=metadata
(and be sure to up the max metadata the ARC can handle up near the max
ARC), and secondarycache=data or all. That way the metadata is hopefully
always a quick RAM hit, and the data comes from L2ARC SSD instead of
spinning disks. But that's just me, and not based on anything but a gut
reaction.
That won't work the way you want: the L2ARC is fed asynchronously only
from blocks in the ARC - if your ARC has no data blocks in it, the L2ARC
will not get new data blocks.

Tim
Andrew Galloway
2013-12-05 22:45:46 UTC
Post by Timothy Coalson
Post by Jim Klimov
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
I thought each metadata block that contained block pointers of file data
blocks contained as many as the file has, up to the limit of the metadata
block size, such that fetching them ahead of time would not perceptibly
speed up full-file reading. Specifically, I thought this was handled by
the indirect block mechanism, where a 128k indirect block contains 1024
block pointers, which would mean for 128k blocksize, it would need to read
only one new block of blockpointers every 128MB (so, 3 iops for metadata
for 384MB/s single file reading). However, I don't claim to fully
understand the on-disk format, so I could be on the wrong track here.
Post by Jim Klimov
So far the dataset in test has primarycache=all secondarycache=metadata
and hopefully much of the recent metadata does end up on SSD L2ARC
(as is relevant for reads of files at least).
Maybe I'm weird, but I'd rather do the opposite -- primarycache=metadata
(and be sure to up the max metadata the ARC can handle up near the max
ARC), and secondarycache=data or all. That way the metadata is hopefully
always a quick RAM hit, and the data comes from L2ARC SSD instead of
spinning disks. But that's just me, and not based on anything but a gut
reaction.
That won't work the way you want: the L2ARC is fed asynchronously only
from blocks in the ARC - if your ARC has no data blocks in it, the L2ARC
will not get new data blocks.
Realized right after I posted it how dumb that was, but wasn't able to
reply before now. I was wondering if someone would catch that. :D -- Yes,
in light of that flow problem, the suggestion of metadata on L2ARC makes
sense; even if it isn't optimal, it's the only way that works.
--
*Andrew Galloway*
Twitter: @nexseven
Skype: andrew.w.galloway
Blog: www.nex7.com
Richard Elling
2013-12-06 01:26:14 UTC
[lots of good ideas in this thread :-)]
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
So far the dataset in test has primarycache=all secondarycache=metadata
and hopefully much of the recent metadata does end up on SSD L2ARC
(as is relevant for reads of files at least).
Metadata is allocated along with data. So for a single-writer workload, there is not likely
to be a significant penalty for reading metadata and prefetching will be very efficient.
(yet another reason to resurrect my spacemaps from space project :-)
Post by Jim Klimov
Maybe I'm weird, but I'd rather do the opposite -- primarycache=metadata (and be sure to up the max metadata the ARC can handle up near the max ARC), and secondarycache=data or all. That way the metadata is hopefully always a quick RAM hit, and the data comes from L2ARC SSD instead of spinning disks. But that's just me, and not based on anything but a gut reaction.
Is there a secondarycache=data option? There isn't such an option on
illumos at this time. It is my understanding that if there is no data
stored in the ARC, then there will be no data stored in the L2ARC either.

NB, prior to illumos bug #3805 being fixed, the ARC could store freed
blocks, so any ARC measurements taken before that fix will likely be
useless for design decisions made after it. We'll need to re-measure ARC
usage to get a better idea of how to design systems.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Bob Friesenhahn
2013-12-05 15:06:43 UTC
Post by Jim Klimov
If these are random IO's - I'm screwed (with one disk doing 150-200
of those intermixed with seeks). If data is stored sequentially, these
IO's get coalesced into one large reading stroke so it does not matter
much how many formal IO's there are.
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
This is not a reasonable assumption. If the file is often/recently
used then this metadata may be cached in the ARC. Remember that a
seek for a metadata block is about as expensive as a seek for a data
block (assuming data block is on one cylinder). Also remember that
COW is used and there is metadata for each COW block. You would not
want the penalty of reading all of the metadata associated with a file
simply because a program opened it and read one block.

Zfs uses a slab allocator with pre-allocation and this helps assure
that TXG data will be written reasonably close together. A slow
writer will increase fragmentation since the data is written in more
TXGs.

Using mirroring rather than raidzN will decrease effective
fragmentation. Also keeping plenty of free space will decrease
fragmentation.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2013-12-05 16:35:02 UTC
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
Post by Bob Friesenhahn
Also remember that COW is used and there is metadata for each COW block.
You would not want the penalty of reading all of the metadata associated
with a file simply because a program opened it and read one block.
Makes sense, thanks... But if ZFS detects a full-file read (or a long
enough read into a large enough file), won't it be smart enough to
prefetch all the metadata too?
Post by Bob Friesenhahn
Zfs uses a slab allocator with pre-allocation and this helps assure that
TXG data will be written reasonably close together. A slow writer will
increase fragmentation since the data is written in more TXGs.
I assume this means all the dirty userdata gets cached and flushed as
part of one TXG? That is, if I have several files (incoming streams),
they would all be intermixed, instead of each file being kept in its
own (somehow, magically) preallocated/reserved stretch of disk space?
Likewise, an incoming single stream with a single file at a time
(spread over many TXGs) would not (try to) guarantee anything close
to substantially long contiguous allocations of the same file's
pieces, either?

Actually, I thought it was like that, so I tried to enforce large
TXGs with big timeouts and big dirty-buffer sizes, but unlike my
previous experience with these tunables (which do seem to exist as
of oi_151a8), they seem to have no effect. Hardly a second passes
without writes being directed to disk in small portions (about
1-6 MByte/sec aggregate). These are negligibly short stretches in
terms of sequential IO and fast reads later on, if pieces from
different TXGs are not stored sequentially.
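(That is just from watching the pool with something along the lines
of the following - pool name is a placeholder - so the observation is
coarse, but consistent:

  # per-second pool throughput; writes trickle out at a few MB/s nearly
  # every second instead of arriving in big periodic bursts
  zpool iostat pool 1
)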
Post by Bob Friesenhahn
Using mirroring rather than raidzN will decrease effective
fragmentation. Also keeping plenty of free space will decrease
fragmentation.
Alas, the box only has 4 drive bays, and 2 disks' worth of storage
(a mirror) was too little to fit everything, so we went for raidz.
Also, at the moment the pool started out empty - lots of free space,
which I want to fill in a WORM manner with the least fragmentation
of the written data (hence the thread).

Answering other people's questions: the box is an N54L with an
"AMD Turion(tm) II Neo" 2.2GHz and 16GB RAM (about 10GB of that is
ARC with a 15GB arc_max, according to arc_summary.pl; 1.5-2GB is
consistently free, and I am not sure what the other ~4GB can be
filled with - no tasks nor GUI running).
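(I guess I should check where the rest of the RAM sits with something
like:

  # kernel memory breakdown by consumer
  echo ::memstat | mdb -k
)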

//Jim Klimov
Bob Friesenhahn
2013-12-05 17:03:36 UTC
Post by Jim Klimov
Post by Jim Klimov
As for metadata... I did kind of hope that it would be fetched from
disk for the whole file, or much of it, before reading the file data
itself. After all, we do need to find this file data somehow ;)
Post by Bob Friesenhahn
Also remember that COW is used and there is metadata for each COW block.
You would not want the penalty of reading all of the metadata associated
with a file simply because a program opened it and read one block.
Makes sense, thanks... But if ZFS detects a full-file read (or a long
enough read into a large enough file), won't it be smart enough to
prefetch all the metadata too?
It should read the block metadata when it goes to read the block. As
I have complained about before, zfs file prefetch is a simple linear
ramp which adds more prefetch for each sequential read which was not
already prefetched. After enough block read operations, zfs prefetch
does quite a lot of read-ahead, leading to excellent performance once
enough blocks have been read. The maximum read rate for a
sequentially read file depends specifically on the file size and disk
I/O latency rather than how aggressive the reader is (i.e. if the
reader is always in read()). It is claimed that overlapping read I/O
requests (via multiple threads in pread() or async I/O) may improve
performance, but then these reads might not be sequential any more if
they are processed out of order, and the program becomes much more
challenging to design. I have not tested overlapping reads from one
file, so I can't validate the claims.

It is easy enough to test how zfs file prefetch works using good old
'cpio' and directories full of files.
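For example, something like this (directory path made up):

  # read a tree of files sequentially, discarding the archive stream;
  # the elapsed time gives the prefetch-driven streaming read rate
  cd /pool/testdir && ptime sh -c 'find . | cpio -o > /dev/null'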

I believe that this prefetch problem is primarily to blame for poor
zfs filesystem to filesystem copy performance using programs like
'tar', 'cpio', and 'rsync'.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Chris Siebenmann
2013-12-05 17:38:26 UTC
| It is claimed that overlapping read I/O requests (via multiple threads
| in pread() or async I/O) may improve performance, but then these reads
| might not be sequential any more if they are processed out of order,
| and the program becomes much more challenging to design. I have not
| tested overlapping reads from one file, so I can't validate the
| claims.

Having tried to defeat it for random IO testing purposes, I can tell
you that ZFS prefetch is disturbingly superintelligent. It handles
multiple streams of IO to the same file, strided prefetches (where you
read X bytes every Y bytes), and both forward and backwards IO. If
there are patterns in your IO, ZFS will probably find at least some of
them (whether they're actual patterns in your code or simply emergent
behavior by your program).

If people want to read more detail about this, you can start with:
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSHowPrefetching

(this writeup is for some mixture of Solaris 10 update 8 behavior and
OpenSolaris source, but I suspect that nothing much has changed in ZFS
prefetching since then)

- cks
Saso Kiselkov
2013-12-05 13:12:52 UTC
Post by Jim Klimov
Hello all,
I am pouring lots of legacy data onto a new storage box from older
computers, and this data will stay here for quite a while. I want it
to be stored as sequentially as possible to reduce the random seeks
during subsequent scrubs and other reads. The link between this new
storage and old hosts is pretty slow (*up* to 1Mbyte/sec), and I am
concerned that writes happen all the time, even with sync=disabled.
Due to compression=gzip-9 enabled on the dataset for legacy data
and a rather weak processor, local writes (copying of these files
around) are not fundamentally faster, but can reach 15-20Mbyte/sec
when larger files are processed.
Use gzip-6 (aka "gzip"). gzip-9 is a very expensive placebo:
http://tukaani.org/lzma/benchmarks.html
Post by Jim Klimov
My concern is that ZFS can place parts of large files that arrive in
TXG flushes from different time ranges into substantially different
locations on disk, causing fragmentation that would be harmful for
later reads (I am not sure whether that happens in practice). In fact,
I do see read speeds of files from the pool hovering around
60-120 MByte/sec, while the pool was tested to be capable of delivering
an aggregate of at least 300 (maybe up to 500) MByte/sec in sequential
reads at the hardware level (4 HDDs in raidz1 at about 150±20 MByte/sec
each).
Check your CPU utilization to see whether you're bottlenecked there
(since you mentioned it's a slow CPU and you're running with gzip, which
isn't exactly a speed demon). Also, if you're writing sequentially and
your pool isn't near full, don't worry about fragmentation.
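For the CPU check, e.g. watch something like:

  # per-CPU utilization at 1-second intervals; a single core pegged
  # during reads would point at gzip decompression as the limit
  mpstat 1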
Post by Jim Klimov
I tried to tune the old tunables - zfs_write_limit_override
(to flush TXG when the buffer is this full, 384MB in my test)
and zfs_txg_synctime_ms (to flush on timeout, 300 sec in my test)
but this had no noticeable effect - reads and writes still happen
concurrently, and I am still worried that writes might land onto
the pool "wherever" instead of sequentially. I also know that
these tunables may be obsolete in favor of new queuing mechanisms.
Leave the job of worrying about efficient data retrieval to prefetch -
that thing is smarter than you think.
Post by Jim Klimov
1) Should I worry in the first place? Or does ZFS try its best to
append new blocks of the same file to follow its previous blocks
stored in a different TXG?
AFAIK no, blocks from consecutive TXGs are not preferentially allocated
next to each other, but unless you're running off of a CD-ROM or some
other horribly slow medium, don't worry about it. As long as your TXGs
are at least reasonably large, the seeks won't matter much, and
prefetch, NCQ and drive logic will take care of it.
Post by Jim Klimov
2) What are the tunables now (as distributed in oi_151a8) and is
it possible to influence the writing queue the way it was possible
before? For example, given the availability of cache here, I would
be content to have the system queue up several hundred MBytes in
RAM first and then flush them to disk as one TXG with as sequential
storage as possible (DVAs are determined at the time of flush, right?)
If your writing is slow but continuous, tune the TXG timeout. If you're
gunning along at maximum speed of the drives (e.g. "cp /a /b"), don't
worry about it, you will already have huge TXGs which will be pretty
well ordered.
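For example (a sketch - the default interval on current builds is 5
seconds):

  # stretch the interval between txg syncs to 30 seconds so that a slow
  # trickle of dirty data accumulates into larger txgs
  echo zfs_txg_timeout/W 0t30 | mdb -kw
  # or persistently, in /etc/system:
  set zfs:zfs_txg_timeout = 30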

Cheers,
--
Saso