Jim Klimov
2013-12-05 07:50:03 UTC
Hello all,
I am pouring lots of legacy data onto a new storage box from older
computers, and this data will stay there for quite a while. I want it
to be stored as sequentially as possible, to reduce random seeks
during subsequent scrubs and other reads. The link between the new
storage and the old hosts is pretty slow (*up* to 1 MByte/sec), and I
am concerned that writes trickle onto the pool all the time, even
with sync=disabled.
Because compression=gzip-9 is enabled on the dataset for the legacy
data and the processor is rather weak, local writes (copying these
files around) are not fundamentally faster either, though they can
reach 15-20 MByte/sec when larger files are processed.
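For reference, the dataset holding the legacy data is configured
roughly like this (the name "pool/legacy" is only a placeholder for
my actual dataset):
    zfs set compression=gzip-9 pool/legacy
    zfs set sync=disabled pool/legacy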
My concern is that ZFS may place parts of a large file that arrive in
TXG flushes from different time ranges into substantially different
locations on disk, causing fragmentation that would be harmful for
later reads (I am not sure whether that actually happens in
practice). In fact, I do see read speeds of files from the pool
hovering around 60-120 MByte/sec, while at the hardware level the
pool was measured to deliver at least 300 (maybe up to 500) MByte/sec
aggregate in sequential reads (4 HDDs in raidz1, each doing about
150±20 MByte/sec).
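For what it's worth, the hardware-level figure came from something
like the following raw-device read, bypassing ZFS (the device name is
an example, not my actual one); running one such dd per disk in
parallel gives the aggregate number:
    # read 4 GB straight off one raw disk, outside of ZFS
    dd if=/dev/rdsk/c1t0d0p0 of=/dev/null bs=1024k count=4096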
I tried the old tunables - zfs_write_limit_override (to flush the TXG
when the buffer is this full; 384 MB in my test) and
zfs_txg_synctime_ms (to flush on timeout; 300 sec in my test) - but
this had no noticeable effect: reads and writes still happen
concurrently, and I am still worried that writes might land on the
pool "wherever" rather than sequentially. I also know that these
tunables may be obsolete in favor of the newer queuing mechanisms.
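In case it matters, I set them roughly as shown below (the mdb and
/etc/system forms are the usual ways to poke such tunables on
illumos; the values are the ones from my test, 384 MB and 300 sec).
At runtime (does not survive a reboot):
    echo zfs_write_limit_override/Z0t402653184 | mdb -kw
    echo zfs_txg_synctime_ms/W0t300000 | mdb -kw
Or persistently, as lines in /etc/system:
    set zfs:zfs_write_limit_override = 0x18000000
    set zfs:zfs_txg_synctime_ms = 300000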
So... the questions are:
1) Should I worry in the first place? Or does ZFS try its best to
place new blocks of a file right after its previously written blocks,
even when those were stored in a different TXG?
2) What are the tunables now (as distributed in oi_151a8), and is it
possible to influence the write queue the way it was possible before?
For example, given the cache available here, I would be content to
have the system queue up several hundred MBytes in RAM first and then
flush them to disk as one TXG, laid out as sequentially as possible
(DVAs are determined at the time of the flush, right?)
Thanks,
//Jim