Discussion:
ZFS updates & feature requests
Karl Wagner
2013-04-24 11:00:23 UTC
Hi all

I have been reading through the wiki pages on Illumos ZFS and noticed some
new features I would be very interested in. They also lay the groundwork
for some features I would like to request.

I currently run the FreeBSD 9.1 release on my home file server. I'm not
sure whether this is the right place to be discussing ZFS on FreeBSD, but
it seems that improvements here are ported over. If I am on the wrong list,
please let me know.

To get to the point, I was wondering when L2ARC compression and persistence
would be available in FreeBSD. It may be that they are already available in
another branch, but I couldn't find that info. I would prefer to stick with
releases anyway.

On to the feature request. I believe that all the groundwork is there to
add a few features which would present a potentially large improvement to
the performance of ZFS. These are:

- Combined cache and log device: This would allow cache devices to
contain virtual log devices. Blocks would be allocated as needed, allowing
them to grow or shrink. Obviously you would need a pair of cache devices
for redundancy, but only the log blocks would need to be mirrored. A more
dynamic caching system like this would simplify the setup of a ZFS pool and
(potentially) leave more space available for the next feature.
- ZIL txg commit delay and ARC eviction: I am not sure I am using the
correct terminology, but this seems to fit. What I am suggesting is that,
when it comes to committing data which is held in the ZIL, we check how
"busy" the pool is. If forcing a commit of that data would degrade the
performance of the pool, we skip it and wait. In addition, IIRC the data
waiting to be committed is currently held in the ARC. With this change, we
allow this data to be evicted (or pushed to the L2ARC), and then recalled
when we are ready to commit. This, along with the next (possibly unneeded)
feature, would allow a log device to become a real write cache.
- Async ZIL push: IIRC, only sync writes cause entries in the ZIL to be
written. I may be completely wrong. However, if this is the case, I would
propose changing this in line with the above feature: any async-written
data would be allowed to be evicted from the ARC by writing an entry to the
ZIL.
- Prioritised cache devices: Allow multiple cache devices to be given
priorities/levels, such that data to be evicted from the top level L2ARC is
actually migrated down to the next level. This would basically become a
multi-level HSM system.
- DDT preference/forced load in L2ARC: Unrelated to the rest. As we all
know, ZFS dedupe is very much dependent on having enough RAM and/or L2ARC
space available. What would be nice, especially with a persistent cache
device, is to be able to tell ZFS to keep the DDT in the L2ARC. If not on a
persistent device, allow it to force a load of the DDT into the ARC/L2ARC on
boot/import.

Finally, something on my wishlist but without a real link to anything above:

- Offline/delayed dedupe: Allow dedupe to be set in such a way that
incoming writes are not checked against the DDT immediately. Instead, they
are committed as if dedupe were off. Then allow a background process
(kicked off like a scrub) to examine this data and check for duplicates.
This could be started manually from the command line, scheduled, or
possibly automatically by ZFS when it detects a "quiet" pool, suspending
if activity is detected. This could, possibly, allow the space savings of
dedupe to be realised on large datasets by those without the RAM required
for the current dedupe implementation. (A rough way to gauge those
potential savings with today's tools is shown just below.)
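
As a side note on sizing (not a substitute for the features requested
above): the potential savings, and the rough DDT footprint, can already be
estimated with zdb. The pool name "tank" is a placeholder and the exact
output varies between platforms and versions:

# Simulate dedup on an existing pool without enabling it; prints a DDT
# histogram and an estimated dedup ratio (reads every block's checksum,
# so it is slow and I/O-heavy on large pools):
zdb -S tank

# On a pool that already has dedup enabled, print DDT statistics, which
# give an idea of how much RAM/L2ARC the table needs to stay cached:
zdb -DD tank

In-core DDT entries are commonly cited at a few hundred bytes of RAM each,
so the entry counts reported by these commands translate fairly directly
into the memory a "keep the DDT in L2ARC" feature would have to pin.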

Phew! OK, I know there is a lot there. Some (or all) may well be either
impossible or far more work than is practical. To be honest, I would love
to help out with this lot, although I would need some pointers/guidance to
understand how the source is organised and what happens where. (This has
held me back from participating in several FOSS projects before: the
codebase is so large that figuring out where to begin is a nightmare!)

I'd love to hear your comments and feedback.

Regards
Karl



Richard Elling
2013-04-24 16:14:01 UTC
Hi Karl,
Some comments below...
Post by Karl Wagner
Hi all
I have been reading through the wiki pages on Illumos ZFS and noticed some new features I would be very interested in. They also lay the groundwork for some features I would like to request.
I currently run the FreeBSD 9.1 release on my home file server. I'm not sure whether this is the right place to be discussing ZFS on FreeBSD, but it seems that improvements here are ported over. If I am on the wrong list, please let me know.
To get to the point, I was wondering when L2ARC compression and persistence would be available in FreeBSD. It may be that they are already available in another branch, but I couldn't find that info. I would prefer to stick with releases anyway.
Combined cache and log device: This would allow cache devices to contain virtual log devices. Blocks would be allocated as needed, allowing them to grow or shrink. Obviously you would need a pair of cache devices for redundancy, but only the log blocks would need to be mirrored. Allowing this more dynamic caching system would simplify the setup of a ZFS pool, as well as allowing (potentially) more space to be available for the next feature.
This is possible today using disk partitions and is frequently done in small deployments.
OTOH, creating a hybrid log/cache also adds significant complexity to the administration
and troubleshooting of the system. I'm not convinced that complexity is worth the effort,
when partitioning is already an effective way to manage devices.
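
For example, assuming each of a pair of SSDs has already been split into a
small slice for the log and a larger one for the cache (the device/slice
names below are only illustrative):

zpool add tank log mirror c1t0d0s0 c1t1d0s0
zpool add tank cache c1t0d0s1 c1t1d0s1

The log slices are mirrored for redundancy; cache devices cannot be
mirrored, so the cache slices are simply used side by side.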
Post by Karl Wagner
ZIL txg commit delay and ARC eviction: I am not sure I am using the correct terminology, but this seems to fit. What I am suggesting is that, when it comes to committing data which is held in the ZIL, we check how "busy" the pool is. If it is going to degrade the performance of the pool to force a commit of that data, we skip it and wait. In addition, IIRC the data waiting to be committed is currently held in the ARC. With this change, we allow this data to be evicted (or pushed to the L2ARC), and then recalled when we are ready to commit. This, along with the next (possibly unneeded) feature, allow a LOG device to become a real write cache.
The reason the ZIL exists is to satisfy the commit-to-persistent-media semantics of storage
protocols. The ARC itself is a write cache, so adding complexity there, or worse -- adding
disk I/O latency, is not likely to be a win.

NB, if you want something more PAM-like, then pay for a fast, nonvolatile SSD for log and set
sync=always. Voila! Just like Netapp! :-)

Otherwise, this sounds like the logbias option, or its automated cousin. In any case, L2ARC does not
apply here because the data will be long since committed to the pool before it is considered for
movement to L2ARC.
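
For reference, a minimal illustration of the knobs mentioned above (the
pool/dataset names and the device are placeholders):

zpool add tank log c2t0d0               # dedicated fast, nonvolatile slog
zfs set sync=always tank/vms            # push every write through the slog
zfs set logbias=throughput tank/backup  # skip the slog for streaming writes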

Sidebar question: how would one decide that the pool is "busy"?
Post by Karl Wagner
Async ZIL push: IIRC, only sync writes cause entries in the ZIL to be written. I may be completely wrong. However, if this is the case, I would propose changing this in line with the above feature. Any async written data would be allowed to be evicted from the ARC by writing an entry to the ZIL.
You are correct. Async data is cached in ARC, no need to write it to the ZIL. In fact, the whole idea
is to avoid writing it to the ZIL.
Post by Karl Wagner
Prioritised cache devices: Allow multiple cache devices to be given priorities/levels, such that data to be evicted from the top level L2ARC is actually migrated down to the next level. This would basically become a multi-level HSM system.
FWIW, there once was a project to integrate ZFS into SAM (Sun's HSM product). I'm not sure it ever
completed the back-of-the-napkin design phase, though, and AIUI, Oracle has taken the SAM code
back to proprietary status. In any case, I think a proper HSM design would operate on different
principles than the ARC, because the ARC only caches blocks.
Post by Karl Wagner
DDT preference/forced load in L2ARC: Unrelated to the rest. As we all know, ZFS dedupe is very much dependent on having enough RAM and/or L2ARC space available. What would be nice is, especially on a persistent log device, to be able to tell ZFS to keep the DDT in L2ARC. If not on a persistent device, allow it to force a load of the DDT into ARC/L2ARC on boot/import.
Offline/delayed dedupe: Allow dedupe to be set in such a way that incoming writes are not checked against the DDT immediately. Instead, they are committed as if dedupe was off. Then, allow a background process to examine this data and check for duplicates to be kicked off (like a scrub). This could be manually from the command line, scheduled, or possibly automatically by ZFS when it detects a "quiet" pool, suspending if activity is detected. This could, possibly, allow the space savings of dedupe to be realised on large datasets by those without the RAM required for the current dedupe implementation.
Check the archives, the DDT/metadata horse is beaten to death every few months.
Post by Karl Wagner
Phew! OK, I know there is a lot there. Some (or all) may well be either impossible or far more work than is practical. To be honest, I would love to help out with this lot, although I would need some pointers/guidance to understand how the source is organised and what happens where (this has held me back from participating in several FOSS projects before, the codebase is so large that figuring out where to begin is a nightmare!)
I'd love to hear your comments and feedback.
Regards
Karl
--

***@RichardElling.com
+1-760-896-4422
Karl Wagner
2013-04-25 09:16:18 UTC
Hi Richard,

Thanks for the info. Some of this is very useful.

I have made some comments below.
Post by Richard Elling
Hi Karl,
Some comments below...
- Combined cache and log device: This would allow cache devices to
contain virtual log devices. Blocks would be allocates as needed, allowing
them to grow or shrink. Obviously you would need a pair of cache devices
for redundancy, but only the log blocks would need to be mirrored. Allowing
this more dynamic caching system would simplify the setup of a ZFS pool, as
well as allowing (potentially) more space to be available for the next
feature.
This is possible today using disk partitions and is frequently done in small deployments.
OTOH, creating a hybrid log/cache also adds significant complexity to the administration
and troubleshooting of the system. I'm not convinced that complexity is worth the effort,
when partitioning is already an effective way to manage devices.
WRT system administration, I believe this could be a way to massively
simplify it for small deployments. I understand that partitioning is an
effective method of placing the cache and log on the same device, but it
requires both an understanding of how the ZFS cache and log work, and an
understanding of your current (and future) workloads, to decide on the
sizes and whether you even need either. Combining the two would not "waste"
space on a barely used log device. It would become a set-and-forget
option: install a (pair of) SSD(s), make them into a combined cache/log,
and the system will use them as necessary.

Having had a quick read of how the persistent cache works (at
http://wiki.illumos.org/display/illumos/Persistent+L2ARC), I would suggest
the following. It appears to me to be "reasonably" simple to implement,
although I have yet to get my head around the source code, so it may be
much more complicated than I realise. Note that this is obviously
predicated on a persistent cache device, as you don't want your ZIL going
missing on a system crash.

- When a sync write comes in, we grab exclusive control of the
l2arc_feed_thread.
- This is forced to write an immediate pbuf (possibly with a flag set
saying the next block is a log block) to at least n devices (which would
either be predefined, maybe all devices, or possibly a user-defined value).
- Write the log record to the selected devices, followed by another pbuf.
- Release control of l2arc_feed_thread.

That's the writing taken care of. The locations of the logs on the cache
devices would be held in RAM, which could be rebuilt when needed (e.g.
after a system crash).
If the log record is too small for this to be efficient, we should probably
allocate extra space so that additional log records can fit in the same
space. I would need to understand the inner workings better to make a call
on this.
AFAIK, the log device loops back to the beginning when it reaches the end,
so there would need to be code in place to check that log records are not
overwritten before that is allowed (assuming I am correct that logs are
"removed" when they are no longer needed).
Post by Richard Elling
-
- ZIL txg commit delay and ARC eviction: I am not sure I am using the
correct terminology, but this seems to fit. What I am suggesting is that,
when it comes to committing data which is held in the ZIL, we check how
"busy" the pool is. If it is going to degrade the performance of the pool
to force a commit of that data, we skip it and wait. In addition, IIRC the
data waiting to be committed is currently held in the ARC. With this
change, we allow this data to be evicted (or pushed to the L2ARC), and then
recalled when we are ready to commit. This, along with the next (possibly
unneeded) feature, allow a LOG device to become a real write cache.
The reason the ZIL exists is to satisfy the commit-to-persistent-media
semantics of storage
protocols. The ARC itself is a write cache, so adding complexity there, or worse -- adding
disk I/O latency, is not likely to be a win.
NB, if you want something more PAM-like, then pay for a fast,
nonvolatile SSD for log and set
sync=always. Voila! Just like Netapp! :-)
Otherwise, this sounds like the logbias option, or its automated cousin.
In any case, L2ARC does not
apply here because the data will be long since committed to the pool
before it is considered for
movement to L2ARC.
The ARC may be a write cache. However, IIRC it only caches very small
amounts, committing to disk very quickly. What I am trying to propose is a
much larger cache which would deal with peak write loads far in excess of
what the pool's underlying storage can handle. Rather than slowing
everything down to a crawl, the written data goes into the log, where it
could sit for some minutes. Meanwhile, the normal workload is served from
the pool.

It would work the other way around, too. Say a heavy (sequential) read
workload is taking place, but other clients are running lighter,
mostly-write workloads. The heavy sequential read is mostly coming straight
from the main pool vdevs. The writes coming in from the other clients are
cached, safely, in the combined cache/log until the pool is able to accept
them.
Post by Richard Elling
Sidebar question: how would one decide that the pool is "busy"?
I don't know exactly. But I would say you would look at outstanding
transactions, both read and write. If there are too many, it's busy. This
could possibly be tunable with limits on latency. This is probably a whole
topic in itself. I admit that I don't know enough to answer this one :)


Just a quick note: one of the projects I am currently looking into is,
basically, a simplified HSM system. The underlying pool storage would be on
slow media (the two I am thinking of are a mechanical disk library, similar
to a tape library but using HDDs, and "cloud storage" with limited
bandwidth out to the internet). In this situation even a hard disk is a
hell of a lot quicker than the main storage, which is why I was suggesting
additional cache levels. So, a write comes in (which could be many
gigabytes) and it is safely "buffered" in the log. It can then be "drip
fed" to the slow storage. Meanwhile, there is still an L2ARC on the
combined cache/log (SSD-based), and there is also a large cache on LnARC
devices, which are just HDDs, avoiding expensive pulls from the main storage.

Hope this makes sense.

Karl



Jim Klimov
2013-04-25 10:14:09 UTC
Post by Karl Wagner
AFAIK, the log device loops back to the beginning when it reaches the
end, so there would need to be code in place to check that log records
are not overwritten before it is allowed (assuming I am correct that
logs are "removed" when they are no longer needed).
I believe this is the hardest part :)

AFAIK both the ZIL and L2ARC are "ring buffers", in that when new writes
reach the end of the device, they restart at offset zero. In the case of
the current ZIL, since it knows its preallocated size, and knows whether
the blocks at "offset zero" were flushed to disk, it can choose to
fall back to using main pool storage for the ZIL, as it does if there is
no dedicated log device (or if those get broken/removed). This part
would be harder with an unknown size of allocations.

In the case of the current (non-persistent) L2ARC, it is only as valid as
the pointers from RAM to the SSD. If the ring buffer overwrites some
SSD-cached data, and a later read via the RAM pointer tries to fetch it,
an error occurs (checksum mismatch) and the read is done from the main
pool. Also, L2ARC data is AFAIK effectively discarded after such a read,
because its pointer can be freed and instead the full block is cached
in the RAM ARC. Later it can be opportunistically pushed out to some
other location on the L2ARC, leaving only a reference behind.

Persistent L2ARC builds on this, allowing the data on the SSD to be read
and the list of pointers in RAM to be reconstructed; after that everything
is more or less the same - we try to write, we try to read, and we don't
despair if reads fail for any reason, including blind overwrites
of older bits.

Adding precautions to not overwrite ZIL data on an L2ARC SSD as long
as that ZIL data is still needed (known to be not yet committed to the
main pool) might be a complex coding exercise and costly in I/O or RAM
overheads. Maybe not - make a POC to see, test and tell us ;)

Also, I do get your point about more efficient use of a few SSDs on
smaller systems (being with such machines myself), but note that
L2ARC devices can generally be cheaper and less resilient to wear
(we don't lose integrity if some sectors wear out or the device
breaks), while SLOG devices are written to intensively, often
use hardware more resilient to writes (SLC chips, DDR as main
storage, etc.) - and more expensive per unit of capacity - and are
normally never read back while all is OK. Your introduction of
write caching/deferral via the SLOG might, however, change the stance
on the "never read back" part ;)

Still, I think this is an interesting proposal. Just not easy :)

RE>> Sidebar question: how would one decide that the pool is "busy"?
Post by Karl Wagner
I don't know exactly. But I would say you would look at outstanding
transactions, both read and write. If there are too many, it's busy.
This could possibly be tunable with limits on latency. This is probably
a whole topic in itself. I admit that I don't know enough to answer this
one :)
One way to look at it is to keep track of the "busy-ness" of the pool's
devices (and/or the pool itself as an I/O device) via iostat statistics
(percent busy, number of outstanding requests, recent service times -
i.e. over a TXG lifetime or a few). This would allow detecting contention
on the devices, and could also help route writes to the less-slow pool
components, maintaining an overall good speed rather than that of the
slowest device. It would also show whether the devices are mostly busy
with reads or writes, so a strategy could be chosen accordingly.

//Jim Klimov
Gregg Wonderly
2013-04-25 13:43:25 UTC
Post by Jim Klimov
Post by Karl Wagner
AFAIK, the log device loops back to the beginning when it reaches the
end, so there would need to be code in place to check that log records
are not overwritten before it is allowed (assuming I am correct that
logs are "removed" when they are no longer needed).
I believe this is the hardest part :)
AFAIK both ZIL and L2ARC are "ring buffers", in that when new writes
reach the end of device size, they restart at offset zero. In case of
current ZIL, since it knows its preallocated size, and knows whether
the blocks at "offset zero" were flushed to disk, it can choose to
fall back to using main pool storage for ZIL, as it does if there is
no dedicated log device (or if these get broken/removed). This part
would be harder with unknown size of allocations.
One of the common algorithms for doing this is to just use a pointer to the
division point between the top and bottom "halves" of the device. The
movement of that middle pointer then becomes the issue. The mechanics of
moving it can be simplified by keeping the ZIL ring buffer "virtualized" in
RAM, providing a level of indirection so that block numbers are virtual.
For the SLOG it gets more difficult because of the persistence guarantees.
But I think you could do it by "virtualizing" the existence of the SLOG,
where the transition looks like:

slog is locked
slog is flushing
slog is flushed
slog doesn't exist
slog is resized by moving the pointer to the middle
slog is available at the new size.

This could be implemented by the existing handling of "remove slog" followed by
"add new slog". Doing this quickly enough to handle wide and varied loads could
be throttled by a PID-like mechanism that used SLOG delays as the pressure
against a sysadmin-specified time interval, balanced by the overall delay
created by the overhead of removing and re-adding a slog.

Gregg Wonderly
Jim Klimov
2013-04-25 10:29:51 UTC
Post by Karl Wagner
Just a quick note: One of the projects I am currently looking into is,
basically, a simplified HSM system. The underlying pool storage would be
on slow media (the 2 I am thinking of are a mechanical disk library,
similar to a tape library but using HDDs, or "cloud storage", with
limited bandwidth out to the internet). In this situation, even a hard
disk is a hell of a lot quicker than the main storage, hence why I was
suggesting additional cache levels. So, a write comes in (which could be
many gigabytes) and it is safely "buffered" in the log. It can then be
"drip fed" to the slow storage. Meanwhile, there is still an L2ARC on
the combi cache/log (SSD-based), and there is also a large cache on
LnARC, which are just HDDs, avoiding expensive pulls from the main storage.
I was also pondering a ZFS-based HSM, but one aimed at power saving on
home NASes containing lots of data but rather small
working sets. The idea was that the main pool disks (many power-suckers)
would be idle and spun down until really needed, and a few disks
(like a SLOG mirror along with an rpool mirror) could act as a
read/write cache and keep the main pool spun down until there is
a request for its data, or until the cache is close to overfilling.

For example, a home user wants to watch a movie; his main disks spin
up to prefetch it into the cache (ARC, L2ARC, LnARC in your terms)
and, while they are up, the pending writes are pushed onto the main pool too.

Possibly, even VMs for some home infrastructure (Linux streaming,
a browsing desktop, etc.) could be served like this - their disk
images boot up from the main pool, but their writes are cached in
(and re-read from) the caching devices.

Likewise, browsing and other work done on this machine which can
request and write files in a user's home directory (this currently
keeps most PC disks from spinning down, by always having non-empty
outstanding TXGs to commit) would be safe in the cache devices.

Since all such small and big writes are to be routed into the cache,
the main pool disks don't see IOs and spin down after a while (or
maybe sooner by explicit request from ZFS or PM).

This idea has not yet gone beyond my requests for suggestions, among
which was that content-based prefetch strategies (e.g. if I request
a movie, read the whole file once or twice so it gets into the ARC for
good; if I request an MP3, also read the others in the same directory)
would be better implemented by a daemon that monitors VFS I/O,
perhaps with DTrace, rather than being plugged into common ZFS code.

The real HSM, including deferred writes, should likely be a kernel
thing however, be it part of ZFS or some virtual FS layer on top
of several storage devices and pools.

HTH,
//Jim
Karl Wagner
2013-04-25 13:46:25 UTC
Jim

This is interesting reading, too.

It has got me thinking: Are there any userspace interfaces deeper into ZFS?
Also, I know block pointer rewrite was on the list of things to do a while
back. Was this ever implemented?

I'm starting to see a possible alternative solution. Rather than
implementing it at the cache device level (I can see obstacles from what has
been discussed already), implement it at the vdev level with a userspace layer.

The idea would be reasonably simple if a userspace interface is available
(or can be made available).

- Add a method to mark vdevs as read-only to standard filesystem/zvol
interfaces. This means standard code will never write to them, but will
read from them when needed.
- A userspace daemon is able to "move" blocks down from the main vdevs
to the "read-only" vdevs.
- This userspace daemon could link itself to the l2arc_feed_thread to
try to ensure that data which hits the L2ARC is available in the main
vdevs. This would vastly simplify the management code. Rather than it
having to keep track of access patterns and use complicated algorithms to
keep hot data available, it just piggybacks off the L2ARC. As the feed
thread grabs data, the daemon is notified and puts it on a list to be
"copied up". It can also use this list to decide which block(s) should not
be migrated down (although additional work would be required to determine
what, from the remaining data, should be moved down).

IMHO this would be a very "hacky" HSM to start with, but it could be
improved over time and would make a good starting point.
Post by Jim Klimov
Post by Karl Wagner
Just a quick note: One of the projects I am currently looking into is,
basically, a simplified HSM system. The underlying pool storage would be
on slow media (the 2 I am thinking of are a mechanical disk library,
similar to a tape library but using HDDs, or "cloud storage", with
limited bandwidth out to the internet). In this situation, even a hard
disk is a hell of a lot quicker than the main storage, hence why I was
suggesting additional cache levels. So, a write comes in (which could be
many gigabytes) and it is safely "buffered" in the log. It can then be
"drip fed" to the slow storage. Meanwhile, there is still an L2ARC on
the combi cache/log (SSD-based), and there is also a large cache on
LnARC, which are just HDDs, avoiding expensive pulls from the main storage.
I was also pondering about ZFS-based HSM, but rather involving some
power-saving on home-NASes containing lots of data but rather small
working sets. The idea was that main pool disks (many power-suckers)
would be idle and spun-down until really needed, and a few disks
(like a SLOG mirror along with an rpool mirror) could act like a
read/write cache and keep the main pool spun down until there is
a request for its data, or until the cache is close to overfilling.
For example, a home user wants to see a movie, his main disks spin
up to prefetch it into the cache (ARC, L2ARC, LnARC in your terms)
and while they are up, the writes are pushed onto main pool too.
Possibly, even VMs for some home infrastructure (Linux streaming,
a browsing desktop, etc.) could be served like this - their disk
images boot up from main pool, but their writes are cached into
(and re-read from) the caching devices.
Likewise, browsing and other work done on this machine which can
request and write files in a user's home directory (this currently
keeps most PC disks from spinning down, by always having non-empty
outstanding TXGs to commit) would be safe in the cache devices.
Since all such small and big writes are to be routed into the cache,
the main pool disks don't see IOs and spin down after a while (or
maybe sooner by explicit request from ZFS or PM).
This idea did not yet go beyond my requests for suggestions, among
which was that content-based prefetch strategies (i.e. if I request
a movie - read the whole file once or twice so it gets into ARC for
good; if I request an MP3 - also read others in the same catalog)
would be better implemented by a daemon that monitors VFS IO,
perhaps with DTrace, rather than plugged into common ZFS code.
The real HSM, including deferred writes, should likely be a kernel
thing however, be it part of ZFS or some virtual FS layer on top
of several storage devices and pools.
HTH,
//Jim
Jim Klimov
2013-04-25 14:23:51 UTC
Post by Karl Wagner
It has got me thinking: Are there any userspace interfaces deeper into
ZFS? Also, I know block pointer rewrite was on the list of things to do
a while back. Was this ever implemented?
AFAIK, no - and it is still long desired, and has been for a decade or so.
There are many use-cases it could solve one way or another,
such as reducing TLVDEV size, removing TLVDEVs from a pool, changing
redundancy levels, defragmentation (though there are several goals
that lead to conflicting definitions of defrag), post-processing
dedup and likely un-dedup (to remove DDT overheads for unique blocks),
and more. Whenever these pop up, people say that BPRewrite is one of the
fitting solutions, that it should finally be done, and that it would be
the "magic pill" to solve these problems.

To outsiders like me it seems like a simple task, while
people in the know - who have tried to architect and do it - say that in
fact it is quite hard to do properly and reliably. One limitation
is maintaining consistency on live pools (which receive other I/O besides
the rewrite); a limited solution for offline/read-only pools
is much more feasible but less desired by the market - which is
what ultimately funds much of the work.
Post by Karl Wagner
I'm starting to see a possible alternative solution, you see. Rather
than implementing at the cache device level (I can see obstacles from
what has been discussed already), implement at a vdev level with a
userspace layer.
The idea would be reasonably simple if a userspace interface is
available (or can be made available).
* Add a method to mark vdevs as read-only to standard filesystem/zvol
interfaces. This means standard code will never write to them, but
will read from them when needed.
* A userspace daemon is able to "move" blocks down from the main vdevs
to the "read-only" vdevs.
* This userspace daemon could link itself to the l2arc_feed_thread to
try to ensure that data which hits the l2arc is available in the
main vdevs. This would vastly simplify the management code. Rather
than it having to keep track of access patterns and use complicated
algorithms to keep hot data available, it just piggybacks off the
l2arc. As the feed thread grabs data, the daemon is notified and
puts it on a list to be "copied up". It can also use this list to
devide which block(s) should not be migrated down (although
additional work would be required to determine what, from the
remaining data, should be moved down).
This sounds suspicious for HSM usage: the L2ARC is throttled so as not to
get in the way of "real" I/O requests. The feed thread monitors the tails
of the RAM ARC for blocks which are likely to expire soon and be forgotten,
and relocates those to the L2ARC. The relocation rate is relatively
slow (8 Mbyte/s by default) and it does not matter if some blocks are
not relocated and do become forgotten; upon the next request they
would simply be fetched from the main pool rather than the SSD.

While that is a fitting strategy for a read cache, it is of course not
good for a write cache, which should not be so careless about losing
data (during propagation from fast tiers to slow tiers, in the case of
hierarchical storage management). Such propagations should be written
into as many copies as the policy configures, acknowledged, and
only then allowed to expire, sooner or later, from the faster, smaller layer.

If your multilayer storage is intended to be a read cache, as your
other emails suggested, then I guess it is appropriate to piggy-back
on the l2arc feeder concept. However, while expiration from the RAM ARC is
easily detectable - by position in the automatically sorted lists
(MFU/LRU) and distance to the end of the list, enforced by
external conditions (free RAM) - expiration from an LnARC is based on
ring buffer overwrites. I guess it can be detected, e.g. by
sorting the RAM pointers in order of device ID and offset, but the
mechanism to find the soon-to-expire entries and feed the relocations
is different. And note that after you fill all the devices, expiration
from the top one (L2ARC) might require relocations of expiring blocks
all the way down your stack of LnARCs, getting slower and slower by design
at each layer. Likely their feed prefetchers should take longer
tails with each step down in I/O speed - and be ready to lose some
data from the slower devices upon each relocation from the faster ones...

HTH,
//Jim Klimov
Karl Wagner
2013-04-25 15:09:01 UTC
Sorry, I may not have put that clearly enough. This was actually going down
a completely different route.

What you would do is make up a pool of some "fast" and some "slow" vdevs.
The fast ones could be SSDs, but they could equally just be HDDs. The point
is that they are faster than the slow vdevs, which could be "cloud"
storage, a tape or disk library, or (as you suggested) spun-down drives.

The slow vdevs are marked as almost read-only: normal processes will
not write any data to them. All data written goes to the fast vdevs.

A separate daemon, hooked into some sort of BP rewrite interface, will
periodically check space on the fast vdevs and move some data down to the
slow ones. It will also have hooks which can copy up any data needed. My
suggestion for this was to hook it into the L2 feed thread, so that any
data which was to be pushed to the L2ARC was also pushed to the fast vdevs
(and probably removed from the slow ones, or marked/queued for deletion).

Anyway, if BP rewrite still isn't done, this would not be possible either.
(By the way I am not having a go at anyone here. I am sure it is a much
more complicated task than I realise.)
Post by Jim Klimov
Post by Karl Wagner
It has got me thinking: Are there any userspace interfaces deeper into
ZFS? Also, I know block pointer rewrite was on the list of things to do
a while back. Was this ever implemented?
AFAIK, no - and that is still long desired, for a decade or so.
There are many use-cases that could be solved one way or another,
such as reducing TLVDEV size, removing TLVDEVs from a pool, changing
redundancy levels, defragmentation (though there are several goals
that lead to conflicting definitions of defrag), post-processing
dedup and likely un-dedup (to remove DDT overheads for unique blocks),
and more. Whenever these pop up, people say that BPRewrite is one of
fitting solutions, and should finally be done and it would be the
"magic pill" to solve the problems.
And to outsiders like me it seems like a simple-looking task, while
people in the know - who tried to architect and do it - say that in
fact it is quite hard to do properly and reliably. One limitation
is for consistency on live pools (which receive other IOs beside
the rewrite), and a limited solution for offline/readonly pools
is much more possible but less desired by the market - which is
what ultimately funds many of the works.
I'm starting to see a possible alternative solution, you see. Rather
Post by Karl Wagner
than implementing at the cache device level (I can see obstacles from
what has been discussed already), implement at a vdev level with a
userspace layer.
The idea would be reasonably simple if a userspace interface is
available (or can be made available).
* Add a method to mark vdevs as read-only to standard filesystem/zvol
interfaces. This means standard code will never write to them, but
will read from them when needed.
* A userspace daemon is able to "move" blocks down from the main vdevs
to the "read-only" vdevs.
* This userspace daemon could link itself to the l2arc_feed_thread to
try to ensure that data which hits the l2arc is available in the
main vdevs. This would vastly simplify the management code. Rather
than it having to keep track of access patterns and use complicated
algorithms to keep hot data available, it just piggybacks off the
l2arc. As the feed thread grabs data, the daemon is notified and
puts it on a list to be "copied up". It can also use this list to
devide which block(s) should not be migrated down (although
additional work would be required to determine what, from the
remaining data, should be moved down).
This sounds suspicious for HSM usage: L2ARC is throttled so as to not
get into way of "real IO requests". The feed thread monitors tails of
the RAM ARC for blocks which are likely to soon expire and be forgotten,
and relocates those to the L2ARC. The relocation intensity is relatively
slow (8Mbyte/s by default) and it does not matter if some blocks are
not relocated and do become forgotten. Then upon next request they
would be fetched not from SSD but from main pool, that's all.
While it is a fitting strategy for a read-cache, it is of course not
good for a write-cache, which should not be so careless about losing
data (during propagations from fast tiers to slow tiers, in case of
hierarchical storage management). Such propagations should be written
into as many copies as is configured by the policy, acknowledged, and
only then may expire sooner or later from the faster smaller layer.
If your multilayer storage does intend to be a read-cache, as your
other letters suggested, then I guess it is appropriate to piggy-back
on l2arc feeder concept. However, while expiration from RAM ARC is
easily detectable - by position in the automatically sorted list of
distributions (MFU/LRU) and distance to the end of list enforced by
external conditions (free RAM) - expiration from LnARC is based on
the ring buffer overwrites. I guess it can be detectable, i.e. by
sorting RAM pointers in the order of device ID and offset, but the
mechanism to find the soon-expiring entries and feed the relocations
is different. And note that after you fill all devices, expiration
from top one (L2ARC) might require relocations of expiring blocks
all over your stack of LnARCs, getting slower and slower by design
on each layer. Likely, their feed prefetchers should take longer
tails with each step down in IO speed - and be ready to lose some
data from slower devices upon each relocation from faster ones...
HTH,
//Jim Klimov
Erik Trimble
2013-04-27 17:25:27 UTC
Yeah, well, I'm suspecting that bp_rewrite may require a very
significant rewrite of a large chunk of the codebase at this point.
Which I understand, since I've managed to write a defragger for ZFS
which depends on certain bp_rewrite behavior (i.e. I did a userland
daemon to defrag/realign a pool). The test code worked well in
simulation (i.e. where I could fake having a real bp_rewrite call to
move the underlying slab to new physical locations), but there were a
large number of edge-case issues that made the daemon much, much more
complex than I had originally scoped. As in, 75% of the codebase is for
handling the not-straightforward situations which were nonetheless
likely to occur in typical use.
:-(

One thing I've not seen discussed here is ARC pressure, particularly
that which comes from having large L2ARC devices. The "store DDT on
L2ARC only" idea is pretty much a subset of this idea, which is simply:
eliminate references in the ARC to L2ARC locations, and treat all RAM/cache
(NOT log) devices as a single large space. Alternately, we could keep
the distinction between ARC and L2ARC (which probably makes sense), but
remove the requirement that every record in the L2ARC have a pointer to it
in the ARC.

Anything that fills the L2ARC with small records puts enormous
pressure on the ARC. The current worst case is that the ARC can require up
to 50% of the space in the L2ARC for very small records.
Post by Karl Wagner
Sorry, I may not have put that clearly enough. This was actually going
down a completely different route.
What you would do is make a pool up of some "fast" and some "slow"
vdevs. The fast ones could be SSDs, but they could equally just be
HDDs. The point is that they are faster than the slow vdevs, which
could be "cloud" storage, a tape or disk library, or (as you
suggested) spun-down drives.
The slow drives are marked as almost read only: The normal processes
will not write any data to them. All data written will go to the fast
vdevs.
A separate daemon, hooked into some sort of BP rewrite interface, will
periodically check space on the fast vdevs and move some data down to
the slow ones. It will also have hooks which can copy up any data
needed. My suggestion for this was to hook it into the L2 feed thread,
so that any data which was to be pushed to the L2ARC was also pushed
to the fast vdevs (and probably removed from the slow ones, or
marked/queued for deletion).
Anyway, if BP rewrite still isn't done, this would not be possible
either. (By the way I am not having a go at anyone here. I am sure it
is a much more complicated task than I realise.)
That actually is really useful, particularly for remote-replication
between datacenter SANs. If one vdev is local, and one mounted as (say)
iSCSI from a device somewhere else, then having different commit
capabilities would be fabulous. You'd need some sort of different cache
requirement for the "slow" vdevs, so that you could queue up writes to
them that had already been serviced by the main "fast" vdevs. Some sort
of tunable that allows you to set a fixed amount of ARC aside for
holding the "slow" vdev's write data. Contrary to the above, this would
mean that the slow vdevs were almost "write-only", as all the read
requests for that pool would be serviced from the local "fast" vdevs.
That would also help keep things balanced: since a commit to a remote
vdev is likely to be an order of magnitude slower than local media (on a
good day), having the slow vdev do only streaming writes would give it
some chance of keeping up with the main fast vdevs (which are usually
occupied with lots of reads, too).

HSM-like migration behavior isn't terribly interesting anymore. But
remote replication is a HUGE deal.


-Erik
Richard Elling
2013-04-28 16:48:11 UTC
Yeah, well, I'm suspecting that bp_rewrite may require a very significant rewrite of a large chunk of the codebase, at this point. Which I understand, since I've managed to write a defragger for ZFS which depends on certain bp_rewrite behavior (i.e. I did a userland daemon to defrag/realign a pool). The test code worked well in simulation (i.e. where I could fake having a real bp_rewrite call to move the underlying slab to new physical locations), but there were a large number of edge case issues that made the daemon much, much more complex than I had originally scoped. As in 75% of the codebase is for handling the not-straight-forward situations which were nonetheless likely to occur in typical use.
:-(
yep
One thing I've not seen discussed here is ARC pressure, particularly that which comes from having large L2ARC devices. The "Store DDT on L2ARC-only" idea is pretty much a subset of this idea, which is simply: eliminate references in ARC to L2ARC locations, and treat all RAM/cache (NOT log) devices as a single large space. Alternately, we could keep the distinction between ARC and L2ARC (which, probably makes sense), but remove the requirement that all records in L2ARC have a pointer to them in ARC.
Anything that fills the L2ARC with small block size records puts an enormous pressure on ARC. Current worse-case is that ARC can require up to 50% of the space in L2ARC for very small records.
I haven't done much research here. But a new thread is warranted to keep the conversation going.
Sorry, I may not have put that clearly enough. This was actually going down a completely different route.
What you would do is make a pool up of some "fast" and some "slow" vdevs. The fast ones could be SSDs, but they could equally just be HDDs. The point is that they are faster than the slow vdevs, which could be "cloud" storage, a tape or disk library, or (as you suggested) spun-down drives.
The slow drives are marked as almost read only: The normal processes will not write any data to them. All data written will go to the fast vdevs.
A separate daemon, hooked into some sort of BP rewrite interface, will periodically check space on the fast vdevs and move some data down to the slow ones. It will also have hooks which can copy up any data needed. My suggestion for this was to hook it into the L2 feed thread, so that any data which was to be pushed to the L2ARC was also pushed to the fast vdevs (and probably removed from the slow ones, or marked/queued for deletion).
Anyway, if BP rewrite still isn't done, this would not be possible either. (By the way I am not having a go at anyone here. I am sure it is a much more complicated task than I realise.)
That actually is really useful, particularly for remote-replication between datacenter SANs. If one vdev is local, and one mounted as (say) iSCSI from a device somewhere else, then having different commit capabilities would be fabulous. You'd need some sort of different cache requirement for the "slow" vdevs, so that you could queue up writes to them that had already been serviced by the main "fast" vdevs. Some sort of tunable that allows you to set a fixed amount of ARC aside for holding the "slow" vdev's write data. Contrary to the above, this would mean that the slow vdevs were almost "write-only", as all the read requests for that pool would be serviced from the local "fast" vdevs. Which would also help keep things balanced, since a commit to a remote vdev is likely to be an order of magnitude slower than local media (on a good day), and having the slow vdev do only streaming writes would enable it to have some chance of keeping up with the main fast vdevs (which are usually occupied with lots of reads, too).
HSM-like migration behavior isn't terribly interesting anymore. But remote replication is a HUGE deal.
I know of many remote mirrors with asymmetric read behaviour, the idea is really quite old.
The trick is how to do this dynamically so that there is no burden on the admin...
-- richard

Erik Trimble
2013-04-28 19:53:22 UTC
Post by Richard Elling
Post by Erik Trimble
Post by Karl Wagner
Sorry, I may not have put that clearly enough. This was actually
going down a completely different route.
What you would do is make a pool up of some "fast" and some "slow"
vdevs. The fast ones could be SSDs, but they could equally just be
HDDs. The point is that they are faster than the slow vdevs, which
could be "cloud" storage, a tape or disk library, or (as you
suggested) spun-down drives.
The slow drives are marked as almost read only: The normal processes
will not write any data to them. All data written will go to the
fast vdevs.
A separate daemon, hooked into some sort of BP rewrite interface,
will periodically check space on the fast vdevs and move some data
down to the slow ones. It will also have hooks which can copy up any
data needed. My suggestion for this was to hook it into the L2 feed
thread, so that any data which was to be pushed to the L2ARC was
also pushed to the fast vdevs (and probably removed from the slow
ones, or marked/queued for deletion).
Anyway, if BP rewrite still isn't done, this would not be possible
either. (By the way I am not having a go at anyone here. I am sure
it is a much more complicated task than I realise.)
That actually is really useful, particularly for remote-replication
between datacenter SANs. If one vdev is local, and one mounted as
(say) iSCSI from a device somewhere else, then having different
commit capabilities would be fabulous. You'd need some sort of
different cache requirement for the "slow" vdevs, so that you could
queue up writes to them that had already been serviced by the main
"fast" vdevs. Some sort of tunable that allows you to set a fixed
amount of ARC aside for holding the "slow" vdev's write data.
Contrary to the above, this would mean that the slow vdevs were
almost "write-only", as all the read requests for that pool would be
serviced from the local "fast" vdevs. Which would also help keep
things balanced, since a commit to a remote vdev is likely to be an
order of magnitude slower than local media (on a good day), and
having the slow vdev do only streaming writes would enable it to have
some chance of keeping up with the main fast vdevs (which are usually
occupied with lots of reads, too).
HSM-like migration behavior isn't terribly interesting anymore. But
remote replication is a HUGE deal.
I know of many remote mirrors with asymmetric read behaviour, the idea is really quite old.
The trick is how to do this dynamically so that there is no burden on the admin...
-- richard
I'd presume that whatever we want should be admin-transparent, other
than the admin indicating that a particular side of the mirror is "slow".

e.g.

zpool create tank mirror A B mirror C D

which would create a 2-wide stripe of standard 2-drive mirrors locally,
then a new syntax to do:

zpool attach tank remote X

or

zpool attach tank remote X Y

which would attach a slow device X (or a stripe of slow devices X and Y)
to the tank mirror


The system could then dynamically adjust the "write cache" requirements
for the "slow" side of the mirror, based on observations of how long a
typical commit took to return for the slow device.

You could then end up with this scenario:

1) async write comes in, and is stored in ARC
2) ZFS decides to flush waiting writes to backing store after a time
(say 30 sec)
3) write is committed to "fast" side of mirror, and is now immediately
available for re-reading/modifying
4) write is NOT removed from ARC yet; instead, marked as "waiting to
commit to remote" (or something like that)
5) at some other interval (different than #2), all such marked "remote
write" chunks are streamed to the remote device
6) remote device reports write has committed, and the write is marked as
finished in ARC (i.e. available for eviction)

One tuning parameter should be the interval at #5 (likely a maximum
time), which might be dynamically adjusted based on a history of
time-to-complete-remote-write averages. There should likely be another
tuning parameter: the percentage of the ARC which can sit in the
"waiting to commit to remote" state.

I'd also assume that NO reads were ever done from the remote side unless
the local side of the mirror was faulted or otherwise unavailable.
That is, there never would be any attempt to balance read I/O between
the "fast" and "slow" portions - one would only ever read from one side
of the mirror.

I'm also assuming that the system should be able to report that a slow
device simply can't keep up with the I/O demand. In such a case, it
should be marked as "offline", and the admin would have to manually
"online" it again to force a resync (presumably when the remote device
has enough capacity to resync).
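
The manual half of that workflow exists today; what would be new is the
automatic "can't keep up" detection that triggers it. The pool and device
names below are placeholders:

zpool offline tank c3t0d0   # stop issuing I/O to the lagging remote device
zpool online tank c3t0d0    # bring it back later; ZFS resilvers what it missed
zpool status tank           # watch resilver progress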

-Erik
Richard Elling
2013-04-28 23:44:51 UTC
Post by Richard Elling
Sorry, I may not have put that clearly enough. This was actually going down a completely different route.
What you would do is make a pool up of some "fast" and some "slow" vdevs. The fast ones could be SSDs, but they could equally just be HDDs. The point is that they are faster than the slow vdevs, which could be "cloud" storage, a tape or disk library, or (as you suggested) spun-down drives.
The slow drives are marked as almost read only: The normal processes will not write any data to them. All data written will go to the fast vdevs.
A separate daemon, hooked into some sort of BP rewrite interface, will periodically check space on the fast vdevs and move some data down to the slow ones. It will also have hooks which can copy up any data needed. My suggestion for this was to hook it into the L2 feed thread, so that any data which was to be pushed to the L2ARC was also pushed to the fast vdevs (and probably removed from the slow ones, or marked/queued for deletion).
Anyway, if BP rewrite still isn't done, this would not be possible either. (By the way I am not having a go at anyone here. I am sure it is a much more complicated task than I realise.)
That actually is really useful, particularly for remote-replication between datacenter SANs. If one vdev is local, and one mounted as (say) iSCSI from a device somewhere else, then having different commit capabilities would be fabulous. You'd need some sort of different cache requirement for the "slow" vdevs, so that you could queue up writes to them that had already been serviced by the main "fast" vdevs. Some sort of tunable that allows you to set a fixed amount of ARC aside for holding the "slow" vdev's write data. Contrary to the above, this would mean that the slow vdevs were almost "write-only", as all the read requests for that pool would be serviced from the local "fast" vdevs. Which would also help keep things balanced, since a commit to a remote vdev is likely to be an order of magnitude slower than local media (on a good day), and having the slow vdev do only streaming writes would enable it to have some chance of keeping up with the main fast vdevs (which are usually occupied with lots of reads, too).
HSM-like migration behavior isn't terribly interesting anymore. But remote replication is a HUGE deal.
I know of many remote mirrors with asymmetric read behaviour, the idea is really quite old.
The trick is how to do this dynamically so that there is no burden on the admin...
-- richard
I'd presume that whatever we want should be admin-transparent, other than the admin indicating that a particular side of the mirror is "slow".
e.g.
zpool create tank mirror A B mirror C D
zpool attach tank remote X
or
zpool attach tank remote X Y
which would attach a slow device X (or a stripe of slow devices X and Y) to the tank mirror
Manually specifying which side of the mirror is to be write-mostly is clearly an ugly solution and,
indeed, when we've had this sort of thing in the past (VxVM) it became painful to manage.
Automation is really the answer, and today there isn't a good solution for this automation in
illumos. This is not an easy problem to solve, however (see my previous query in the old thread
about how we can tell if a disk is "busy")

For example, suppose we have a metro cluster and want to be able to failover to the remote
datacenter. In this case, when the pool is imported on the remote, it should prefer its local
side of the mirror for read. To do this, we'd have to build a hostid/leaf vdev mapping. Expand
to a 3-way mirror and it gets really, really ugly.
The system could then dynamically adjust the "write cache" requirements for the "slow" side of the mirror, based on observations of how long a typical commit took to return for the slow device.
1) async write comes in, and is stored in ARC
2) ZFS decides to flush waiting writes to backing store after a time (say 30 sec)
3) write is committed to "fast" side of mirror, and is now immediately available for re-reading/modifying
4) write is NOT removed from ARC yet; instead, marked as "waiting to commit to remote" (or something like that)
To be clear, recently written data is not removed from the ARC. It is flushed as a part of the
normal ARC resize/shrink process and can take advantage of the MRU/MFU properties
of the ARC.
5) at some other interval (different than #2), all such marked "remote write" chunks are streamed to the remote device
6) remote device reports write has committed, and the write is marked as finished in ARC (i.e. available for eviction)
In essence, this is how mirroring works today. What is deficient with mirroring (trusted and
proven in remote scenarios) that requires a new form of caching that further locks data
in the ARC?
Some tuning parameter should be for the interval at #5 (likely a max time), which might be dynamically adjusted based on a history of time-to-complete-remote-write averages. There likely should be another tuning parameter, which is the percentage of ARC which can sit in the "waiting to commit to remote" status.
I'd also assume that NO reads were ever done from the remote side unless the local side of the mirror was faulted or otherwise unavailable. That is, there never would be any attempt to balance read I/O between the "fast" and "slow" portions - one would only ever read from one side of the mirror.
Yes, this is what we mean by asymmetric mirrors: writes go to all mirrors, reads are
preferred to be satisfied from the lowest cost side of the mirror, where lowest cost
could be bandwidth-constrained or affected by some other policy.
I'm also assuming that the system should be able to report that a slow device simply can't keep up with the I/O demand. In such a case, it should be marked as "offline", and the admin would have to manually "online" it again to force a resync (assumedly when the remote device has enough capacity to resync).
More work can be done here, too. I have some ideas that are proven in other industries.
Given the blocking nature of I/O in illumos, how do you propose determining that a
device is consistently slow (driving read policy decisions), or just temporarily slow
(transient faults or load conditions)?
-- richard

--

***@RichardElling.com
+1-760-896-4422












Erik Trimble
2013-04-29 00:46:53 UTC
Permalink
Post by Richard Elling
[snip]
Manually specifying which side of the mirror is to be write-mostly is
clearly an ugly solution and,
indeed, when we've had this sort of thing in the past (VxVM) it became painful to manage.
Automation is really the answer, and today there isn't a good solution
for this automation in
illumos. This is not an easy problem to solve, however (see my
previous query in the old thread
about how can we tell if a disk is "busy")
For example, suppose we have a metro cluster and want to be able to failover to the remote
datacenter. In this case, when the pool is imported on the remote, it
should prefer its local
side of the mirror for read. To do this, we'd have to build a
hostid/leaf vdev mapping. Expand
to a 3-way mirror and it gets really, really ugly.
Yes, that does get ugly fast. I hadn't thought of that, though I
obviously should have.
Post by Richard Elling
The system could then dynamically adjust the "write cache"
requirements for the "slow" side of the mirror, based on observations
of how long a typical commit took to return for the slow device.
1) async write comes in, and is stored in ARC
2) ZFS decides to flush waiting writes to backing store after a time (say 30 sec)
3) write is committed to "fast" side of mirror, and is now
immediately available for re-reading/modifying
4) write is NOT removed from ARC yet; instead, marked as "waiting to
commit to remote" (or something like that)
To be clear, recently written data is not removed from the ARC. It is
flushed as a part of the
normal ARC resize/shrink process and can take advantage of the MRU/MFU properties
of the ARC.
Exactly, that's what I expected - that write data stays in ARC until
evicted for more "pressing" data. (based on MRU/MFU)
Post by Richard Elling
5) at some other interval (different than #2), all such marked
"remote write" chunks are streamed to the remote device
6) remote device reports write has committed, and the write is marked
as finished in ARC (i.e. available for eviction)
In essence, this is how mirroring works today. What is deficient with
mirroring (trusted and
proven in remote scenarios) that requires a new form of caching that further locks data
in the ARC?
I was assuming that mirrored writes today blocked on ALL sides of the
mirror completing before allowing ARC eviction (i.e. that ZFS wouldn't
allow the write to be removed from ARC until it got a success or failure
reply from all submirrors). Can it allow a write to be deleted from ARC
if only a single successful submirror commit happens?
Post by Richard Elling
Some tuning parameter should be for the interval at #5 (likely a max
time), which might be dynamically adjusted based on a history of
time-to-complete-remote-write averages. There likely should be
another tuning parameter, which is the percentage of ARC which can
sit in the "waiting to commit to remote" status.
I'd also assume that NO reads were ever done from the remote side
unless the local side of the mirror was faulted or otherwise
unavailable. That is, there never would be any attempt to balance
read I/O between the "fast" and "slow" portions - one would only ever
read from one side of the mirror.
Yes, this is what we mean by asymmetric mirrors: writes go to all mirrors, reads are
preferred to be satisfied from the lowest cost side of the mirror, where lowest cost
could be bandwidth-constrained or affected by some other policy.
That's the current policy, but I'm not sure how it's implemented. Having
a remote mirror capability would probably simplify this calculation, as
the remote side would be permanently ignored so long as the local one
existed - there'd be no need to figure out which submirror was
performing better.
Post by Richard Elling
I'm also assuming that the system should be able to report that a
slow device simply can't keep up with the I/O demand. In such a case,
it should be marked as "offline", and the admin would have to
manually "online" it again to force a resync (assumedly when the
remote device has enough capacity to resync).
More work can be done here, too. I have some ideas that are proven in other industries.
Given the blocking nature of I/O in illumos, how do you propose determining that a
device is consistently slow (driving read policy decisions), or just temporarily slow
(transient faults or load conditions)?
-- richard
I'd presume to keep a counter or the like, with a running time average
indicating time-to-complete for each IOP. Or maybe something like
"bytes successfully committed over time X". Comparing these values
between submirror components should easily tell you which is
local/remote. And I'm assuming (maybe wrongly) that we're using the
Fault Management stuff to decide if a device is dead, not commit-time
values.
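
A toy sketch of that bookkeeping (names, weights and thresholds below are invented purely for illustration; this is not illumos code): keep a slow-moving and a fast-moving decaying latency average per leaf, and only call a device "consistently" slow when both horizons agree:

#include <stdbool.h>
#include <stdint.h>

typedef struct leaf_iostats {
        uint64_t ls_lat_long_us;        /* slow-moving average of I/O latency */
        uint64_t ls_lat_short_us;       /* fast-moving average of I/O latency */
} leaf_iostats_t;

static void
leaf_record_latency(leaf_iostats_t *ls, uint64_t lat_us)
{
        /* exponentially decaying averages: 1/64 and 1/4 sample weights */
        ls->ls_lat_long_us += ((int64_t)lat_us - (int64_t)ls->ls_lat_long_us) / 64;
        ls->ls_lat_short_us += ((int64_t)lat_us - (int64_t)ls->ls_lat_short_us) / 4;
}

/*
 * "Consistently" slow: far slower than the best sibling over the long
 * horizon as well as right now.  If only the short horizon looks bad,
 * treat it as a transient fault or load spike and leave policy alone.
 */
static bool
leaf_consistently_slow(const leaf_iostats_t *ls, uint64_t best_long_us)
{
        return (ls->ls_lat_long_us > 10 * best_long_us &&
            ls->ls_lat_short_us > 10 * best_long_us);
}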


Thinking more about this, maybe the better idea is to simply work on
increasing the efficiency of writing to noticeably slower submirrors.
Then again, current policy usually results in multi-megabyte commit
sections (as async writes group into a single larger streaming write),
so maybe all this discussion is just beside the point?

-Erik



Sašo Kiselkov
2013-04-29 06:19:21 UTC
Permalink
Post by Richard Elling
Manually specifying which side of the mirror is to be write-mostly is clearly an ugly solution and,
indeed, when we've had this sort of thing in the past (VxVM) it became painful to manage.
Automation is really the answer, and today there isn't a good solution for this automation in
illumos. This is not an easy problem to solve, however (see my previous query in the old thread
about how can we tell if a disk is "busy")
For example, suppose we have a metro cluster and want to be able to failover to the remote
datacenter. In this case, when the pool is imported on the remote, it should prefer its local
side of the mirror for read. To do this, we'd have to build a hostid/leaf vdev mapping. Expand
to a 3-way mirror and it gets really, really ugly.
This is something I've been thinking about in the recent past and it
seems to me to be a relatively easily solvable problem. Here's the idea:

Kernel work:

1) Create the notion of a transient per-vdev property. This would not
be stored in the vdev nvlist in the vdev label, but would instead be
kept in kernel memory only so long as the pool is imported.

2) Create a transient "readcost" or "readmetric" property that would
associate some numeric value with the cost of accessing the device
for read.

3) Modify the read code path in vdev_mirror.c, probably
vdev_mirror_child_select(), to take this numeric value into account
when calculating some kind of score that would then further inform
its child selection algorithm.

Userspace work:

1) Modify userspace tools (e.g. zpool) to set/read these per-vdev
transient properties.

2) Create hooks in the import code paths in the zpool command to allow
calling external "geometry setup" scripts that would then configure
the pool's asymmetric access properties.

This would allow us to keep the mechanism in the kernel with the minimum
amount of effort, while keeping policy in userspace. The more
adventurous users could then write a userspace "routing daemon" that
would query the underlying storage architecture to determine the cost of
reaching some devices. This could also be made part of some cluster
suite (RSF-1, Pacemaker, etc.). Or admins could just hard-code their
geometry using a list of "zpool set transient" commands. In any case,
pool geometry layout would be kept locally on the machine that accesses
it, allowing for building differing access geometries from various
machines, or even switching the entire geometry discovery mechanism by
simply switching userspace software stacks.
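
A very rough sketch of what step 3 could look like (this is not the actual vdev_mirror.c code; the struct and field names below are simplified stand-ins):

#include <limits.h>

typedef struct mirror_child {
        int     mc_pending;     /* I/Os already queued to this child */
        int     mc_readcost;    /* transient per-vdev cost; 0 = local/cheap */
        int     mc_skipped;     /* nonzero if this child can't serve the read */
} mirror_child_t;

/*
 * Pick the readable child with the lowest combined score of queue
 * depth and the admin/daemon-supplied read cost.  With every cost left
 * at the default of zero this degenerates to a plain "least busy"
 * choice, so local-only pools behave as before.
 */
static int
mirror_child_select(mirror_child_t *mc, int children)
{
        int best = -1, best_score = INT_MAX;

        for (int c = 0; c < children; c++) {
                if (mc[c].mc_skipped)
                        continue;
                int score = mc[c].mc_pending + mc[c].mc_readcost;
                if (score < best_score) {
                        best_score = score;
                        best = c;
                }
        }
        return (best);          /* -1: no readable child at all */
}

The userspace side then only has to push one number per leaf vdev into that cost field, whether it comes from a routing daemon, a cluster suite, or a hard-coded list of commands run at import time.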

Cheers,
--
Saso
Garrett D'Amore
2013-04-29 07:44:30 UTC
Permalink
I have a feeling that the approach here is substantially wrong. Instead of trying to figure out vdev path distances and optimize, it's better to just direct the IOs to the next device that is available to service the queue. A device with a long latency is going to take longer, and will probably serve fewer IOs per second *unless* there is so much IO queued up (reads) that latency is not dominant. (In which case distance doesn't matter.)

If you've designed your mirror with latencies exceeding a couple hundred msec, then you're in trouble anyway. Otherwise the occasional 100-200 msec latency isn't going to be tragic to real world apps.

What *might* be useful, however, is to measure latencies (rolling average?) and tune the queue depths automatically - so a distant vdev could have a shorter queue than a closer one. That would tend to avoid queueing up much on that device and minimize latency caused by queue depth.

Properly done, this would self tune pretty easily. :-)
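
To make that concrete, here is a self-contained sketch of the sort of thing meant (all names and constants are invented; not illumos code): keep a rolling latency average per child and derive a queue-depth cap from it.

#include <stdint.h>

#define EMA_WEIGHT      8       /* each new sample gets 1/8 weight */
#define MAX_QDEPTH      32
#define MIN_QDEPTH      1

typedef struct child_qstats {
        uint64_t cq_avg_lat_us; /* decaying average of observed I/O latency */
        uint32_t cq_qdepth_cap; /* queue-depth cap derived from it */
} child_qstats_t;

static void
child_update(child_qstats_t *cq, uint64_t lat_us, uint64_t fastest_avg_us)
{
        int64_t delta = (int64_t)lat_us - (int64_t)cq->cq_avg_lat_us;

        /* rolling (exponentially decaying) latency average */
        cq->cq_avg_lat_us = (uint64_t)((int64_t)cq->cq_avg_lat_us +
            delta / EMA_WEIGHT);

        /*
         * Scale the queue cap inversely with how much slower this child
         * is than the fastest sibling: roughly 10x the latency gets
         * roughly 1/10 the queue, so little I/O piles up behind the
         * distant device and its latency stays dominated by the wire,
         * not by queueing.
         */
        uint64_t ratio = cq->cq_avg_lat_us / (fastest_avg_us ? fastest_avg_us : 1);
        uint32_t cap = (ratio > 1) ? (uint32_t)(MAX_QDEPTH / ratio) : MAX_QDEPTH;

        cq->cq_qdepth_cap = (cap > MIN_QDEPTH) ? cap : MIN_QDEPTH;
}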

- Garrett
Post by Sašo Kiselkov
[snip]
Geoff Nordli
2013-04-29 16:31:38 UTC
Permalink
Post by Garrett D'Amore
[snip]
Is focusing at the vdev level the right approach? Why wouldn't you focus
the replication at the pool/dataset level?

If the data is sitting at the remote site, you can't serve IO to clients
from it, even if it is in a read-only state. Doesn't all client IO need
to get coordinated through the active "head" system located at the
primary site?

The nice thing about ZFS with replication is transaction groups, so if
the remote node gets disconnected, you can always sync up using txgs.
You don't need to worry about keeping a write log, or worry about write
I/O ordering.

Geoff
Nico Williams
2013-04-29 18:05:13 UTC
Permalink
Is focusing at the vdev level the right approach?
Yes.
Why wouldn't you focus the replication at the pool/dataset level?
That works too, but you don't get *synchronous* mirroring that way.
If you need synchronous mirroring, then this needs to happen at the
vdev level OR you have to pay for the latency of mirroring in series
with txg commit. The latter is more painful than the former.

Nico
--
Geoff Nordli
2013-04-29 20:08:41 UTC
Permalink
Post by Nico Williams
Is focusing at the vdev level the right approach?
Yes.
Why wouldn't you focus the replication at the pool/dataset level?
That works too, but you don't get *synchronous* mirroring that way.
If you need synchronous mirroring, then this needs to happen at the
vdev level OR you have to pay for the latency of mirroring in series
with txg commit. The latter is more painful than the former.
Nico
Hi Nico.

If you needed synchronous mirroring, what about forcing everything
through the ZIL and then only committing transactions once the data is on both sites?

I would think the txg level could be more of an async method and also
used for recovery processes in the event of a network failure/disruption.

To me, adding replication at the vdev level seems like a management
headache. When you fail over to the remote site you are going to need a
fully working system able to handle the load with the desired
redundancy, if your goal is business continuity.

Most businesses don't need sync level protection.

Geoff
Nico Williams
2013-04-29 20:39:10 UTC
Permalink
Post by Nico Williams
Is focusing at the vdev level the right approach? Why wouldn't you focus the
Yes.
replication at the pool/dataset level?
That works too, but you don't get *synchronous* mirroring that way.
If you need synchronous mirroring, then this needs to happen at the
vdev level OR you have to pay for the latency of mirroring in series
with txg commit. The latter is more painful than the former.
If you needed synchronous mirroring what about forcing everything through
the ZIL and then only commit transactions once it is on both sites.
I would think the txg level could be more of an async method and also used
for recovery processes in the event of a network failure/disruption.
To me adding replication at the vdev level seems like a management headache.
When you failover to the remote site you are going to need a fully working
system able to handle the load with the desired redundancy; if your goal is
business continuance.
Most businesses don't need sync level protection.
Well, to be fair, and IIUC, the Solaris 11 replication facility works
as txg replication, as you suggest.

And if you have a disaster there's really no chance that you won't
have to deal with cleaning up and getting to a known state no matter
what the filesystem did for you.

There's a question of how much (how many txgs) you should allow your
disaster recovery site to fall behind by, and whether you need to hold up
write I/Os on the main site to avoid falling behind (or build wider pipes
to your DR site). So at the very least I'd want txg replication to be able
to throttle writes on the primary pool if need be.

But! ZFS already does mirroring, and it already has to deal with
split-brainedness and so on, and it'd be nice to be able to take
advantage of your DR site redundancy as part of your production,
non-disaster recovery failed drive procedures. I.e., it'd be nice to
be able to reduce the cost of redundancy by reusing some of the
redundancy you already must commit to having. I'm not saying that
this is how people should plan their redundancy, but that if you're
budget constrained it's nice to have this option.

All Sašo is proposing is that ZFS acknowledge asymmetric mirroring by
reducing read I/O load on the far mirrors, something that seems
eminently reasonable to me. Garrett isn't rejecting that approach
entirely, just proposing a different method of reducing read load on
the far mirrors, so that the process is automatic (fewer knobs).

I'm not sure I like Garrett's proposal entirely, as subjecting random
readers to additional latency at random times could lead to
undesirable effects, like latency bubbles -- non-deterministic latency
at the very least with all other variables fixed. The nice thing
about Sašo's proposal is that latency effects will be deterministic,
while the nice thing about Garrett's proposal is fewer knobs and better
bandwidth utilization. We could have both: if ZFS observes wildly
asymmetric latencies for different vdevs it could act accordingly by
not reading from the vdevs with consistently much worse latencies.
But it might be really hard to establish what real latencies (or the
additional latency from a long pipe) might be: caching on the
controllers/drives will mask latency, and some reads will be necessary
to establish latency patterns, perhaps many many reads. Thinking more
about this I think Sašo's proposal is the better one.

Nico
--
Erik Trimble
2013-04-30 02:56:39 UTC
Permalink
Nice can of worms here. :-)

The more I think of it, the more I'm drawn back to a fundamental problem
of filesystems and block devices: what level of abstraction is
important, and where in the stack is it appropriate to combine
functionality?

ZFS is one of only a few designs which combine the logical volume
manager with the filesystem, and I think it's been a great success.
(AdvFS is the only other one I can think of off the top of my head).

However, we're now talking about performance of the block device that
ZFS depends on, and I'm not sure that adding functionality into ZFS is
the right move (despite my original musing on remote replication). The
Fault Management stuff is a good example here: ZFS depends a great deal
on it to determine whether a device is dead or merely slow. However,
ZFS itself does have some fault heuristics, too, so there does appear to
be some benefit there.

For remote replication, Pawel's mention of DRBD merits consideration,
since there's a whole bunch of additional functionality that would be
nice to have, and maybe ZFS itself isn't the primary place to put it.

Right now, we're talking about remote replication. But what about
something like a distributed block device? Here's a good example: 10
machines, each of which exports 2 volumes. On a client system, make a
RAIDZ of 5 vdevs, but each vdev is really a block device which is a
mirror of 2 of those exported devices? There's a whole lot of
intelligence that needs to be considered around latency, queuing, and
caching for each of those underlying exported volumes. I seriously doubt
that ZFS would be the proper place to code this in.
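
From the pool's point of view that would just look something like the following (device names made up; each "mirrorN" being a block device that some lower layer already mirrors across two of the exporting machines):

zpool create tank raidz mirror0 mirror1 mirror2 mirror3 mirror4

with all of the mirroring intelligence living below ZFS.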

The SCSI vs ATA layer is an analogy here, since the levels of caching
and queuing for performance are handled at the block driver level, not
inside ZFS, and I doubt there would be a sufficiently intelligent way to
do it within ZFS.

That is, I think we should consider adding features into ZFS where the
existing ZFS design provides a decided advantage (e.g. where having access
to the ZFS caching, writing and checksumming machinery makes the
implementation faster). We should probably defer to something further
down the stack when it is possible to localize the design there without
causing a ZFS performance hit. That is, I think features should go as
close to the hardware as possible, unless there's a very noticeable
performance advantage to having them combined with something further up
the I/O stack.

Thoughts?

-Erik

Oh, and it would be really nice to have a port of DRBD for illumos, not to
mention a seriously improved Network Block Device-style thing. :-)
Schlacta, Christ
2013-04-30 03:23:36 UTC
Permalink
What we need is something like aufs, but with end-to-end integrity checks
(a la ZFS) and some sort of copy-up mechanism: a transparent HSM designed
for modern filesystems.
Post by Erik Trimble
[snip]
Nico Williams
2013-04-30 03:24:50 UTC
Permalink
Post by Erik Trimble
Nice can of worms here. :-)
Yes.
Post by Erik Trimble
The more I think of it, the more I'm drawn back to a fundamental problem of
filesystems and block devices: what level of abstraction is important, and
where in the stack is it appropriate to combine functionality?
ZFS is one of only a few designs which combine the logical volume manager
with the filesystem, and I think it's been a great success. (AdvFS is the
only other one I can think of off the top of my head).
Lustre too, in a way...
Post by Erik Trimble
However, we're now talking about performance of the block device that ZFS
depends on, and I'm not sure that adding functionality into ZFS is the right
move (despite my original musing on remote replication). The Fault
Management stuff is a good example here: ZFS depends a great deal on it to
determine whether a device is dead or merely slow. However, ZFS itself does
have some fault heuristics, too, so there does appear to be some benefit
there.
What is it about higher-latency devices that makes it harder to detect
faults? We're not talking about latencies in the tens of seconds or
longer.
Post by Erik Trimble
Right now, we're talking about remote replication. But what about something
like a distributed block device? Here's a good example: 10 machines, each
of which exports 2 volumes. On a client system, make a RAIDZ of 5 vdevs,
but each vdev is really a block device which is a mirror of 2 of those
exported devices? There's a whole lot of intelligence that needs to be
considered around latency, queuing, and caching for each of those underlying
exported volumes. I seriously doubt that ZFS would be the proper place to
code this in.
What's with the 10 machines? ZFS is always local... or did you mean
that the 10 machines export LUNs via iSCSI?

Anyways, on the contrary, I think ZFS *is* the place to have the
correct logic because no other layer will have the required
understanding of the entire pool layout to make proper decisions. If
this is ETOOHARD then DRBD is ETOOHARD too, or ETOODANGEROUS.

Nico
--
Geoff Nordli
2013-04-30 05:52:24 UTC
Permalink
Post by Erik Trimble
[snip]
If we are focused on the replication piece, then the biggest advantage I
see for ZFS is you can do replication above the block device level.

You don't need to worry about write ordering, write log sizing, managing
which areas of the disk have changed when doing a resync, or on-disk
consistency.

You don't need to worry about disk/vdev layouts.

It should be as simple as: build the pool, link a target for replication,
set some thresholds and a replication type, and go. Then you need some
userspace tools to switch the pool between active and passive, plus all of
the other clustering goodies that get wrapped into this.

You should be able to pre-seed the remote pool with a zfs snapshot.

People are doing this right now with zfs send/receive but something like
this would be a lot more elegant.

Have a great day/night everyone!!

Geoff
Gregg Wonderly
2013-05-01 20:52:25 UTC
Permalink
Isn't there an AWS product that is an iSCSI device running inside a VM that is
"sync'd" across the internet to storage at AWS? I think it needs a specific
VMware VM type, but otherwise, you can just put one of those iSCSI devices up
against your physical drives, and have another physical copy at that VM, and
then a third copy on AWS...

Gregg Wonderly
Post by Erik Trimble
[snip]
Richard Elling
2013-04-30 06:07:44 UTC
Permalink
Post by Erik Trimble
Nice can of worms here. :-)
The more I think of it, the more I'm drawn back to a fundamental problem of filesystems and block devices: what level of abstraction is important, and where in the stack is it appropriate to combine functionality?
That is easy... moving the decisions closer to the application is always a better
choice from a system dependability perspective. You pay for it in time-to-market
and having-to-hire-awesome-developers though. I observe that the guys who
do this well can create very dependable solutions.
Post by Erik Trimble
ZFS is one of only a few designs which combine the logical volume manager with the filesystem, and I think it's been a great success. (AdvFS is the only other one I can think of off the top of my head).
ReFS is the most recent new addition to the fold. Other implementations exist that
are more block focused, but have abstractions very much like a file system: e.g. Virsto,
ExtremeFFS.
Post by Erik Trimble
However, we're now talking about performance of the block device that ZFS depends on, and I'm not sure that adding functionality into ZFS is the right move (despite my original musing on remote replication). The Fault Management stuff is a good example here: ZFS depends a great deal on it to determine whether a device is dead or merely slow. However, ZFS itself does have some fault heuristics, too, so there does appear to be some benefit there.
For remote replication, Pawel's mentioning of DRDB merits consideration, since there's a whole bunch of additional functionality that would be nice to have, and maybe ZFS itself isn't the primary place to put it.
DRBD, HAST, AVS, TrueCopy, SRDF, etc., are attempts to deal with the fact
that applications can't manage their redundancy in legacy environments. They
all suffer from the problem of not being able to provide a guarantee of a single
view of the data. To some degree, file systems have similar constraints, but there
are interfaces for managing the semantics of data synchronization available to
the developer.
Post by Erik Trimble
Right now, we're talking about remote replication. But what about something like a distributed block device? Here's a good example: 10 machines, each of which exports 2 volumes. On a client system, make a RAIDZ of 5 vdevs, but each vdev is really a block device which is a mirror of 2 of those exported devices? There's a whole lot of intelligence that needs to be considered around latency, queuing, and caching for each of those underlying exported volumes. I seriously doubt that ZFS would be the proper place to code this in.
Remote replication is really a trade-off between RPO and performance. Better RPO means
more inconsistent performance and vice versa.
Post by Erik Trimble
The SCSI vs ATA layer is an analogy here, since the levels of caching and queuing for performance are handled at the block driver level, not inside ZFS, and I doubt there would be a sufficiently intelligent way to do it within ZFS.
That is, I think we should consider adding features into ZFS where the existing ZFS design provides a decided advantage (e.g. by having access to the ZFS caching and writing and checksum-ing system makes the implementation faster). We should probably defer to something further down the stack when it is possible to localize the design there without causing a ZFS performance hit. That is, I think features should go as close to the hardware as possible, unless there's a very noticeable performance advantage to having them combined with something further up the I/O stack.
Disagree, features should move up the stack to the application. The database market is well
on its way to moving replication closer to the application.
Post by Erik Trimble
Thoughts?
-Erik
Oh, and it would be really nice to have a port of DRBD for illumos, not to mention a seriously improved Network Block Device-style thing. :-)
Block-level replicators tend to not do well when the block devices are large. Back when disks
were 9GB, something like AVS worked well. For a 5.6TB disk, block-level replication is a non-starter,
especially in a remote scenario. Also, check your block-level replicator for volume size limitations.
-- richard

--

***@RichardElling.com
+1-760-896-4422












Nico Williams
2013-04-30 06:18:06 UTC
Permalink
On Tue, Apr 30, 2013 at 1:07 AM, Richard Elling
Post by Richard Elling
Post by Erik Trimble
The more I think of it, the more I'm drawn back to a fundamental problem
of filesystems and block devices: what level of abstraction is important,
and where in the stack is it appropriate to combine functionality?
That is easy... moving the decisions closer to the application is always a better
choice from a system dependability perspective. You pay for it in time-to-market
and having-to-hire-awesome-developers though. I observe that the guys who
do this well, can create very dependable solutions.
Bingo. I frequently repeat something like this when it comes to
authentication protocols.

Think of IPsec: it sucks for end-to-end security because there are no
APIs for it, so apps have to trust configuration.

Generally, if not always, moving more of the policy that
applications/users need up the stack is the right answer.

Nico
--
Karl Wagner
2013-04-30 06:42:17 UTC
Permalink
I won't pretend to understand all of this completely, but I agree that
moving things closer to the application is almost always the best bet.

However, when someone else has done most of the work you need, it's often
better to use that rather than building things from scratch. With this in
mind I have an alternative proposal: why don't we expose some underlying
ZFS APIs to the world? This would allow it to be used more freely in other
projects. Case in point: with some vdev-level APIs, a pre-existing
distributed block device driver could gain closer integration with ZFS, and
this could be handled cooperatively between the two.

As another, separate argument for this, consider a database. It knows far
more about the data in it than ZFS ever will. With exposed APIs, it could
exercise more granular control over the storage of this data and optimise
it for its own use.

Hope this makes sense
Karl
Post by Nico Williams
[snip]
Nico Williams
2013-04-30 10:02:52 UTC
Permalink
In this case the highest layer that can be expected to understand the
networking/performance characteristics of the devices making up a pool
is ZFS (and its administration utilities, and via those, the
sysadmin). So that's where we should stop.

That's not to say that there aren't lots of things that should be
exported to user applications. Why shouldn't user apps get to
create/delete snapshots and so on? But that's another story.
Karl Wagner
2013-04-30 10:53:33 UTC
Permalink
What I was suggesting was a way to avoid bloating ZFS, more than anything
else.

ZFS is a great filesystem and volume manager. It already does a hell of a
lot more than you would expect from either of these combined. If we can
break out functionality in the form of both kernel and userland APIs,
additional functionality could be added to ZFS without bloating the core
product.

My initial expectations (which kicked off this discussion) of including
HSM-like functionality in ZFS are probably unrealistic. We all want
different things from ZFS, but why does it have to be in the ZFS code? Why
can we not allow ZFS to remain as "just" the filesystem/volume manager it
is now, but allow other software to use this in a more customisable way?

Taking it in the context of this discussion, one could use an existing
distributed block device project for its purpose, ZFS for its purpose,
and create a "bridge" project which links the two into the complete solution
you need.

Similarly, for HSM an external project could implement this on top of ZFS.
It would be much more efficient than one which uses just the filesystem
and/or zvol interfaces if it could get/manipulate info from deeper within
ZFS.

Databases could have specific "drivers" to exploit ZFS features to optimise
data structures for their own purposes.

Web servers could (where appropriate) read gzipped data directly from ZFS
without going through the decompress/compress process for sending
compressed data to a client.

The applications of this functionality have incredible scope. It also keeps
ZFS doing what it is good at: storing data.
Post by Nico Williams
[snip]
Jim Klimov
2013-04-30 14:03:07 UTC
Permalink
Post by Erik Trimble
Right now, we're talking about remote replication. But what about
something like a distributed block device? Here's a good example: 10
machines, each of which exports 2 volumes. On a client system, make a
RAIDZ of 5 vdevs, but each vdev is really a block device which is a
mirror of 2 of those exported devices? There's a whole lot of
intelligence that needs to be considered around latency, queuing, and
caching for each of those underlying exported volumes. I seriously doubt
that ZFS would be the proper place to code this in.
One thing to consider is the possible presence of errors and discrepancies
between the mirror components. If it is not ZFS managing them but some
other software (or hardware/firmware exporting a LUN, for that matter),
how can we guarantee that we read the correct data from the mirror, and
repair it if something broke (other than by forcing a
write from ZFS's top-level raidz, which should hopefully propagate to
all components)?

Meaning, say the sector at offset X is okay on mirror component 1 and bad on
component 0. If this were all managed by ZFS, it would read component 0,
see a checksum mismatch, re-read from component 1, and fix component 0,
without writing over the good copy (a write which could itself fail), and it
does all this quite transparently to the user. In the case of some other
system with a legacy approach, the mirror overall would return a bad or a
good sector 50% of the time, and, depending on consistency in the raidz
stripe over all such mirrors, recovery via parity may or may not be possible
even if there is a sufficient set of good sectors overall, because ZFS can't
request and inspect them all directly.
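
In code terms, the ZFS-managed behaviour just described is roughly this (a deliberately toy, self-contained model with a one-byte "checksum" and in-memory "components"; not the real pipeline):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLK     8

/* Toy stand-in for the block-pointer checksum. */
static uint8_t
toy_checksum(const uint8_t *buf)
{
        uint8_t s = 0;
        for (int i = 0; i < BLK; i++)
                s += buf[i];
        return (s);
}

/*
 * Self-healing mirror read: try each copy, verify it against the
 * checksum stored in the parent, and rewrite any copy that was skipped
 * because it failed verification.
 */
static int
mirror_read_selfheal(uint8_t copies[][BLK], int ncopies, uint8_t expected,
    uint8_t *out)
{
        for (int c = 0; c < ncopies; c++) {
                if (toy_checksum(copies[c]) != expected)
                        continue;                       /* bad copy, try the next one */
                memcpy(out, copies[c], BLK);
                for (int bad = 0; bad < c; bad++)       /* repair what we skipped */
                        memcpy(copies[bad], out, BLK);
                return (0);
        }
        return (-1);                                    /* no trustworthy copy at all */
}

int
main(void)
{
        uint8_t copies[2][BLK] = {
                { 9, 9, 9, 9, 9, 9, 9, 9 },     /* component 0: silently corrupted */
                { 1, 2, 3, 4, 5, 6, 7, 8 },     /* component 1: good */
        };
        uint8_t buf[BLK];

        /* 36 == checksum of the good data, as recorded in the parent block */
        if (mirror_read_selfheal(copies, 2, 36, buf) == 0)
                printf("read ok, component 0 repaired: %d\n",
                    memcmp(copies[0], copies[1], BLK) == 0);
        return (0);
}

A legacy mirror has no "expected" value to compare against, which is exactly the problem above.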

So, does this hypothetical mirroring (local-remote) layer have a similar
capability with checksums or some other means of determining which half
of the differing data to trust, if any?

Overall, I think if this solution is to be built, it should be within
ZFS, which is somehow made aware of disk locality and hence latencies, etc.
There would be less finger-pointing and more understanding about data
integrity and consistency.

Just like, now that there is ZFS with its integrity guarantees (or at
least verifiable good data), I have no idea whether any other RAID/NAS
solutions can be similarly trusted in a scenario like the above, when no
disk returns an explicit IO error or dies, but whatever the RAID set
returns is not really consistent - what view of "correct" data is then
chosen, how, how correct it really is, and is this at all detected or even
detectable on such systems? Those are unanswered questions for me so
far, and I've asked a few vendors' sales technicians ;)

//Jim
Nico Williams
2013-04-29 18:03:07 UTC
Permalink
I agree with Garrett that auto-tuning by measuring I/O latencies and
keeping an exponentially decaying average seems better. For one thing,
there's no need for policy in that case.

However, when the user really wants to insist on a policy like "only
write to this mirror; don't bother reading from it, except in order to
recover from errors on other mirrors, or to verify past writes", there
ought to be a way to do that.

Nico
--
Robert Milkowski
2013-05-01 12:38:15 UTC
Permalink
-----Original Message-----
However, when the user really wants to insist on a policy like "only
write to this mirror; don't bother reading from it, except in order to
recover from errors on other mirrors, or to verify past writes", there
ought to be a way to do that.
Yes, this would definitely be useful and that's something VxVM can do.
--
Robert Milkowski
http://milek.blogspot.com
Pawel Jakub Dawidek
2013-04-29 18:26:21 UTC
Permalink
Post by Sašo Kiselkov
Post by Richard Elling
Manually specifying which side of the mirror is to be write-mostly is clearly an ugly solution and,
indeed, when we've had this sort of thing in the past (VxVM) it became painful to manage.
Automation is really the answer, and today there isn't a good solution for this automation in
illumos. This is not an easy problem to solve, however (see my previous query in the old thread
about how can we tell if a disk is "busy")
For example, suppose we have a metro cluster and want to be able to failover to the remote
datacenter. In this case, when the pool is imported on the remote, it should prefer its local
side of the mirror for read. To do this, we'd have to build a hostid/leaf vdev mapping. Expand
to a 3-way mirror and it gets really, really ugly.
This is something I've been thinking about in the recent past and it
[...]

Why not just use software designed to mirror two block devices over the
network? Like DRBD for Linux, HAST (my baby) for FreeBSD or equivalent
for IllumOS?

ZFS was not designed for this. The first problem I see is split-brain
detection. If both nodes lose connection and make incompatible changes,
ZFS will detect this as data corruption and may well overwrite the
changes of one node (note that both nodes will bump the txg). What it
should do instead (DRBD and HAST already do this) is report the
situation and refuse to connect and synchronize anything. Plus, DRBD
and HAST already know which device is local and which is remote, so they
use the local one for reading.
--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Richard Elling
2013-04-27 02:19:37 UTC
Permalink
Hi Karl,
more below...
Post by Karl Wagner
Hi Richard,
Thanks for the info. Some of this is very useful.
I have made some comments below.
Hi Karl,
Some comments below...
Post by Karl Wagner
Combined cache and log device: This would allow cache devices to contain virtual log devices. Blocks would be allocated as needed, allowing them to grow or shrink. Obviously you would need a pair of cache devices for redundancy, but only the log blocks would need to be mirrored. Allowing this more dynamic caching system would simplify the setup of a ZFS pool, as well as allowing (potentially) more space to be available for the next feature.
This is possible today using disk partitions and is frequently done in small deployments.
OTOH, creating a hybrid log/cache also adds significant complexity to the administration
and troubleshooting of the system. I'm not convinced that complexity is worth the effort,
when partitioning is already an effective way to manage devices.
WRT system administration, I believe this could be a way to massively simplify it for small deployments. I understand that partitioning is an effective method of placing the cache and log on the same device, but it requires both an understanding of how ZFS cache and log work, and an understanding of your current (and future) workloads to decide on the sizes, and whether you even need either. Combining the 2 would not "waste" space for a barely used log device. It would become a set-and-forget option: Install a (pair of) SSD(s), make them into a combined cache/log, and the system will use it as necessary.
Today, you need to protect the ZIL (mirror), but cannot protect the L2ARC.
The cognitive problem is that people think about these very differently and do not
expect the two behaviours to be mixed on the same devices.
Post by Karl Wagner
Having had a quick read of how the persistent cache works (at http://wiki.illumos.org/display/illumos/Persistent+L2ARC), I would suggest the following. It would appear to me to be "reasonably" simple to implement, although I have yet to get my head around the source code so it may be much more complicated than I realise. Note that this is obviously predicated on the persistent cache device, as you don't want your ZIL going missing on a system crash.
When a sync write comes in, we grab exclusive control of the l2arc_feed_thread.
This is forced to write an immediate pbuf (possibly with a flag set saying the next block is a log block) to at least n devices (which would either be predefined, maybe all devices, or possibly a user-defined value).
Write the log record to the selected devices, followed by another pbuf.
Release control of l2arc_feed_thread.
That's the writing taken care of. The locations of the logs on the cache devices would be held in RAM, which could be rebuilt when needed (e.g. system crash).
If the log record is too small for this to be efficient, we should probably allocate extra space so that additional log records can be fit in the same space. I would need to understand the inner working better to make a call on this.
AFAIK, the log device loops back to the beginning when it reaches the end, so there would need to be code in place to ensure that log records are not overwritten until it is safe to do so (assuming I am correct that log records are "removed" when they are no longer needed).
Post by Karl Wagner
ZIL txg commit delay and ARC eviction: I am not sure I am using the correct terminology, but this seems to fit. What I am suggesting is that, when it comes to committing data which is held in the ZIL, we check how "busy" the pool is. If it is going to degrade the performance of the pool to force a commit of that data, we skip it and wait. In addition, IIRC the data waiting to be committed is currently held in the ARC. With this change, we allow this data to be evicted (or pushed to the L2ARC), and then recalled when we are ready to commit. This, along with the next (possibly unneeded) feature, allow a LOG device to become a real write cache.
The reason the ZIL exists is to satisfy the commit-to-persistent-media semantics of storage
protocols. The ARC itself is a write cache, so adding complexity there, or worse -- adding
disk I/O latency, is not likely to be a win.
NB, if you want something more PAM-like, then pay for a fast, nonvolatile SSD for log and set
sync=always. Voila! Just like Netapp! :-)
Otherwise, this sounds like the logbias option, or its automated cousin. In any case, L2ARC does not
apply here because the data will be long since committed to the pool before it is considered for
movement to L2ARC.
The ARC may be a write cache. However, IIRC it only caches very small amounts, committing to disk very quickly.
The commit interval is variable, based on load. But you can build designs based on convenient rules of thumb.
For example, if you have an infeed of 10GbE (~ 1 GByte/sec) and the modern default transaction group
commit interval is 5 seconds, then you need at least 5GB for slog, but not likely more than 10GB. Whether or
not you think this is a small number depends on your workload, but I've seen production systems that have
more than one 10GbE link for infeed, and proper design wins.
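To put rough numbers on that rule of thumb (using the example figures above;
the assumption of roughly two transaction groups in flight at once is mine):

    # Back-of-the-envelope slog sizing for the example above.
    infeed_bytes_per_sec = 1 * 1024**3   # ~1 GByte/s for a single 10GbE link
    txg_interval_sec     = 5             # default txg commit interval cited above
    txgs_in_flight       = 2             # assumed: one open txg plus one syncing

    min_slog_gb  = infeed_bytes_per_sec * txg_interval_sec / 1024**3
    safe_slog_gb = min_slog_gb * txgs_in_flight
    print(min_slog_gb, safe_slog_gb)     # -> 5.0 10.0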
Post by Karl Wagner
What I am trying to propose is a much larger cache which would deal with peak write loads way in excess of what the pool's underlying storage can handle. Rather than slowing everything down to a crawl, the written data goes into the log where it could sit for some minutes. Meanwhile, the normal work is served from the pool.
Today, written data goes into the ARC. Sync writes also go to the ZIL. Data is normally not read from the ZIL
(for good reason). So the ARC needs to be sized to match your workload. No matter where you choose to cache,
the sizing requirement remains.
Post by Karl Wagner
It would work the other way around, too. A heavy (sequential) read workload is taking place, but other clients are operating lighter mostly write workloads. The heavy sequential read is mostly coming straight from the main pool vdevs. The writes coming in from the other clients are cached, safely, in the combined cache/log, until the pool is able to accept the writes.
Reads, including prefetches, are stored in the ARC. So the ARC policies apply to both.
For heavy read workloads, and decent hardware, you can easily get 6-10 GB/sec from
lots of HDDs or a modest number of SSDs. The scenario you fear -- small writes clobbering
large reads -- is quite rare in my experience; so rare that I can't recall any such case.
Post by Karl Wagner
Sidebar question: how would one decide that the pool is "busy"?
I don't know exactly. But I would say you would look at outstanding transactions, both read and write. If there are too many, it's busy. This could possibly be tunable with limits on latency. This is probably a whole topic in itself. I admit that I don't know enough to answer this one :)
It is a whole topic, and simple queues don't work well, especially for HDDs.
Post by Karl Wagner
Just a quick note: One of the projects I am currently looking into is, basically, a simplified HSM system. The underlying pool storage would be on slow media (the 2 I am thinking of are a mechanical disk library, similar to a tape library but using HDDs, or "cloud storage", with limited bandwidth out to the internet). In this situation, even a hard disk is a hell of a lot quicker than the main storage, hence why I was suggesting additional cache levels. So, a write comes in (which could be many gigabytes) and it is safely "buffered" in the log. It can then be "drip fed" to the slow storage. Meanwhile, there is still an L2ARC on the combi cache/log (SSD-based), and there is also a large cache on LnARC, which are just HDDs, avoiding expensive pulls from the main storage.
Methinks you can code this in userland and make it filesystem agnostic much faster
than you could optimize at the filesystem level. Indeed, I believe this is exactly how
the vast majority of developers solve this problem.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Sašo Kiselkov
2013-04-24 17:24:25 UTC
Permalink
Post by Karl Wagner
Hi all
I currently run the FreeBSD 9.1 release on my home file server. I'm not
sure whether this is the right place to be discussing ZFS on FreeBSD, but
it seems that improvements here are ported over. If I am on the wrong list,
please let me know.
Hi Karl,

We welcome any and all discussion on ZFS here, so feel free to dig in,
even though most (but not all) of us are running some kind of
Illumos-derived system.
Post by Karl Wagner
To get to the point, I was wondering when L2ARC compression and persistence
would be available in FreeBSD. It may be that they are already available in
another branch, but I couldn't find that info. I would prefer to stick with
releases anyway.
FreeBSD is pretty quick in pulling changes that are committed to
Illumos. L2ARC compression is, I believe, stable, but it has yet to go
through a formal code review process - I've requested it on the list
before, but the relevant people seem busy. The persistency thing is a
bit more complicated. The work has been done and I *believe* it is okay,
but it needs more testing to be sure I got everything right.

In short: expect compressed L2ARC in your upcoming FreeBSD release, but
persistent L2ARC might need a bit more time.
Post by Karl Wagner
On to the feature request. I believe that all the groundwork is there to
add a few features which would present a potentially large improvement to
- Combined cache and log device: This would allow cache devices to
contain virtual log devices. Blocks would be allocated as needed, allowing
them to grow or shrink. Obviously you would need a pair of cache devices
for redundancy, but only the log blocks would need to be mirrored. Allowing
this more dynamic caching system would simplify the setup of a ZFS pool, as
well as allowing (potentially) more space to be available for the next
feature.
This doesn't make much sense from my perspective, as log and cache
devices are two fundamentally different things:

1) slogs are small and write-IOPS constrained
2) caches are large and read-IOPS constrained

You can combine the two even today, e.g. by carving a few GB out of your
cache devices into a separate partition and attaching that as a slog.
However, I'm not convinced it makes much sense from a performance
perspective.
Post by Karl Wagner
- ZIL txg commit delay and ARC eviction: I am not sure I am using the
correct terminology, but this seems to fit. What I am suggesting is that,
when it comes to committing data which is held in the ZIL, we check how
"busy" the pool is. If it is going to degrade the performance of the pool
to force a commit of that data, we skip it and wait.
This would necessitate a significant reworking of how the transaction
machinery in the DMU works and would make it quite fragile, to say the
least. It's also somewhat unclear to me what this change would achieve.
If your apps are performance-sensitive on the write side, implement a
writer queue in a separate thread. If they are sensitive on the read
side, implement custom prefetching, if the default ZFS prefetch code
doesn't work well for you. This is what I do in my real-time apps.

Now of course there is a case to be made for working on the prefetcher
and making it better, but that needs a more specific problem with more
specific operational parameters defined.
Post by Karl Wagner
In addition, IIRC the data waiting to be committed is currently held in the ARC.
No, the ARC is only used for reading. Data to be written is held in the DMU.
Post by Karl Wagner
With this
change, we allow this data to be evicted (or pushed to the L2ARC), and then
recalled when we are ready to commit.
Data is never "pushed" to the L2ARC; it is "pulled" from the ARC by the
l2arc_feed_thread. There is no direct path for a buffer from the ARC to
the L2ARC: the feed thread decides, at its own discretion, whether to
admit a block into the L2ARC or reject it.
Post by Karl Wagner
This, along with the next (possibly
unneeded) feature, allow a LOG device to become a real write cache.
- Async ZIL push: IIRC, only sync writes cause entries in the ZIL to be
written. I may be completely wrong. However, if this is the case, I would
propose changing this in line with the above feature. Any async written
data would be allowed to be evicted from the ARC by writing an entry to the
ZIL.
Writes generally come in two flavors:

1) sync
2) async

In the first instance, apps care about how quickly the data is committed
to stable storage, and this is already handled by the ZIL using
dedicated log devices. In the second instance, apps don't care when the
blocks make it out to stable storage, so I see relatively little sense
in doing this double caching work.
Post by Karl Wagner
- Prioritised cache devices: Allow multiple cache devices to be given
priorities/levels, such that data to be evicted from the top level L2ARC is
actually migrated down to the next level. This would basically become a
multi-level HSM system.
Due to how the L2ARC is implemented, this is non-trivial, but doable. It
becomes radically simpler if we limit the cache hierarchy to a few fixed
levels, e.g. L2ARC and L3ARC, or L{2,3,4}ARC. However, this should
mirror the world of storage devices. Is there some third performance
tier that we would need to handle, e.g. something in between ARC and
L2ARC, or between L2ARC and HDDs? Also keep in mind that the performance
delta has to be significant in order to warrant the extra work of moving
data between cache tiers (which is a lot more expensive than ARC ->
L2ARC; first you'd need to fetch the data from the higher tier in order
to move it to the lower tier). I'm not convinced such a tier exists, and
even if it did, there are probably significantly cheaper ways to go
about it (e.g. just buying more L2ARC SSDs and being done with it).
Post by Karl Wagner
- DDT preference/forced load in L2ARC: Unrelated to the rest. As we all
know, ZFS dedupe is very much dependant on having enough RAM and/or L2ARC
space available. What would be nice is, especially on a persistent log
device, to be able to tell ZFS to keep the DDT in L2ARC. If not on a
persistent device, allow it to force a load of the DDT into ARC/L2ARC on
boot/import.
This is something I've been thinking about and I think we could cover
this by implementing a cache policy mechanism, which would allow an admin
to control how they prefer their caches to be used. Since the DDT counts
towards metadata, having the ability to dedicate a chunk of ARC/L2ARC to
this would be a nice thing to have. That being said, I'm not sure how
much of a performance impact it would have versus just leaving it as it
is (i.e. self-tuning with an enforced upper bound on metadata). One
could even argue that this is something we already have.
Post by Karl Wagner
- Offline/delayed dedupe: Allow dedupe to be set in such a way that
incoming writes are not checked against the DDT immediately. Instead, they
are committed as if dedupe was off. Then, allow a background process to
examine this data and check for duplicates to be kicked off (like a scrub).
This could be manually from the command line, scheduled, or possibly
automatically by ZFS when it detects a "quiet" pool, suspending if activity
is detected. This could, possibly, allow the space savings of dedupe to be
realised on large datasets by those without the RAM required for the
current dedupe implementation.
This feature has already been discussed on this list and I suggested a
few methods on how to achieve this today. However, you need to consider
that the RAM savings you propose are probably next to nonexistent. When
deduping your data offline, you are still going to have to load the
entire DDT into ARC/L2ARC during the process (since you'll be examining
all data in the particular dataset). The only thing that this protects
you from is taking that hit when you don't want to.

Cheers,
--
Saso
Sašo Kiselkov
2013-04-24 17:32:09 UTC
Permalink
Post by Sašo Kiselkov
No, the ARC is only used for reading. Data to be written is held in the DMU.
Scrub that. In-flight data is held in the ARC, but the old buffers
aren't reused; new "anonymous" buffers are allocated instead. Therefore
cached data is effectively dropped from the L2ARC as soon as it's
written to, which is logical.

Cheers,
--
Saso
nwf
2013-05-01 08:23:58 UTC
Permalink
[snip]
Post by Sašo Kiselkov
Post by Karl Wagner
- Offline/delayed dedupe: Allow dedupe to be set in such a way that
incoming writes are not checked against the DDT immediately. Instead, they
are committed as if dedupe was off. Then, allow a background process to
examine this data and check for duplicates to be kicked off (like a scrub).
This could be manually from the command line, scheduled, or possibly
automatically by ZFS when it detects a "quiet" pool, suspending if activity
is detected. This could, possibly, allow the space savings of dedupe to be
realised on large datasets by those without the RAM required for the
current dedupe implementation.
This feature has already been discussed on this list and I suggested a
few methods on how to achieve this today. However, you need to consider
that the RAM savings you propose are probably next to nonexistent. When
deduping your data offline, you are still going to have to load the
entire DDT into ARC/L2ARC during the process (since you'll be examining
all data in the particular dataset). The only thing that this protects
you from is taking that hit when you don't want to.
That's not necessarily true. HAMMER2 intends, AFAIK, to do incremental
deduplication using a sliding-window technique which bounds the amount of
RAM used (to a constant!) but at the expense of needing several passes over
the data to achieve full deduplication. (For my workloads, which are
write-spikey but have gobs of idle time between spikes, this would be
amazingly useful.)

Note that HAMMER2 is designed from the beginning to move data around all the
time, in contrast to the relatively stationary life that data in ZFS leads.
Among other things, HAMMER2 does not actually refcount blocks at all and
relies on a copying GC to evacuate allocation arenas for reuse, much like a
log-structured FS. The GC is in an excellent position to do incremental
deduplication along with its (defragmenting) evacuation work.

Specifically, the technique is something like this:
1) Allocate RAM for up to N dedup metadata entries (hash->blockptr),
which will be maintained sorted ascending by hash.

2) Initialize a "smallest tracked hash" to 0 and a "largest tracked
hash" to all ones.

3) Read the tree on disk in order. For each block, if its hash is
lower than the smallest tracked hash or greater than the largest,
do not attempt to deduplicate it, but continue recursing through it.

If this block matches one in the current table, rewrite the spine
of the tree to pivot the block pointer to the tabled entry and
carry on. Note that these rewrites are done CoW, so they necessarily
spill into new regions of the disk; we'll clean that up, and, indeed,
dedup these blocks, too, eventually.

Otherwise, insert its hash and blkptr(s) into the table of N entries.
If this displaces a hash entry off the large side, adjust the
largest tracked hash to be the largest entry still in the table.

4) If the largest tracked hash is not all ones, empty the table, set the
smallest tracked hash to be the largest, the largest to be all ones,
and repeat step 3.

("Read the tree in order" is, indeed, a lot of data over the wire; a full
ZFS scrub. It's potentially important to note, though, that because the
data will often have been rewritten in logical order by the GC, this is
likely to be mostly sequential reads. Refinements can of course also be
made based on the age of allocation arenas, approximate reference counts,
(temporarily) allocating room on disk for a DDT-like index or bloom filters,
and so on.)
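A rough Python sketch of one window pass as I understand it (walk_tree_in_order
and rewrite_parent are hypothetical helpers standing in for the real tree
traversal and the CoW spine rewrite):

    def dedup_pass(root, capacity, low, high, walk_tree_in_order, rewrite_parent):
        table = {}                # hash -> blkptr, bounded to 'capacity' entries
        cur_high = high           # shrinks whenever the table overflows
        for h, bp in walk_tree_in_order(root):
            if h < low or h > cur_high:
                continue          # outside the window; another pass handles it
            if h in table:
                rewrite_parent(bp, table[h])   # pivot onto the tabled copy (CoW)
            else:
                table[h] = bp
                if len(table) > capacity:
                    # Displace the largest hash and shrink the window to match.
                    del table[max(table)]
                    cur_high = max(table)
        return cur_high           # == high only if the whole window fit

    def incremental_dedup(root, capacity, max_hash, walk, rewrite):
        low = 0
        while True:
            reached = dedup_pass(root, capacity, low, max_hash, walk, rewrite)
            if reached == max_hash:
                break             # hash space fully covered
            low = reached         # slide the window up and scan again

RAM stays bounded by 'capacity'; the number of passes grows as the data
outgrows the table, which is exactly the trade-off described above.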

Full credit for all of the above goes to Matthew Dillon; errors in
comprehension are mine. Hopefully it is nonetheless good food for thought.
:)
--nwf;
Matthew Ahrens
2013-05-01 16:43:19 UTC
Permalink
On Wed, May 1, 2013 at 1:23 AM, nwf <
Post by nwf
Note that HAMMER2 is designed from the beginning to move data around all the
time, in contrast to the relatively stationary life that data in ZFS leads.
Cool!

Among other things, HAMMER2 does not actually refcount blocks at all


Neither does ZFS; we know immediately when to free blocks by using the
birth times; see https://blogs.oracle.com/ahrens/entry/is_it_magic
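Roughly, and much simplified (live-filesystem case only; the names and the
deadlist handling here are illustrative, not the actual code paths):

    def maybe_free(bp_birth_txg, newest_snapshot_txg, free_now, deadlist):
        # A block born after the newest snapshot cannot be referenced by any
        # snapshot, so it is freed the moment the live filesystem drops it --
        # no per-block reference count is needed.
        if bp_birth_txg > newest_snapshot_txg:
            free_now()
        else:
            # An older snapshot may still reference it; defer the decision to
            # snapshot-destroy time (deferred blocks go on a deadlist).
            deadlist.append(bp_birth_txg)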

--matt
Nico Williams
2013-05-01 17:03:35 UTC
Permalink
Post by Matthew Ahrens
On Wed, May 1, 2013 at 1:23 AM, nwf
Post by nwf
Note that HAMMER2 is designed from the beginning to move data around all the
time, in contrast to the relatively stationary life that data in ZFS leads.
Cool!
Indeed.
Post by Matthew Ahrens
Post by nwf
Among other things, HAMMER2 does not actually refcount blocks at all
Neither does ZFS; we know immediately when to free blocks by using the
birth times; see https://blogs.oracle.com/ahrens/entry/is_it_magic
Would it be fair to say that ZFS refcounts snapshotted TXGs and
because we track block birth TXGs we effectively refcount blocks, if
indirectly?

Nico
--
Matthew Ahrens
2013-05-01 17:33:27 UTC
Permalink
Post by Nico Williams
Post by Matthew Ahrens
Post by nwf
Among other things, HAMMER2 does not actually refcount blocks at all
Neither does ZFS; we know immediately when to free blocks by using the
birth times; see https://blogs.oracle.com/ahrens/entry/is_it_magic
Would it be fair to say that ZFS refcounts snapshotted TXGs and
because we track block birth TXGs we effectively refcount blocks, if
indirectly?
No, I don't think so. Where is there a refcount on a txg?

Not that there's anything inherently bad about refcounting, it just isn't
the way ZFS does it.

--matt
Nico Williams
2013-05-01 17:49:27 UTC
Permalink
Post by Matthew Ahrens
Post by Nico Williams
Would it be fair to say that ZFS refcounts snapshotted TXGs and
because we track block birth TXGs we effectively refcount blocks, if
indirectly?
No, I don't think so. Where is there a refcount on a txg?
It's not that there's an actual refcount on a txg, but that ZFS knows
which (and so how many) snapshots include a given txg.

Nico
--
nwf
2013-05-01 23:39:37 UTC
Permalink
Post by Matthew Ahrens
On Wed, May 1, 2013 at 1:23 AM, nwf <
Post by nwf
Note that HAMMER2 is designed from the beginning to move data around all the
time, in contrast to the relatively stationary life that data in ZFS leads.
Cool!
Among other things, HAMMER2 does not actually refcount blocks at all
Neither does ZFS; we know immediately when to free blocks by using the
birth times; see https://blogs.oracle.com/ahrens/entry/is_it_magic
I thought that was only true of non-deduped data. Deduped data is tracked by
refcount in the DDT, no? (that's the purpose of the ddp_refcnt field, right?)

Thanks.
--nwf;
Nico Williams
2013-05-01 17:32:16 UTC
Permalink
On Wed, May 1, 2013 at 3:23 AM, nwf
Post by nwf
That's not necessarily true. HAMMER2 intends, AFAIK, to do incremental
deduplication using a sliding-window technique which bounds the amount of
RAM used (to a constant!) but at the expense of needing several passes over
the data to achieve full deduplication. (For my workloads, which are
write-spikey but have gobs of idle time between spikes, this would be
amazingly useful.)
Note that HAMMER2 is designed from the beginning to move data around all the
time, in contrast to the relatively stationary life that data in ZFS leads.
Among other things, HAMMER2 does not actually refcount blocks at all and
relies on a copying GC to evacuate allocation arenas for reuse, much like a
log-structured FS. The GC is in an excellent position to do incremental
deduplication along with its (defragmenting) evacuation work.
[snip]
Actually, this sounds a lot like something I was pushing in private
e-mails recently.

Because of snapshots we can't just overwrite an old block after moving
it (and possibly transforming it, such as recompressing with a different
algorithm). We instead need to treat blocks as content-addressed, and to
have a *logical* DDT for handling moves of blocks that might still be
referenced somewhere (including by in-core data structures!).

Now, a logical DDT for bp re-write need not be the same thing as a
full-blown DDT (though that would work). That means we can optimize
such a logical DDT in many ways:

- we can leave forwardings to new block locations in the old block
locations (this requires more free space for swinging the bp
re-write);

- we can have a list of mini-DDTs, with appropriate indexing
techniques, one per txg being re-written, to make lookups faster
(because, among other things, we can throw out mini-DDTs as we no
longer need them; a rough sketch follows this list);

- we can even have a full-blown DDT but one that is temporary and
separate from the main one (if any) so that we can delete it when
we're done / prune it as we go;

- if the Bloom filter idea turns out to provide a desirable trade-off
then we could also apply it to the bp re-write logical DDT.
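To make the mini-DDT idea concrete, here is a rough sketch (all names are
hypothetical): one small table per txg being re-written, lookups consult the
newest table first, and a whole table is dropped once every pointer into that
txg has been fixed up.

    class MiniDDTList:
        def __init__(self):
            self.tables = {}                  # txg -> {checksum: new_dva}

        def record_move(self, txg, checksum, new_dva):
            self.tables.setdefault(txg, {})[checksum] = new_dva

        def lookup(self, checksum):
            # Newest re-written txg first; a hit means "this block has moved".
            for txg in sorted(self.tables, reverse=True):
                new_dva = self.tables[txg].get(checksum)
                if new_dva is not None:
                    return new_dva
            return None                       # not recently moved; trust the blkptr

        def retire(self, txg):
            # Every reference into this txg has been re-written: drop its table.
            self.tables.pop(txg, None)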

To think of ZFS as content-addressed... the DVAs in a blkptr_t need to
be seen as a cache of DDT lookup results. So much so that I wonder why
we bother including the DVAs when we compute block checksums (for blocks
that contain blkptr_ts, obviously): they're just a cache.

If we didn't checksum DVAs then we could have significantly more
stable Merkle hash chain roots (e.g., dnode checksums, which aren't
roots, but for a whole file they effectively are): they'd only change
when the content changes / they'd not change when only the location
changes. Having stable checksum/hashes we could expose to apps would
be really useful, though, to be fair, checksums would still not be
stable in the face of re-compression, encryption key changes,
recordsize changes, ...

Of course, computing checksums without including DVAs would be
painful. It'd be better to not have DVAs in blkptr_t at all; instead
blocks that contain blkptr_t (e.g., indirect blocks) should have a
contiguous section that is an array of DVAs for the DVA-less blkptr_ts
in the rest of that block, which would make it easy to exclude that
section when computing block checksums. (The size of the DVA section
could be included in the blkptr_t for any block that contains
blkptr_ts, that way it's always easy to re-compute checksums for
validation.)
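As a sketch of that layout idea (a hypothetical format, not ZFS's actual
on-disk one): the indirect block is laid out as [DVA array | DVA-less
blkptr_ts], the parent records the length of the DVA section, and the
checksum covers only the part after it.

    import hashlib

    def indirect_block_checksum(block_bytes, dva_section_len):
        # Exclude the DVA array (a location cache) so the checksum depends only
        # on content: moving child blocks rewrites the DVA section but leaves
        # this checksum -- and every Merkle root above it -- unchanged.
        return hashlib.sha256(block_bytes[dva_section_len:]).digest()

    def verify_indirect_block(block_bytes, dva_section_len, expected_sum):
        return indirect_block_checksum(block_bytes, dva_section_len) == expected_sum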
Post by nwf
1) Allocate RAM for up to N dedup metadata entries (hash->blockptr),
which will be maintained sorted ascending by hash.
This sounds like a mini DDT!
Post by nwf
2) Initialize a "smallest tracked hash" to 0 and a "largest tracked
hash" to all ones.
3) Read the tree on disk in order. For each block, if its hash is
lower than the smallest tracked hash or greater than the largest,
do not attempt to deduplicate it, but continue recursing through it.
You mean traverse in-order, as opposed to pre- or post-?
Post by nwf
If this block matches one in the current table, rewrite the spine
of the tree to pivot the block pointer to the tabled entry and
carry on. Note that these rewrites are done CoW, so they necessarily
spill into new regions of the disk; we'll clean that up, and, indeed
dedup these blocks, too, eventually.
Otherwise, insert its hash and blkptr(s) into the table of N entries.
If this displaces a hash entry off the large side, adjust the
largest tracked hash to be the largest entry still in the table.
4) If the largest tracked hash is not all ones, empty the table, set the
smallest tracked hash to be the largest, the largest to be all ones,
and repeat step 3.
("Read the tree in order" is, indeed, a lot of data over the wire; a full
ZFS scrub. It's potentially important to note, though, that because the
data will often have been rewritten in logical order by the GC that this is
likely to be mostly sequential reads. Refinements can of course also be
made based on the age of allocation arenas, approximate reference counts,
(temporarily) allocating room on disk for a DDT-like index or bloom filters,
and so on.)
BTW, this reminds me greatly of the BSD4.4 LFS (which, to my
knowledge, was never really completed at the time), which required a
GC to re-write old transactions to effectively de-fragment. LFS wrote
fixed-sized transactions in a single (contiguous) block (though it
could be written incrementally) and as time went by older transactions
ended up with more and more dead blocks (i.e., no longer reachable
from the latest transaction). This meant that LFS could run out of
contiguous free space for writing a transaction even though there was
plenty of free space all over older transactions. The GC needed to
find old transactions, mark up logically free space, then re-write
still-live old blocks into the currently-open transaction.

(There are some striking conceptual similarities between ZFS and LFS,
incidentally, though the latter was mostly vaporware and really needed
a lot more work. I don't know if Jeff Bonwick, Bill Moore, Matt
Ahrens, and the rest were aware of LFS though.)
Post by nwf
Full credit for all of the above goes to Matthew Dillon; errors in
comprehension are mine. Hopefully it is nonetheless good food for thought.
:)
It is!

Nico
--
Timothy Coalson
2013-05-01 22:18:27 UTC
Permalink
Post by Nico Williams
Of course, computing checksums without including DVAs would be
painful. It'd be better to not have DVAs in blkptr_t at all; instead
blocks that contain blkptr_t (e.g., indirect blocks) should have a
contiguous section that is an array of DVAs for the DVA-less blkptr_ts
in the rest of that block, which would make it easy to exclude that
section when computing block checksums. (The size of the DVA section
could be included in the blkptr_t for any block that contains
blkptr_ts, that way it's always easy to re-compute checksums for
validation.)
All data read from disk must be checksummed before it can be trusted. If
we aren't going to checksum the DVAs in that location, then we can't trust
them after reading them from disk, so what is the point of having them
there?

This seems obvious enough that it has probably been thought of before, but
if we indirected all filesystem blocks through an arbitrary identifier that
tries to keep locality of related information, we could remove DVAs from
all filesystem metadata blocks, making it much easier to move filesystem
blocks (basically, we wouldn't need bp_rewrite). I'm guessing the reason
this wasn't done has to do with the extra vdev reads associated with
filesystem reads that miss the cached portion of the lookup structure?

Tim
Nico Williams
2013-05-01 23:15:30 UTC
Permalink
Post by Timothy Coalson
Post by Nico Williams
Of course, computing checksums without including DVAs would be
painful. It'd be better to not have DVAs in blkptr_t at all; instead
blocks that contain blkptr_t (e.g., indirect blocks) should have a
contiguous section that is an array of DVAs for the DVA-less blkptr_ts
in the rest of that block, which would make it easy to exclude that
section when computing block checksums. (The size of the DVA section
could be included in the blkptr_t for any block that contains
blkptr_ts, that way it's always easy to re-compute checksums for
validation.)
All data read from disk must be checksummed before it can be trusted. If
we aren't going to checksum the DVAs in that location, then we can't trust
them after reading them from disk, so what is the point of having them
there?
Let's say that the DVAs got corrupted. So you'll read the wrong
block. And then the checksum from the blkptr_t won't match. So
you're good. No harm results. (Although you'll want a
cryptographic-strength hash.)

All checksumming DVAs accomplishes is saving you some read I/Os on the
way to detecting corruption. Also, it makes the checksums depend on
block locations, which kinda sucks: vdev evacuation and
defragmentation would necessarily result in the pointers to dnode
blocks changing even though no content changed, which makes any notion
of exposing "file checksums" from the filesystem to the app...
pointless.

But having the DVAs alongside the block pointer (the checksum of the
pointed-to block, plus other metadata) makes reading that block fast:
no additional lookup is needed.
Post by Timothy Coalson
This seems obvious enough that it has probably been thought of before, but
if we indirected all filesystem blocks through an arbitrary identifier that
tries to keep locality of related information, we could remove DVAs from all
filesystem metadata blocks, making it much easier to move filesystem blocks
(basically, we wouldn't need bp_rewrite). I'm guessing the reason this
wasn't done has to do with the extra vdev reads associated with filesystem
reads that miss the cached portion of the lookup structure?
Content-addressed filesystems exist. Content-addressing roughly means
using a DDT not just at write-time, but at read-time as well, and
performance goes in the toilet.

I'm not advocating that ZFS become a content-addressed filesystem.
I'm proposing that thinking of it as such helps think through bp
re-write. For example, if a block appears to be corrupt, it might
just be the case that it recently got re-written, so go look it up
somewhere (a temporary, logical DDT-like structure, perhaps).

Nico
--
Jim Klimov
2013-05-01 23:24:44 UTC
Permalink
Post by Nico Williams
Post by Timothy Coalson
All data read from disk must be checksummed before it can be trusted. If
we aren't going to checksum the DVAs in that location, then we can't trust
them after reading them from disk, so what is the point of having them
there?
Let's say that the DVAs got corrupted. So you'll read the wrong
block. And then the checksum from the blkptr_t won't match. So
you're good. No harm results. (Although you'll want a
cryptographic-strength hash.)
Wouldn't this mean that if the DVAs are not checksummed, then a corrupt
one is not detectable? That is, we request a read of a stream of
sectors from some random location, find a checksum mismatch, and request
a repair of the sectors at that location using some random data... at
best, if the block was dittoed elsewhere (copies >= 2) then we have
another set of untrustworthy DVAs to read and validate against the data
checksum - and we then use that data to "repair" something at the random
location given by the corrupt DVA, quite likely overwriting valid data
belonging to some other block.

Is this a realistic breakage scenario, or did I miss something in your
thread (I did just skim over some paragraphs)? ;)

//Jim
Nico Williams
2013-05-01 23:28:48 UTC
Permalink
Post by Jim Klimov
Post by Nico Williams
Post by Timothy Coalson
All data read from disk must be checksummed before it can be trusted. If
we aren't going to checksum the DVAs in that location, then we can't trust
them after reading them from disk, so what is the point of having them
there?
Let's say that the DVAs got corrupted. So you'll read the wrong
block. And then the checksum from the blkptr_t won't match. So
you're good. No harm results. (Although you'll want a
cryptographic-strength hash.)
Wouldn't this mean that if the DVAs are not checksummed, then a corrupt
one is not detectable? That is, we are requesting to read a stream of
It's detectable: you read whatever is at the DVA (assuming it's a
valid DVA in the first place) and checksum it. If the checksum
doesn't match the one in the blkptr_t, and you either have nowhere to
look up the block by its cryptographic address (the checksum) or you
don't find it there, then the DVA is corrupt.
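In sketch form (read_at, checksum and lookup_by_checksum are placeholders,
not real ZFS interfaces):

    def read_block(dva, expected_sum, read_at, checksum, lookup_by_checksum):
        data = read_at(dva)
        if data is not None and checksum(data) == expected_sum:
            return data                   # DVA and data both check out
        # Mismatch: the data or the DVA is bad.  Fall back to the content
        # address (the checksum itself), if such a lookup structure exists.
        alt_dva = lookup_by_checksum(expected_sum)
        if alt_dva is not None:
            data = read_at(alt_dva)
            if data is not None and checksum(data) == expected_sum:
                return data               # the block had moved; the old DVA was stale
        raise IOError("corrupt DVA or unrecoverable block at %r" % (dva,))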
Post by Jim Klimov
sectors from some random location, find a checksum mismatch, request a
repair of sectors at that location using some random data... at best,
if the block was dittoed elsewhere (copies >= 2) then we have another
set of untrustworthy DVAs to read and validate against data checksum -
and use that data to "repair" something at a random location from the
corrupt DVA, actually likely overwriting valid data of some other block.
Repair is an issue: you have to make sure that you're not stepping on
bp re-write's toes. I think that can be done.
Post by Jim Klimov
Is this a realistic breakage scenario, or did I miss something in your
thread (I did just skim over some paragraphs)? ;)
You didn't miss anything.
Timothy Coalson
2013-05-01 23:37:28 UTC
Permalink
Post by Nico Williams
Post by Timothy Coalson
Post by Nico Williams
Of course, computing checksums without including DVAs would be
painful. It'd be better to not have DVAs in blkptr_t at all; instead
blocks that contain blkptr_t (e.g., indirect blocks) should have a
contiguous section that is an array of DVAs for the DVA-less blkptr_ts
in the rest of that block, which would make it easy to exclude that
section when computing block checksums. (The size of the DVA section
could be included in the blkptr_t for any block that contains
blkptr_ts, that way it's always easy to re-compute checksums for
validation.)
All data read from disk must be checksummed before it can be trusted. If
we aren't going to checksum the DVAs in that location, then we can't trust
them after reading them from disk, so what is the point of having them
there?
Let's say that the DVAs got corrupted. So you'll read the wrong
block. And then the checksum from the blkptr_t won't match. So
you're good. No harm results. (Although you'll want a
cryptographic-strength hash.)
And then it will try to reconstruct the "incorrect" block at that nonsense
DVA from parity or mirrors based on the correct checksum, fail, and report
an unrecoverable error in a nonsense block. It is important to know where
the error is, not just that there is one. Or do you know of some way to
recover from this gracefully?

Post by Nico Williams
All checksumming DVAs accomplishes is saving you some read I/Os on the
way to detecting corruption. Also, it makes the checksums depend on
block locations, which kinda sucks: vdev evacuation and
defragmentation would necessarily result in the pointers to dnode
blocks changing even though no content changed, which makes any notion
of exposing "file checksums" from the filesystem to the app...
pointless.
From what I understand of the on-disk structure, it would need to get the
checksums of all blocks of a single file, and recombine them to get a
file-specific checksum not changed by other unrelated metadata. Even then,
some checksum recombinations may change depending on how the data gets
split into blocks due to variable block sizes. I'm not sure how
applications would ask for this for a file, given that most filesystems
don't provide this. I'm not entirely sure of the merit of the
functionality, either.
Post by Nico Williams
But having the DVAs alongside the block pointer (the checksum of the
pointed-to block, plus other metadata) makes reading that block fast:
no additional lookup is needed.
Yes, as long as you can identify when they get corrupted, as opposed to the
block they point to being corrupted.
Post by Nico Williams
Post by Timothy Coalson
This seems obvious enough that it has probably been thought of before, but
if we indirected all filesystem blocks through an arbitrary identifier that
tries to keep locality of related information, we could remove DVAs from all
filesystem metadata blocks, making it much easier to move filesystem blocks
(basically, we wouldn't need bp_rewrite). I'm guessing the reason this
wasn't done has to do with the extra vdev reads associated with filesystem
reads that miss the cached portion of the lookup structure?
Content-addressed filesystems exist. Content-addressing roughly means
using a DDT not just at write-time, but at read-time as well, and
performance goes in the toilet.
This isn't a DDT; this is more like an inode in ext2/3/4 (at least, as I
understand it). The DDT is slow when it gets too big because there is zero
locality of reference, so partial caching is about as bad as not caching
at all. If you have all blocks of a file in a small cluster of
identifiers, say a linear range, you only need a small contiguous piece of
the lookup structure to be in memory for that file, and it may even contain
the lookups for other files written at close to the same time, which may
have been read just before, or be read just after, this file.

Post by Nico Williams
I'm not advocating that ZFS become a content-addressed filesystem.
I'm proposing that thinking of it as such helps think through bp
re-write. For example, if a block appears to be corrupt, it might
just be the case that it recently got re-written, so go look it up
somewhere (a temporary, logical DDT-like structure, perhaps).
Yes, it has merit as a way of handling the problem of untracked multiple
references. I just thought I'd mention something that I didn't see in
previous discussions of bp_rewrite.

Tim
Timothy Coalson
2013-05-04 00:06:46 UTC
Permalink
Answering myself after some reading up.
Post by Timothy Coalson
Post by Nico Williams
Let's say that the DVAs got corrupted. So you'll read the wrong
block. And then the checksum from the blkptr_t won't match. So
you're good. No harm results. (Although you'll want a
cryptographic-strength hash.)
And then it will try to reconstruct the "incorrect" block at that nonsense
DVA from parity or mirrors based on the correct checksum, fail, and report
an unrecoverable error in a nonsense block. It is important to know where
the error is, not just that there is one. Or do you know of some way to
recover from this gracefully?
Per your other email, yes, you do have a way. However, since it only
applies when dedup is used, the DVAs must still be checksummed when it
isn't (since they don't exist in any other location that we know to look
at), which means that whether the change provides any benefit depends on
the configuration. Personally, I would want anything exposed to
applications to be largely independent of configuration.

I'm not sure why you mentioned using a cryptographic hash: in the case of
a hash collision you should compare the possibly corrupt DVA to all
entries for that hash in the DDT, and if it doesn't match any of them, it
is corrupt. Manufacturing a lot of collisions on one hash and waiting for
a DVA corruption on one of them doesn't seem like a practical attack for a
slightly increased chance of misidentifying the DVA as correct. Could you
elaborate?
Post by Timothy Coalson
Post by Nico Williams
Post by Timothy Coalson
This seems obvious enough that it has probably been thought of before, but
if we indirected all filesystem blocks through an arbitrary identifier that
tries to keep locality of related information, we could remove DVAs from all
filesystem metadata blocks, making it much easier to move filesystem blocks
(basically, we wouldn't need bp_rewrite). I'm guessing the reason this
wasn't done has to do with the extra vdev reads associated with filesystem
reads that miss the cached portion of the lookup structure?
Content-addressed filesystems exist. Content-addressing roughly means
using a DDT not just at write-time, but at read-time as well, and
performance goes in the toilet.
This isn't a DDT; this is more like an inode in ext2/3/4 (at least, as I
understand it). The DDT is slow when it gets too big because there is zero
locality of reference, so partial caching is about as bad as not caching
at all. If you have all blocks of a file in a small cluster of
identifiers, say a linear range, you only need a small contiguous piece of
the lookup structure to be in memory for that file, and it may even contain
the lookups for other files written at close to the same time, which may
have been read just before, or be read just after, this file.
I was thinking that each file block was an object, which is incorrect (and
would have given each of them an identifier like the one I was imagining
anyway). Changing this would cause some limits on filesystem size to
shrink, and would do absolutely nothing for other object types, so it
really doesn't solve anything, and it makes other things worse.

Tim