Discussion:
Voluntary fsync
Nagy, Attila via illumos-zfs
2014-08-15 16:03:09 UTC
Permalink
Hi,

I'm not sure the name in the subject is right, but here's what I have in mind.
FSYNC(2) from FreeBSD says:
     The fsync() system call causes all modified data and attributes of fd to
     be moved to a permanent storage device.  This normally results in all in-
     core modified copies of buffers for the associated file to be written to
     a disk.

I would call it mandatory fsync, meaning that if I call fsync(fd), the OS
immediately starts writing dirty buffers to stable storage (the ZIL in
ZFS, possibly resulting in a double write eventually) and returns when it's done.

By voluntary fsync I mean a call that does not trigger a sync. Everything
works as it does today: ZFS collects the data to be written in memory and,
when the time comes, writes it out to the disks.
A voluntary fsync should block until this write happens, and only return
when all dirty buffers present up to the point it was called are safely
written (no matter where, to the ZIL or to their final place).

My use case:
I have some mail servers. The SMTP servers receive mails from the
internet from other SMTP servers. When the SMTP daemon receives a mail,
it has to fsync that in order to ensure that the mail is on the disk.
If 1000 mails come in within the same second, it would have to do 1000
fsyncs. Throughput collapses; SSDs are needed to overcome this.
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but
each of them would block until zfs writes the 1000 e-mails onto stable
storage (or something else triggers a txg switch).
If a ZFS txg is no longer than 1 second, each mail delivery will be
delayed by at most 1 second, but writing 1000 mails will only
trigger one txg flush, with far fewer IOPS needed.

Of course the program could be smart about this and manage all of it
itself (collecting incoming data into one file, delaying
acknowledgements and issuing just one fsync when it's needed), but that
would require a major rewrite of nearly all such software.

Having a voluntary fsync in ZFS is a lot easier; only the fsyncs
which can wait would have to be changed to "vfsync" and the rest would
be done by ZFS.
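
For illustration, roughly what the delivery path looks like today versus with
the proposal. vfsync() is the hypothetical call described above and does not
exist anywhere, so this is only a sketch of the intended usage:

    #include <unistd.h>

    /* Hypothetical prototype from the proposal; no such call exists today. */
    int vfsync(int fd);

    /* Today: each accepted message pays for its own synchronous commit. */
    int
    deliver_now(int fd, const void *msg, size_t len)
    {
            if (write(fd, msg, len) != (ssize_t)len)
                    return (-1);
            return (fsync(fd));     /* forces this one message onto stable storage */
    }

    /* Proposed: nothing is forced; the call just waits until the next regularly
     * scheduled txg commit covers the write, so 1000 concurrent deliveries
     * share a single flush. */
    int
    deliver_delayed(int fd, const void *msg, size_t len)
    {
            if (write(fd, msg, len) != (ssize_t)len)
                    return (-1);
            return (vfsync(fd));
    }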

What do you think?
Bob Friesenhahn via illumos-zfs
2014-08-16 01:30:16 UTC
Permalink
Having a voluntary fsync in zfs is a lot more easier, only the fsyncs which
can wait would have to be changed to "vfsync" and the rest would be done by
zfs.
What do you think?
I think that it is a very interesting idea.

I also wonder if this would cause unforeseen application-space
deadlocks due to programs which depend on each other getting stuck in
the writes.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Garrett D'Amore via illumos-zfs
2014-08-16 01:38:14 UTC
Permalink
Nah. You don't defer the sync forever, just until the next regularly
scheduled flush or txg commit. No problem.


On Fri, Aug 15, 2014 at 6:30 PM, Bob Friesenhahn via illumos-zfs wrote:
Post by Bob Friesenhahn via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Having a voluntary fsync in zfs is a lot more easier, only the fsyncs
which can wait would have to be changed to "vfsync" and the rest would be
done by zfs.
What do you think?
I think that it is a very interesting idea.
I also wonder if this would cause unforseen application-space deadlocks
due to programs which depend on each other getting stuck in the writes.
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Matthew Ahrens via illumos-zfs
2014-08-16 15:58:34 UTC
Permalink
Sounds reasonable. Could be implemented as a new value for the "logbias" property. It's kind of like an extreme form of logbias=throughput. "zfs set logbias=slow"? :-)

--matt
Post by Nagy, Attila via illumos-zfs
Hi,
I'm not sure the name in the subject is right, but here's what I think of.
The fsync() system call causes all modified data and attributes of fd to
be moved to a permanent storage device. This normally results in all in-
core modified copies of buffers for the associated file to be written to
a disk.
I would call it mandatory fsync, meaning if I call fsync(fd), the OS immediately starts to write dirty buffers onto stable storage (ZIL in zfs, possibly a double write eventually) and returns when it's done.
Under voluntary fsync I mean it will not trigger a sync. Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the internet from other SMTP servers. When the SMTP daemon receives a mail, it has to fsync that in order to ensure that the mail is on the disk.
If 1000 mails come in the same second, it would have to do 1000 fsyncs. No throughput, SSDs needed to overcome this.
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but each of them would block until zfs writes the 1000 e-mails onto stable storage (or something else triggers a txg switch).
If a zfs txg is no larger than 1 second, each mail delivery will be delayed with a maximum of 1 second, but writing 1000 mails will only trigger one txg flush, with much less IOPS needed.
Of course the program could be smart about that and manage all of this itself (collecting incoming data into one file, delaying acknowledgements and issue just one fsync when it's needed), but it would need a major rewrite in nearly all of these software.
Having a voluntary fsync in zfs is a lot more easier, only the fsyncs which can wait would have to be changed to "vfsync" and the rest would be done by zfs.
What do you think?
Nagy, Attila via illumos-zfs
2014-08-17 21:54:06 UTC
Permalink
Post by Matthew Ahrens via illumos-zfs
Sounds reasonable. Could be implemented as a new value for the "logbias" property. It's kind of like an extreme form of logbias=throughput. "zfs set logbias=slow"? :-)
I hope I was clear on what I would like to have (tried to summarise in
my previous e-mail).
If that sounds reasonable, I'm very happy, and I would be even happier if it's
so reasonable that somebody will implement it. :)
Do you see any complexity here, or is everything in place for it?

BTW, I would call it aggregated or delayed. If well tuned it will be
anything but slow.

And yes, making this a ZFS option is perhaps the easiest way to implement
it, because that way applications could use it easily (the
sysadmin could decide, no modification needed) and, with ZFS, it can be a
per-directory setting, which may be flexible enough.

Maybe the only problem here is that the txg timeout can only be set
globally. This could really rock if the maximum interval for which it
delays the physical disk writes could be set on the ZFS filesystem itself.

But for my use case even this is quite good.
Bob Friesenhahn via illumos-zfs
2014-08-17 23:44:48 UTC
Permalink
Post by Matthew Ahrens via illumos-zfs
Sounds reasonable. Could be implemented as a new value for the "logbias"
property. It's kind of like an extreme form of logbias=throughput. "zfs
set logbias=slow"? :-)
I hope I was clear on what I would like to have (tried to summarise in my
previous e-mail).
If that sounds reasonable, I'm very happy and would even happier if it's so
reasonable that somebody will implement it. :)
Do you see any complexity here, or everything is in place for it?
BTW, I would call it aggregated or delayed. If well tuned it will be anything
but slow.
It would be good to consider the possibly increased pool fragmentation
caused by producing transaction groups more often and the increased
amount of COW activity reaching the disk, resulting in larger
snapshots.

With normal ZFS behavior, quite a lot of writes don't even make it to
disk, since the write is overwritten by subsequent writes before the
next transaction group commits. This is easily demonstrated.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling via illumos-zfs
2014-08-17 02:01:37 UTC
Permalink
Post by Nagy, Attila via illumos-zfs
Hi,
I'm not sure the name in the subject is right, but here's what I think of.
The fsync() system call causes all modified data and attributes of fd to
be moved to a permanent storage device. This normally results in all in-
core modified copies of buffers for the associated file to be written to
a disk.
I would call it mandatory fsync, meaning if I call fsync(fd), the OS immediately starts to write dirty buffers onto stable storage (ZIL in zfs, possibly a double write eventually) and returns when it's done.
Are you sure your workload is doing what you think? You don't say what mailer you're
using, but be aware that there are also fsync-on-close semantics. If the mailer enqueues
each message as a separate file, then the workload becomes dominated by create-open-close.

There is a plethora of performance information on scaling mail servers out there. In
particular, the filebench mail workload models the old sendmail + NFS mounted mail
clients workload.
Post by Nagy, Attila via illumos-zfs
Under voluntary fsync I mean it will not trigger a sync. Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the internet from other SMTP servers. When the SMTP daemon receives a mail, it has to fsync that in order to ensure that the mail is on the disk.
MTAs work differently from client-side IMAP, POP, et al. workloads.
IIRC, the spec requires an MTA put received messages on persistent media before
acknowledging they are queued and closing the session.
Post by Nagy, Attila via illumos-zfs
If 1000 mails come in the same second, it would have to do 1000 fsyncs. No throughput, SSDs needed to overcome this.
You'll notice that the big boys don't do this with traditional POSIX file systems.

Have you taken performance measurements to compare against your proposed
solution? This can be relatively easy to set up as a demo project for experiments.
-- richard
Post by Nagy, Attila via illumos-zfs
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but each of them would block until zfs writes the 1000 e-mails onto stable storage (or something else triggers a txg switch).
If a zfs txg is no larger than 1 second, each mail delivery will be delayed with a maximum of 1 second, but writing 1000 mails will only trigger one txg flush, with much less IOPS needed.
Of course the program could be smart about that and manage all of this itself (collecting incoming data into one file, delaying acknowledgements and issue just one fsync when it's needed), but it would need a major rewrite in nearly all of these software.
Having a voluntary fsync in zfs is a lot more easier, only the fsyncs which can wait would have to be changed to "vfsync" and the rest would be done by zfs.
What do you think?
Nagy, Attila via illumos-zfs
2014-08-17 20:50:38 UTC
Permalink
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Under voluntary fsync I mean it will not trigger a sync. Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the internet from other SMTP servers. When the SMTP daemon receives a mail, it has to fsync that in order to ensure that the mail is on the disk.
MTAs work differently than the client-side IMAP, POP, et.al. workloads.
Sure, but how's that relevant here?
Post by Richard Elling via illumos-zfs
IIRC, the spec requires an MTA put received messages on persistent media before
acknowledging they are queued and closing the session.
Sure, that's where this voluntary (or perhaps better: delayed; let's call
it that, it may be clearer) fsync could help.
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
If 1000 mails come in the same second, it would have to do 1000 fsyncs. No throughput, SSDs needed to overcome this.
You'll notice that the big boys don't do this with traditional POSIX file systems.
Even the big boys must ensure that the data is safe. Safe could mean
it's on one or two local disks, or on two or more local/remote servers
(and/or their disks, but some big boys don't care about that :).
But again: what does this have to do with the proposed approach?
Post by Richard Elling via illumos-zfs
Have you taken performance measurements to compare against your proposed
solution? This can be relatively easy to setup as a demo project for experiments.
Sure.
Opening a file, appending 10000 * 100 kiB chunks (let's call these mails) and
calling fsync on each write runs for 186.53 seconds on my notebook with
a slow-as-hell disk and ZFS.
Calling fsync only once, after the loop has ended, runs for 1.9 seconds.
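
The test was roughly the following (a reconstruction; the exact code wasn't
posted, so the file name and error handling are my own; run with "each" to
fsync after every write, or with no arguments to fsync once at the end):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            int per_write = (argc > 1 && strcmp(argv[1], "each") == 0);
            static char buf[100 * 1024];    /* one 100 kiB "mail" */
            struct timespec t0, t1;
            int fd, i;

            fd = open("mail.test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }
            (void) clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < 10000; i++) {
                    if (write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf)) {
                            perror("write");
                            return (1);
                    }
                    if (per_write && fsync(fd) == -1) {
                            perror("fsync");
                            return (1);
                    }
            }
            if (!per_write && fsync(fd) == -1) {
                    perror("fsync");
                    return (1);
            }
            (void) clock_gettime(CLOCK_MONOTONIC, &t1);
            (void) printf("%.2f s\n", (double)(t1.tv_sec - t0.tv_sec) +
                (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
            (void) close(fd);
            return (0);
    }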

If ZFS flushes its write cache every 2 seconds and I tried to deliver
e-mails on this machine, I could see a ~100x speedup with the proposed
delayed fsync compared to the current fsync:
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet all e-mails would be
as safe as they are today.

Impressive, no?
Richard Elling via illumos-zfs
2014-08-18 00:15:53 UTC
Permalink
[re-reading the full thread]

NB: I don't recall the old BSD man page specs, but in illumos, system calls are section 2
and library calls are section 3.
Post by Nagy, Attila via illumos-zfs
Under voluntary fsync I mean it will not trigger a sync.
fsync(3c) does not trigger a sync(2)
Post by Nagy, Attila via illumos-zfs
Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
This is how fsync(3c) works today. Dirty buffers not already committed are written to the ZIL.
Post by Nagy, Attila via illumos-zfs
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Under voluntary fsync I mean it will not trigger a sync. Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the internet from other SMTP servers. When the SMTP daemon receives a mail, it has to fsync that in order to ensure that the mail is on the disk.
MTAs work differently than the client-side IMAP, POP, et.al. workloads.
Sure, but how's that relevant here?
A high-volume MTA will remain a difficult challenge since there will be many open sessions
and buffered I/O waiting for the txg. A well-designed MTA can deal with it.
Post by Nagy, Attila via illumos-zfs
Post by Richard Elling via illumos-zfs
IIRC, the spec requires an MTA put received messages on persistent media before
acknowledging they are queued and closing the session.
Sure, that's where this voluntary (or maybe better: delayed, let's call it this way, maybe it's more clear) fsync could help.
If I understand the variations of the proposal correctly, a userland thread calling fsync() will
block waiting for the txg commit. Though we try to commit in zfs_txg_timeout seconds, it is not
unusual for a busy system to see txg_commits that take longer. Though it isn't quite as bad
with the new write throttle, it is still possible for long commits. With the bad, old write throttle on
a busy system you can routinely see txg commits on the order of 4x to 10x zfs_txg_timeout.
To add insult to injury, dropping zfs_txg_timeout to low values doesn't help much... I've seen
10 second spa syncs with zfs_txg_timeout = 1 on bad, old write throttle systems. I think this
is why Matt proposed "logbias=slow" with a smiley :-)

We take this moment to commiserate with our Solaris brethren.

The moment is up. Moving along...
Post by Nagy, Attila via illumos-zfs
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
If 1000 mails come in the same second, it would have to do 1000 fsyncs. No throughput, SSDs needed to overcome this.
You'll notice that the big boys don't do this with traditional POSIX file systems.
Even the big boys must ensure that the data is safe. Safe could mean it's on one or two local disks, or on two or more local/remote servers (and/or their disks, but some big boys don't care about that :).
But again: what this has to do with the proposed approach?
Post by Richard Elling via illumos-zfs
Have you taken performance measurements to compare against your proposed
solution? This can be relatively easy to setup as a demo project for experiments.
Sure.
Opening a file, appending 10000*100 kiB (let's call these mails) and calling fsync on each write runs for 186.53 seconds on my notebook with a slow as hell disk and zfs.
Calling fsync only once, after the loop ended runs for 1.9 secs.
It hurts when you do that. Which is why developers interested in both speed and
correctness don't do that.
Post by Nagy, Attila via illumos-zfs
If zfs flushes its write cache every 2 seconds, and I would try to deliver e-mails on this machine, I could see a ~100x speedup with the proposed, delayed fsync against the current fsync.
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet, all e-mails would be safe, as today.
Impressive, no?
AIUI, what we are really talking about here is a callback mechanism that says writes
up to a barrier are now on persistent media. Developers have been doing this by hand
for decades. But your proposal is to have the barrier scheduled by the file system and
not by the app. So that leaves at least three possibilities:

1. Filesystem-level policy setting to cause fsync() to block until txg commit completes.

2. New open(2) option, perhaps something like O_NILSYNC for Need-it-later SYNC
semantics. As above, fsync() blocks until txg commit, but you're not paying the
O_DSYNC penalty for every write and writes don't go to the ZIL.

3. Filesystem-level policy to put all writes in the ZIL, but don't block. Sorta like an
opportunistic ZIL fill, but without the block-everything behavior implemented by sync=always.

History with options like #1 (see zfs_nocacheflush) shows that people will abuse them
and then complain about data loss or poor performance (unintended consequences).

History with options like #2 shows developers to be late adopters.

Methinks #3 is only feasible with a decent slog or nonvolatile cache, but probably isn't
much different in practice than O_DSYNC or sync=always.
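
To make option #2 concrete, a sketch of what the application side might look
like; O_NILSYNC is only the name proposed above, it does not exist in any
open(2) today, and the flag value below is made up purely so the sketch compiles:

    #include <fcntl.h>
    #include <unistd.h>

    /* Proposed flag, not a real one; value chosen only for illustration. */
    #define O_NILSYNC       0x10000000

    void
    enqueue_message(const char *path, const void *msg, size_t len)
    {
            /* Writes stay asynchronous and skip the ZIL... */
            int fd = open(path, O_WRONLY | O_CREAT | O_NILSYNC, 0600);

            (void) write(fd, msg, len);

            /* ...and fsync() on such a descriptor would merely wait for the
             * next regular txg commit instead of forcing one. */
            (void) fsync(fd);
            (void) close(fd);
    }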


Finally, Matt's comment about logbias could enter into the picture too. If you don't have
a slog, then the default threshold for latency vs throughput is a write of 32k size. You won't
hear us talk about it much, because it is rarely the case that this comes into play when
it is so easy for most folks to add a slog. If you're testing on an HDD system with no slog,
be sure to experiment with logbias. For a queueing system, like an MTA, on an HDD-only
system it might be a best practice to set logbias=throughput.
-- richard
Nagy, Attila via illumos-zfs
2014-08-18 20:10:39 UTC
Permalink
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Under voluntary fsync I mean it will not trigger a sync.
fsync(3c) does not trigger a sync(2)
Should I have written (quoting from the BSD fsync(2) man page): under
voluntary fsync I mean it will not trigger "all modified data and
attributes of fd to be moved to a permanent storage device"?
I hoped that "sync" would be clear.
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Everything works as today, zfs collects the to be written data in memory and when time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns when all dirty buffers up to the point, it's called are safely written (no matter where, into the ZIL or its final place).
This is how fsync(3c) works today. Dirty buffers not already committed are written to the ZIL.
No, today's fsync triggers "dirty buffers not already committed are
written to the ZIL", while the proposed one just waits for that (the
write to stable storage) to happen.
Post by Richard Elling via illumos-zfs
If I understand the variations of the proposal correctly, a userland thread calling fsync() will
block waiting for the txg commit. Though we try to commit in zfs_txg_timeout seconds, it is not
unusual for a busy system to see txg_commits that take longer. Though it isn't quite as bad
with the new write throttle, it is still possible for long commits. With the bad, old write throttle on
a busy system you can routinely see txg commits on the order of 4x to 10x zfs_txg_timeout.
To add insult to injury, dropping zfs_txg_timeout to low values doesn't help much... I've seen
10 second spa syncs with zfs_txg_timeout = 1 on bad, old write throttle systems. I think this
is why Matt proposed "logbias=slow" with a smiley :-)
I know it's not guaranteed. BTW, syncing on each file adds to this
busyness too.
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Sure.
Opening a file, appending 10000*100 kiB (let's call these mails) and calling fsync on each write runs for 186.53 seconds on my notebook with a slow as hell disk and zfs.
Calling fsync only once, after the loop ended runs for 1.9 secs.
It hurts when you do that. Which is why developers interested in both speed and
correctness don't do that.
Could you please elaborate?
Post by Richard Elling via illumos-zfs
Post by Nagy, Attila via illumos-zfs
If zfs flushes its write cache every 2 seconds, and I would try to deliver e-mails on this machine, I could see a ~100x speedup with the proposed, delayed fsync against the current fsync.
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet, all e-mails would be safe, as today.
Impressive, no?
AIUI, what we are really talking about here is a callback mechanism that says writes
up to a barrier are now on persistent media. Developers have been doing this by hand
for decades. But your proposal is to have the barrier scheduled by the file system and
Yes. BTW, I would be fine with transactional capabilities too (so you,
the app, could decide what to commit and when), but I guess that's a
harder topic. :)
Post by Richard Elling via illumos-zfs
1. Filesystem-level policy setting to cause fsync() to block until txg commit completes.
Which would be best with per-zfs txg timeout settings.
Post by Richard Elling via illumos-zfs
2. New open(2) option, perhaps something like O_NILSYNC for Need-it-later SYNC
semantics. As above, fsync() blocks until txg commit, but you're not paying the
O_DSYNC penalty for every write and writes don't go to the ZIL.
Or maybe a timeout option to fsync, which hints to the kernel how
much time the application is willing to wait for the "sync" to start.
Fajar A. Nugraha via illumos-zfs
2014-08-18 05:09:35 UTC
Permalink
On Mon, Aug 18, 2014 at 3:50 AM, Nagy, Attila via illumos-zfs wrote:
Post by Nagy, Attila via illumos-zfs
Post by Richard Elling via illumos-zfs
Have you taken performance measurements to compare against your proposed
solution? This can be relatively easy to setup as a demo project for experiments.
Sure.
Opening a file, appending 10000*100 kiB (let's call these mails) and
calling fsync on each write runs for 186.53 seconds on my notebook with a
slow as hell disk and zfs.
Calling fsync only once, after the loop ended runs for 1.9 secs.
If zfs flushes its write cache every 2 seconds, and I would try to deliver
e-mails on this machine, I could see a ~100x speedup with the proposed,
delayed fsync against the current fsync.
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet, all e-mails would be
safe, as today.
Impressive, no?
The emails won't be "safe". There's a chance that (in your example)
2 seconds' worth of emails can be lost in the event of power loss, due to
the fact that the data is still in memory and not yet committed to disk. If you
can live with that, setting "zfs set sync=disabled" on the fs and setting
the module parameter zfs_txg_timeout=2 should do what you want.

As a comparison, this is similar to setting
innodb_flush_log_at_trx_commit=2 in MySQL (
http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit).
You can lose up to 1 second's worth of transactions.
--
Fajar



Chris Siebenmann via illumos-zfs
2014-08-18 05:59:59 UTC
Permalink
Post by Fajar A. Nugraha via illumos-zfs
Post by Nagy, Attila via illumos-zfs
If zfs flushes its write cache every 2 seconds, and I would try to deliver
e-mails on this machine, I could see a ~100x speedup with the proposed,
delayed fsync against the current fsync.
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet, all e-mails would be
safe, as today.
Impressive, no?
The emails won't be "safe". There's a chance that (in your example)
2-seconds worth of emails can be lost in the event of power loss,
due to the fact the data is still in memory, and not yet commited to
disk. If you can live with that, setting "zfs set sync=disabled" on
the fs and setting the module parameter zfs_txg_timeout=2 should do
what you want.
As I understand the proposal, the emails would be perfectly safe
in that the local MTA would not have acknowledged them until the
transaction commits. If the machine crashes before then, it is no
different than the machine crashing while the message was being
received; the remote mailer will retry.

Here's another way to look at it. Right now, traditional Unix provides
two disk durability options: your data is completely at risk for some
arbitrary amount of time (plain normal writes) and 'write my data
to disk now' (fsync()). The original proposal is for a third option,
'notify me when my data is durably on the disk' (as proposed, it would be
signaled by sync() or fsync() returning). I personally think that this is not a
crazy option to want, and it certainly fills a middle ground between low
latency to the application (i.e. the current fsync()) and complete indifference.
You could not use it in a latency sensitive application, but then an MTA
is generally not latency sensitive (partly because there are so many other
sources of latency when moving mail from one machine to another via SMTP).

- cks
Fajar A. Nugraha via illumos-zfs
2014-08-18 06:10:27 UTC
Permalink
Post by Chris Siebenmann via illumos-zfs
Post by Fajar A. Nugraha via illumos-zfs
Post by Nagy, Attila via illumos-zfs
If zfs flushes its write cache every 2 seconds, and I would try to deliver
e-mails on this machine, I could see a ~100x speedup with the proposed,
delayed fsync against the current fsync.
53.6 e-mails/sec versus 5255.9 e-mails/sec. Yet, all e-mails would be
safe, as today.
Impressive, no?
The emails won't be "safe". There's a chance that (in your example)
2-seconds worth of emails can be lost in the event of power loss,
due to the fact the data is still in memory, and not yet commited to
disk. If you can live with that, setting "zfs set sync=disabled" on
the fs and setting the module parameter zfs_txg_timeout=2 should do
what you want.
As I understand the proposal, the emails would be perfectly safe
in that the local MTA would not have acknowledged them until the
transaction commits. If the machine crashes before then, it is no
different than the machine crashing while the message was being
received; the remote mailer will retry.
Here's another way to look at it. Right now, traditional Unix provides
two disk durability options: your data is completely at risk for some
arbitrary amount of time (plain normal writes) and 'write my data
to disk now' (fsync()). The original proposal is for a third option,
'notify me when my data is durably on the disk' (as proposed it would be
from sync() or fsync() returning). I personally think that this is not a
crazy option to want and it certainly fills a middle ground between low
latency to the application (ie current fsync()) and complete indifference.
You could not use it in a latency sensitive application, but then an MTA
is generally not latency sensitive (partly because there are so many other
sources of latency when moving mail from one machine to another via SMTP).
Ah, OK. So it's "block for (at most) 2 seconds until the data is synced to
the disk (which must occur every 2 seconds)?"

Off the top of my head, it would greatly increase the number of open
connections (e.g. about 11k concurrent connections, following the OP's
example). That would mean more resource usage (e.g. memory).

Assuming the MTA only syncs after all the data is received (i.e. when it
finishes writing the spool file), then writing something similar
to "eatmydata" (which preloads an override for fsync and friends) would probably
work. The difference would be that instead of disabling the call, it replaces
it with "sleep 3" (or whatever amount of time it takes to guarantee a full
sync to disk). Combine this with "zfs set sync=disabled" and
zfs_txg_timeout=2, and it could probably work using (mostly) what you already
have today, without having to write new code in ZFS. Only minimal
customization of eatmydata is needed.
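
A minimal sketch of such an override (my own illustration, untested; the
3-second sleep is the figure suggested above and only approximates "the txg
covering my writes has committed", it is not a guarantee):

    /*
     * delayfsync.c: eatmydata-style preload that turns fsync()/fdatasync()
     * into a fixed sleep instead of a forced commit. Intended to be combined
     * with sync=disabled and a short zfs_txg_timeout as described above.
     *
     * Build and use, roughly:
     *   cc -shared -fPIC -o libdelayfsync.so delayfsync.c
     *   LD_PRELOAD=./libdelayfsync.so your-mta ...
     */
    #include <unistd.h>

    int
    fsync(int fd)
    {
            (void) fd;
            (void) sleep(3);
            return (0);
    }

    int
    fdatasync(int fd)
    {
            (void) fd;
            (void) sleep(3);
            return (0);
    }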
--
Fajar



Richard Yao via illumos-zfs
2014-08-17 03:45:48 UTC
Permalink
Post by Nagy, Attila via illumos-zfs
Hi,
I'm not sure the name in the subject is right, but here's what I think of.
The fsync() system call causes all modified data and attributes of fd to
be moved to a permanent storage device. This normally results in all in-
core modified copies of buffers for the associated file to be written to
a disk.
I would call it mandatory fsync, meaning if I call fsync(fd), the OS
immediately starts to write dirty buffers onto stable storage (ZIL in
zfs, possibly a double write eventually) and returns when it's done.
Under voluntary fsync I mean it will not trigger a sync. Everything
works as today, zfs collects the to be written data in memory and when
time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns
when all dirty buffers up to the point, it's called are safely written
(no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the
internet from other SMTP servers. When the SMTP daemon receives a mail,
it has to fsync that in order to ensure that the mail is on the disk.
If 1000 mails come in the same second, it would have to do 1000 fsyncs.
No throughput, SSDs needed to overcome this.
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but
each of them would block until zfs writes the 1000 e-mails onto stable
storage (or something else triggers a txg switch).
If a zfs txg is no larger than 1 second, each mail delivery will be
delayed with a maximum of 1 second, but writing 1000 mails will only
trigger one txg flush, with much less IOPS needed.
I think you might be confusing terminology. Try watching the output of
`zdb -C poolname`. If you have the GNU watch utility available, you
should be able to do `watch -n5 zdb -C poolname`. Then try running your
workload or even running `sync` several times with some writes
interleaved. You should see that the txg commit only occurs on the
interval specified. I believe that sync=always should disable that and
force us to do the txg commit sooner.

Anyway, the ZIL is the reason that we do not need to actually do a full
transaction group commit. We will still flush, but the flushes do not
slow us down much. That being said, we could accomplish what you want by
marking the writes involved in zil_commit() with "Force Unit
Access". That way we could avoid a full flush and just txg_wait() for
everything to be on stable storage.

A possible downside to this is that there exists hardware that does not
obey Forced Unit Access. IDE drives do not support it and I am not sure
we can expect USB drives to honor it either. There is a claim online
that Windows does not use it with SATA disks because SATA drives exist
that do not support it either:

http://workinghardinit.wordpress.com/tag/forced-unit-access/

That being said, I imagine that we could implement support for Forced
Unit Access zil_commit() operations that is off by default and enabled
via mdb, a module parameter or however your platform handles these
options. Another option is to add another setting to logbias as Matt
suggested. How does some implementation of FUA sound to you?
Post by Nagy, Attila via illumos-zfs
Of course the program could be smart about that and manage all of this
itself (collecting incoming data into one file, delaying
acknowledgements and issue just one fsync when it's needed), but it
would need a major rewrite in nearly all of these software.
Having a voluntary fsync in zfs is a lot more easier, only the fsyncs
which can wait would have to be changed to "vfsync" and the rest would
be done by zfs.
What do you think?
Richard Yao via illumos-zfs
2014-08-17 03:53:42 UTC
Permalink
Post by Richard Yao via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Hi,
I'm not sure the name in the subject is right, but here's what I think of.
The fsync() system call causes all modified data and attributes of fd to
be moved to a permanent storage device. This normally results in all in-
core modified copies of buffers for the associated file to be written to
a disk.
I would call it mandatory fsync, meaning if I call fsync(fd), the OS
immediately starts to write dirty buffers onto stable storage (ZIL in
zfs, possibly a double write eventually) and returns when it's done.
Under voluntary fsync I mean it will not trigger a sync. Everything
works as today, zfs collects the to be written data in memory and when
time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns
when all dirty buffers up to the point, it's called are safely written
(no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the
internet from other SMTP servers. When the SMTP daemon receives a mail,
it has to fsync that in order to ensure that the mail is on the disk.
If 1000 mails come in the same second, it would have to do 1000 fsyncs.
No throughput, SSDs needed to overcome this.
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but
each of them would block until zfs writes the 1000 e-mails onto stable
storage (or something else triggers a txg switch).
If a zfs txg is no larger than 1 second, each mail delivery will be
delayed with a maximum of 1 second, but writing 1000 mails will only
trigger one txg flush, with much less IOPS needed.
I think you might be confusing terminology. Try watching the output of
`zdb -C poolname`. If you have the GNU watch utility available, you
should be able to do `watch -n5 zdb -C poolname`. Then try running your
workload or even running `sync` several times with some writes
interleaved. You should see that the txg commit only occurs on the
interval specified. I believe that sync=always should disable that and
force us to do the txg commit sooner.
Anyway, ZIL is the reason that we do not need to actually do a full
transaction group commit. We will still flush, but the flushes do not
slow us down much. That being said, we could accomplish what you want by
relying on marking writes involved in zil_commit with "Force Unit
Access". That way we could avoid a full flush and just txg_wait() for
everything to be on stable storage.
A possible downside to this is that there exists hardware that does not
obey forced unit access. IDE drives do not support it and I am not sure
we can expect USB drives to obtain it either. There is a claim online
that Windows does not use it with SATA disks because SATA drives exist
http://workinghardinit.wordpress.com/tag/forced-unit-access/
That being said, I imagine that we could implement support for Forced
Unit Access zil_commit() operations that is off by default and enabled
via mdb, a module parameter or however your platform handles these
options. Another option is to add another setting to logbias as Matt
suggested. How does some implementation of FUA sound to you?
On second thought, we could limit this functionality to SLOG devices
unless an in-kernel variable is changed via mdb, a module parameter, or
some other method. I believe that all SLOG devices that anyone would
want to use should support this, so it would be safe and SLOG devices
would see a benefit from increased parallelism.

By the way, Nagy, are you using a SLOG device right now?




Simon Casady via illumos-zfs
2014-08-17 12:18:50 UTC
Permalink
The QNX 4 file system allowed you to set the file system to sync, or to set a
maximum wait before the file system synced. A wait of one second was hundreds
of times faster than a full sync. With this sort of capability your "vfsync"
would just be a one-second sleep.
Perhaps there could be a property for the maximum time between txg commits.


On Sat, Aug 16, 2014 at 10:53 PM, Richard Yao via illumos-zfs wrote:
Post by Richard Yao via illumos-zfs
Post by Richard Yao via illumos-zfs
Post by Nagy, Attila via illumos-zfs
Hi,
I'm not sure the name in the subject is right, but here's what I think of.
The fsync() system call causes all modified data and attributes of fd to
be moved to a permanent storage device. This normally results in all in-
core modified copies of buffers for the associated file to be written to
a disk.
I would call it mandatory fsync, meaning if I call fsync(fd), the OS
immediately starts to write dirty buffers onto stable storage (ZIL in
zfs, possibly a double write eventually) and returns when it's done.
Under voluntary fsync I mean it will not trigger a sync. Everything
works as today, zfs collects the to be written data in memory and when
time has come, it writes them onto the disks.
Voluntary fsync should block until this write happens, and only returns
when all dirty buffers up to the point, it's called are safely written
(no matter where, into the ZIL or its final place).
I have some mail servers. The SMTP servers receive mails from the
internet from other SMTP servers. When the SMTP daemon receives a mail,
it has to fsync that in order to ensure that the mail is on the disk.
If 1000 mails come in the same second, it would have to do 1000 fsyncs.
No throughput, SSDs needed to overcome this.
With a voluntary fsync, the server would issue 1000 (v)fsyncs too, but
each of them would block until zfs writes the 1000 e-mails onto stable
storage (or something else triggers a txg switch).
If a zfs txg is no larger than 1 second, each mail delivery will be
delayed with a maximum of 1 second, but writing 1000 mails will only
trigger one txg flush, with much less IOPS needed.
I think you might be confusing terminology. Try watching the output of
`zdb -C poolname`. If you have the GNU watch utility available, you
should be able to do `watch -n5 zdb -C poolname`. Then try running your
workload or even running `sync` several times with some writes
interleaved. You should see that the txg commit only occurs on the
interval specified. I believe that sync=always should disable that and
force us to do the txg commit sooner.
Anyway, ZIL is the reason that we do not need to actually do a full
transaction group commit. We will still flush, but the flushes do not
slow us down much. That being said, we could accomplish what you want by
relying on marking writes involved in zil_commit with "Force Unit
Access". That way we could avoid a full flush and just txg_wait() for
everything to be on stable storage.
A possible downside to this is that there exists hardware that does not
obey forced unit access. IDE drives do not support it and I am not sure
we can expect USB drives to obtain it either. There is a claim online
that Windows does not use it with SATA disks because SATA drives exist
http://workinghardinit.wordpress.com/tag/forced-unit-access/
That being said, I imagine that we could implement support for Forced
Unit Access zil_commit() operations that is off by default and enabled
via mdb, a module parameter or however your platform handles these
options. Another option is to add another setting to logbias as Matt
suggested. How does some implementation of FUA sound to you?
On second thought, we could limit this functionality to SLOG devices
unless an in-kernel variable is changed via mdb, a module parameter, or
some other method. I believe that all SLOG devices that anyone would
want to use should support this, so it would be safe and SLOG devices
would see a benefit from increased parallelism.
By the way, Nagy, are you using a SLOG device right now?
Richard Yao via illumos-zfs
2014-08-17 14:54:10 UTC
Permalink
Post by Simon Casady via illumos-zfs
The QNX 4 file system allowed you to set the file system to sync or to set max wait before the file system synced. A wait of one second was hundreds of times faster than full sync. With this sort of capability your "vfsync" would just be a one second sleep.
Perhaps there could be a property for max time between txg commits.
zfs_txg_timeout does this. You can adjust it with mdb on Illumos.
Nagy, Attila via illumos-zfs
2014-08-17 22:00:57 UTC
Permalink
Post by Richard Yao via illumos-zfs
On second thought, we could limit this functionality to SLOG devices
unless an in-kernel variable is changed via mdb, a module parameter, or
some other method. I believe that all SLOG devices that anyone would
want to use should support this, so it would be safe and SLOG devices
would see a benefit from increased parallelism.
I'm not sure we are talking about the same thing. :)
Post by Richard Yao via illumos-zfs
By the way, Nagy, are you using a SLOG device right now?
(Nagy is my surname, in Hungary it comes first, like in Japan)
We use ZFS for a lot of things. There are places where we use SLOG
devices and there are many places where we don't.
To alleviate the described problem we mostly use B/FBWC modules
(battery/flash-backed write cache), which are much cheaper than a
good SLOG device.
Nagy, Attila via illumos-zfs
2014-08-17 21:43:48 UTC
Permalink
Post by Richard Yao via illumos-zfs
I think you might be confusing terminology. Try watching the output of
`zdb -C poolname`. If you have the GNU watch utility available, you
Maybe you were thinking of -s, or the zdb in FreeBSD differs from yours?
Post by Richard Yao via illumos-zfs
should be able to do `watch -n5 zdb -C poolname`. Then try running your
workload or even running `sync` several times with some writes
interleaved. You should see that the txg commit only occurs on the
interval specified. I believe that sync=always should disable that and
force us to do the txg commit sooner.
After reading
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs, it seems
you are right; thanks for correcting me.
BTW, I'm sure ZFS worked this way previously, because I've had major
problems with that.
A quick Google search found
https://blogs.oracle.com/roch/entry/the_dynamics_of_zfs, which says I'm
not stupid, I just have obsolete information in my mind. :)
Post by Richard Yao via illumos-zfs
Anyway, ZIL is the reason that we do not need to actually do a full
transaction group commit. We will still flush, but the flushes do not
slow us down much. That being said, we could accomplish what you want by
relying on marking writes involved in zil_commit with "Force Unit
Access". That way we could avoid a full flush and just txg_wait() for
everything to be on stable storage.
A possible downside to this is that there exists hardware that does not
obey forced unit access. IDE drives do not support it and I am not sure
we can expect USB drives to obtain it either. There is a claim online
that Windows does not use it with SATA disks because SATA drives exist
http://workinghardinit.wordpress.com/tag/forced-unit-access/
That being said, I imagine that we could implement support for Forced
Unit Access zil_commit() operations that is off by default and enabled
via mdb, a module parameter or however your platform handles these
options. Another option is to add another setting to logbias as Matt
suggested. How does some implementation of FUA sound to you?
Let me try to summarise in plain English (let's call this delayed fsync;
maybe that's clearer than the subject line):
fsync: please commit the data I've written so far to stable storage NOW
and return when ready.
dfsync: please return when the data I've written so far is committed to
stable storage (but do not initiate the commit).

(It doesn't matter whether this is implemented as a ZFS property, so that all
fsyncs on that filesystem behave this way, or as a new syscall.)
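
As a compact restatement of the proposed contract, the existing call next to
the hypothetical one (dfsync() does not exist anywhere today):

    int fsync(int fd);   /* start committing fd's dirty data NOW; return when durable  */
    int dfsync(int fd);  /* start nothing; return once the next regularly scheduled    */
                         /* txg commit has made all writes issued so far to fd durable */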

I'm not sure how FUA relates to this, could you please explain this?