Post by Sašo Kiselkov
Post by Etienne Dechamps
In theory, you're right. Problem is, SCSI doesn't work that way
(probably because it's primarily designed for local targets, not remote
ones). It is a very well-specified protocol with very clear semantics
with respect to link loss, reconnects, logins and the like. Write cache
loss is not one of them. According to the specs, the initiator can very
well assume that the write cache is still there after a reconnect. Any
target that doesn't behave that way is breaking SCSI specs.
Can you please point me to the spec where it says that? Because if that
is true, the spec must have been written by an idiot. "Local" is
meaningless - the target can very well have been rebooted or power
cycled between link resets (remember hot swap?).
Post by Etienne Dechamps
Philosophically, when a link to a target goes down, SCSI is optimistic
and assumes that only the link went down, not the target itself.
http://youtu.be/6F9bscdqRpo
I investigated this problem and dug into the T10 specs a year ago. I
don't remember the exact references, but if I'm not mistaken, it was a
consequence of the SCSI layering (I'm talking about
http://www.t10.org/scsi-3.htm), in which transient failures in the lower
layers (e.g. iSCSI, at the bottom of the stack) are transparent with
regard to the upper layers (e.g. SBC, at the top of the stack, which
handles write caching). IIRC (again), it comes down to the write cache
being a state of the logical unit, not the I_T nexus. Something like
that. I really don't feel like going through these hundred pages again
just for the sake of this argument.
To clarify: I'm not really saying that the target doesn't have any way
of notifying the initiator that a power cycle has occurred and that the
write cache is lost. It does (though I've never seen any iSCSI
implementation do that, which is why this is dangerous by default). The
thing is, if you do send that notification, then you're crashing your
VMs because OSes consider that an irrecoverable error. That's what
happens if you power cycle a non-redundant local disk while the OS is
using it: it tends not to like that. At all.
Crashing your VMs is probably better than corrupting data, but again,
that's not how I've seen iSCSI targets behave by default.
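For what it's worth, the notification mechanism I'm alluding to is a unit
attention: per SPC, a target power cycle surfaces on the next command as a
CHECK CONDITION with sense key UNIT ATTENTION (0x6) and ASC 0x29 ("power on,
reset, or bus device reset occurred"). A minimal sketch of how an initiator
could check for that (fixed-format sense layout; the function name is mine,
not from any real initiator):

```python
# Sketch: detect a target power cycle from SCSI fixed-format sense data.
# Per SPC: sense key 0x6 = UNIT ATTENTION, ASC 0x29 = "power on, reset,
# or bus device reset occurred". Function name is illustrative.

UNIT_ATTENTION = 0x6
ASC_POWER_ON_RESET = 0x29

def target_was_power_cycled(sense: bytes) -> bool:
    """Fixed-format sense: sense key in byte 2 (low nibble), ASC in byte 12."""
    if len(sense) < 13:
        return False
    sense_key = sense[2] & 0x0F
    asc = sense[12]
    return sense_key == UNIT_ATTENTION and asc == ASC_POWER_ON_RESET

# Example fixed-format sense buffer for a power-on unit attention:
sense = bytes([0x70, 0, 0x06, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0x29, 0x00])
print(target_was_power_cycled(sense))  # True
```

The point stands either way: reporting the unit attention faithfully just
moves the failure from silent corruption to a hard I/O error upstream.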
Post by Sašo Kiselkov
Post by Etienne Dechamps
What you can do, however, is make the target fail the cache flush in
that case, but with most initiator software that will result in
inevitable meltdown. For example, on Linux this will trigger an
irrecoverable disk I/O error which will typically cause a read-only
remount. That makes sense because in that case Linux has no choice: it
cannot reissue the writes because it doesn't have them in memory
anymore, and there's no userland API to notify the applications that
their writes are lost, so it panics and bails out.
WTF? What journaled filesystem removes disk blocks from memory right
after writing but before it has successfully committed the write? That
would seem to make the journal almost meaningless.
You're assuming that journaling filesystems will always keep the whole
data blocks in the journal. That's not necessarily the case. For
example, if my understanding of ext4 is correct (disclaimer: I have not
actually checked this), then when using the data=ordered option (which
is the default), ext4 will not write data blocks to the journal. Instead
it will do the following:
1. Write the data block to its final position (not the journal).
2. Flush to make sure it's on stable storage.
3. If the data was appended to a file, write the metadata change to the
journal (with a pointer to the new block).
4. Flush to make sure the new journal entry (if there is one) is on
stable storage.
5. Notify the upper layer (i.e. the application) that the sync() is
successful.
In fact, I believe that if the write does not change the file size, the
journal is not used at all. It's just a single flush. There's nothing
wrong with that. It certainly doesn't break sync() semantics.
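To make the ordering concrete, here's a toy model of that five-step
sequence (this is my reading of data=ordered, not actual ext4 code; all
names are mine). It just records the order in which operations happen:

```python
# Toy model of the data=ordered commit sequence described above.
# Illustrative only; the helper names are not ext4's.

log = []

def write_in_place(blocks):
    log.append("data")      # data block to its final location, not the journal

def flush():
    log.append("flush")     # disk write cache flush (stable storage barrier)

def journal_write(change):
    log.append("journal")   # only the metadata change goes to the journal

def ack_sync():
    log.append("ack")       # application's sync() returns success

def ordered_mode_fsync(blocks, metadata_change=None):
    write_in_place(blocks)            # step 1
    flush()                           # step 2
    if metadata_change:
        journal_write(metadata_change)  # step 3 (e.g. file size grew)
        flush()                       # step 4
    ack_sync()                        # step 5

ordered_mode_fsync(["blk0"], metadata_change="i_size += 4096")
print(log)  # ['data', 'flush', 'journal', 'flush', 'ack']
```

Note that with no metadata change the journal steps drop out entirely,
which is the "just a single flush" case I mentioned.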
There is absolutely no reason for the OS to keep the data in RAM between
steps (1) and (2). It will remove data blocks from RAM as soon as they
have been sent to the storage device. So if a target reboot occurs
between (1) and (2), you're in for a bad time. Even if the OS is
notified, it won't be able to do anything because "the train has already
left the station", i.e. it doesn't have the data blocks in RAM anymore.
Its only choice is to crash because it knows it won't be able to honor
subsequent sync() requests reliably.
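The failure window is easy to model (again a toy sketch under my
assumptions, not kernel code): once the pages have been handed to the
device, the OS may evict them, so a target power cycle between steps (1)
and (2) leaves nothing anywhere to replay.

```python
# Toy model of the window between step (1) and step (2) above.
# Assumption: the OS evicts clean pages once handed to the device.

page_cache = {"blk0": b"data"}   # OS RAM
device_cache = {}                # target's volatile write cache
stable_storage = {}              # what actually survives a power cycle

# Step 1: write the data blocks to the device; the OS drops its copy.
device_cache.update(page_cache)
page_cache.clear()

# Target power-cycles here: its volatile write cache is lost.
device_cache.clear()

# Step 2: the flush can only persist what the device still holds.
stable_storage.update(device_cache)

print(stable_storage)  # {} -- the write is gone and no party has a copy
```

At this point even a perfectly-informed initiator can't fix anything;
failing loudly is the only honest option left.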
--
Etienne Dechamps
Phone: +44 74 50 65 82 17