Post by Sašo Kiselkov
Post by Etienne Dechamps
In theory, you're right. Problem is, SCSI doesn't work that way
(probably because it's primarily designed for local targets, not remote
ones). It is a very well-specified protocol with very clear semantics
with respect to link loss, reconnects, logins and the like. Write cache
loss is not one of them. According to the specs, the initiator can very
well assume that the write cache is still there after a reconnect. Any
target that doesn't behave that way is breaking SCSI specs.
Can you please point me to the spec where it says that? Because if that
is true, the spec must have been written by an idiot. "Local" is
meaningless - the target can very well have been rebooted or power
cycled between link resets (remember hot swap?).
Post by Etienne Dechamps
Philosophically, when a link to a target goes down, SCSI is optimistic
and assumes that only the link went down, not the target itself.
http://youtu.be/6F9bscdqRpo
I investigated this problem and dug into the T10 specs a year ago. I
don't remember the exact references, but if I'm not mistaken, it was a
consequence of the SCSI layering (I'm talking about
http://www.t10.org/scsi-3.htm), in which transient failures in the lower
layers (e.g. iSCSI, at the bottom of the stack) are transparent with
regard to the upper layers (e.g. SBC, at the top of the stack, which
handles write caching). IIRC (again), it comes down to the write cache
being a state of the logical unit, not the I_T nexus. Something like
that. I really don't feel like going through these hundred pages again
just for the sake of this argument.
To clarify: I'm not really saying that the target doesn't have any way
of notifying the initiator that a power cycle has occurred and that the
write cache is lost. It does (though I've never seen any iSCSI
implementation do that, which is why this is dangerous by default). The
thing is, if you do send that notification, then you're crashing your
VMs because OSes consider that an irrecoverable error. That's what
happens if you power cycle a non-redundant local disk while the OS is
using it: it tends not to like that. At all.
Crashing your VMs is probably better than corrupting data, but again,
that's not how I've seen iSCSI targets behave by default.
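For what it's worth, the notification mechanism I'm alluding to is a unit
attention: per SPC, a target power cycle surfaces on the next command as a
CHECK CONDITION with sense key UNIT ATTENTION (0x6) and ASC 0x29 ("power on,
reset, or bus device reset occurred"). A minimal sketch of how an initiator
could check for that (fixed-format sense layout; the function name is mine,
not from any real initiator):

```python
# Sketch: detect a target power cycle from SCSI fixed-format sense data.
# Per SPC: sense key 0x6 = UNIT ATTENTION, ASC 0x29 = "power on, reset,
# or bus device reset occurred". Function name is illustrative.

UNIT_ATTENTION = 0x6
ASC_POWER_ON_RESET = 0x29

def target_was_power_cycled(sense: bytes) -> bool:
    """Fixed-format sense: sense key in byte 2 (low nibble), ASC in byte 12."""
    if len(sense) < 13:
        return False
    sense_key = sense[2] & 0x0F
    asc = sense[12]
    return sense_key == UNIT_ATTENTION and asc == ASC_POWER_ON_RESET

# Example fixed-format sense buffer for a power-on unit attention:
sense = bytes([0x70, 0, 0x06, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0x29, 0x00])
print(target_was_power_cycled(sense))  # True
```

The point stands either way: reporting the unit attention faithfully just
moves the failure from silent corruption to a hard I/O error upstream.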
Post by Sašo Kiselkov
Post by Etienne Dechamps
What you can do, however, is make the target fail the cache flush in
that case, but with most initiator software that will result in
inevitable meltdown. For example, on Linux this will trigger an
irrecoverable disk I/O error which will typically cause a read-only
remount. That makes sense because in that case Linux has no choice: it
cannot reissue the writes because it doesn't have them in memory
anymore, and there's no userland API to notify the applications that
their writes are lost, so it panics and bails out.
WTF? What journaled filesystem removes disk blocks from memory right
after writing but before it has successfully committed the write? That
would seem to make the journal almost meaningless.
You're assuming that journaling filesystems will always keep the whole
data blocks in the journal. That's not necessarily the case. For
example, if my understanding of ext4 is correct (disclaimer: I have not
actually checked this), then when using the data=ordered option (which
is the default), ext4 will not write data blocks to the journal. Instead
it will do the following:
1. Write the data block to its final position (not the journal).
2. Flush to make sure it's on stable storage.
3. If the data was appended to a file, write the metadata change to the
journal (with a pointer to the new block).
4. Flush to make sure the new journal entry (if there is one) is on
stable storage.
5. Notify the upper layer (i.e. the application) that the sync() is
successful.
In fact, I believe that if the write does not change the file size, the
journal is not used at all. It's just a single flush. There's nothing
wrong with that. It certainly doesn't break sync() semantics.
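To make the ordering concrete, here's a toy model of that five-step
sequence (this is my reading of data=ordered, not actual ext4 code; all
names are mine). It just records the order in which operations happen:

```python
# Toy model of the data=ordered commit sequence described above.
# Illustrative only; the helper names are not ext4's.

log = []

def write_in_place(blocks):
    log.append("data")      # data block to its final location, not the journal

def flush():
    log.append("flush")     # disk write cache flush (stable storage barrier)

def journal_write(change):
    log.append("journal")   # only the metadata change goes to the journal

def ack_sync():
    log.append("ack")       # application's sync() returns success

def ordered_mode_fsync(blocks, metadata_change=None):
    write_in_place(blocks)            # step 1
    flush()                           # step 2
    if metadata_change:
        journal_write(metadata_change)  # step 3 (e.g. file size grew)
        flush()                       # step 4
    ack_sync()                        # step 5

ordered_mode_fsync(["blk0"], metadata_change="i_size += 4096")
print(log)  # ['data', 'flush', 'journal', 'flush', 'ack']
```

Note that with no metadata change the journal steps drop out entirely,
which is the "just a single flush" case I mentioned.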
There is absolutely no reason for the OS to keep the data in RAM between
steps (1) and (2). It will remove data blocks from RAM as soon as they
have been sent to the storage device. So if a target reboot occurs
between (1) and (2), you're in for a bad time. Even if the OS is
notified, it won't be able to do anything because "the train has already
left the station", i.e. it doesn't have the data blocks in RAM anymore.
Its only choice is to crash because it knows it won't be able to honor
subsequent sync() requests reliably.
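The failure window is easy to model (again a toy sketch under my
assumptions, not kernel code): once the pages have been handed to the
device, the OS may evict them, so a target power cycle between steps (1)
and (2) leaves nothing anywhere to replay.

```python
# Toy model of the window between step (1) and step (2) above.
# Assumption: the OS evicts clean pages once handed to the device.

page_cache = {"blk0": b"data"}   # OS RAM
device_cache = {}                # target's volatile write cache
stable_storage = {}              # what actually survives a power cycle

# Step 1: write the data blocks to the device; the OS drops its copy.
device_cache.update(page_cache)
page_cache.clear()

# Target power-cycles here: its volatile write cache is lost.
device_cache.clear()

# Step 2: the flush can only persist what the device still holds.
stable_storage.update(device_cache)

print(stable_storage)  # {} -- the write is gone and no party has a copy
```

At this point even a perfectly-informed initiator can't fix anything;
failing loudly is the only honest option left.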
--
Etienne Dechamps
Phone: +44 74 50 65 82 17