Discussion:
[developer] Hardlinks between files of different filesystems in the same zpool?
Simon Toedt
2013-08-01 21:58:53 UTC
Permalink
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?

Simon


Freddie Cash
2013-08-01 22:09:19 UTC
Permalink
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)

zfs set dedup=on filesystemA
zfs set dedup=on filesystemB

rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1

rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1

Probably not what you wanted to hear, though.

Since hardlinks work by setting inode numbers, and inode numbers are
filesystem-specific (or filesystem-independent, however you want to see
it), I don't see how this would work. Unless you want a more ZFS-specific
version of a hard-link acting below the POSIX layer.
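
For illustration, this is what you get today (pool, dataset and file names
below are made up); within a dataset ln works as always, across datasets
link(2) fails with EXDEV:

zfs create tank/fsA
zfs create tank/fsB
touch /tank/fsA/file1

ln /tank/fsA/file1 /tank/fsA/file2   # same dataset: plain old hardlink
ln /tank/fsA/file1 /tank/fsB/file1   # different dataset: fails, typically
                                     # reported as a "cross-device link" error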
--
Freddie Cash
***@gmail.com



Simon Toedt
2013-08-01 22:12:19 UTC
Permalink
Post by Freddie Cash
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)
zfs set dedup=on filesystemA
zfs set dedup=on filesystemB
rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1
rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1
Probably not what you wanted to hear, though.
Since hardlinks work by setting inode numbers, and inode numbers are filesystem-specific (or filesystem-independent, however you want to see it), I don't see how this would work. Unless you want a more ZFS-specific version of a hard-link acting below the POSIX layer.
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?

Simon
Eric Sproul
2013-08-02 13:49:29 UTC
Permalink
Post by Simon Toedt
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
I believe this is unfeasible because different filesystems may have
different properties, such as different compression algorithms or no
compression. Those transforms happen below the POSIX layer, so it
would be impractical, if not impossible, to meet the potentially
divergent requirements of those filesystems while maintaining a single
copy of a block that will work for multiple consuming filesystems.
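
For example (hypothetical pool/dataset names), two datasets in the same
pool can be configured with entirely different on-disk transforms:

zfs set compression=gzip-9 tank/fsA
zfs set compression=off tank/fsB
zfs set checksum=sha256 tank/fsA
zfs get -r compression,checksum tank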

Eric


Jim Klimov
2013-08-02 15:12:31 UTC
Permalink
Lemme jump in and shake the boat too! ;)
Post by Eric Sproul
Post by Simon Toedt
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
I believe this is unfeasible because different filesystems may have
different properties, such as different compression algorithms or no
compression. Those transforms happen below the POSIX layer, so it
would be impractical, if not impossible, to meet the potentially
divergent requirements of those filesystems while maintaining a single
copy of a block that will work for multiple consuming filesystems.
I think this particular counter-argument is quite a weak one ;)

Blocks of mixed compression and checksum properties can be written
to a dataset and read from it; these are per-block attributes which
are applied during write and happen to be "inherited" from the
dataset's current property setting. If there were hardlinked files
between datasets, I believe ZFS wouldn't have a problem appending
gzip-9 blocks to a file when addressed via one dataset, and lz4
blocks upon access from another, and no problem reading them back
either. Encryption would be a problem, but we don't have that in
illumos as of yet ;)
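
A quick way to see this in action, with a hypothetical dataset and any
compressible file at hand:

zfs create tank/demo
zfs set compression=gzip-9 tank/demo
cp /usr/dict/words /tank/demo/file         # these blocks are written gzip-9
zfs set compression=lz4 tank/demo
cat /usr/dict/words >> /tank/demo/file     # appended blocks are written lz4
cat /tank/demo/file > /dev/null            # reads back fine; compression is per-block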

A harder problem, though hardly a showstopper, would be to maintain
the globally (pool-wide) unique inode namespace. Indeed, having one
might simplify things like NFS/CIFS/lofs sub-mounts which now have
to virtualize inodes for their networked clients' consumption (who
might only see one FS mountpoint and expect it to have a single
inode namespace).

Complications arise in a different area IMHO:

1) What do we do with snapshots and clones? They would inherit the
origin's globally unique inode values, making them hardlinks to
another dataset's files?

2) How do we assign globally unique inodes to replicated datasets
(zfs send-recv)?

3) Access rights, be it at POSIX/ACL levels or at dataset "allow"
levels - should they block/permit access to a file via one path if
the file is accessible to this user via another?

4) Speaking of which, per-dataset ACL mode and/or inheritance (with
rights normally assigned to the inode entry) might become a problem
when different datasets process different ACL rulesets and modes...

5) The OP suggested a "range" of inodes - how can we pick a size
that is good for everyone? What do we do when it overflows? How do we
change the inode numbers for existing entries (which might be hardlinked
within their singular filesystems already)? Regarding this point,
I think it should be an all-or-nothing approach - either pool-wide
or per-dataset uniqueness of inode numbers :)


Possibly, some functionally similar behavior could be slapped on as
a virtualization layer (in the POSIX implementation?) which would link
together certain inodes across filesystems, maybe based on some xattr
value; when writes arrive at one of these inodes, the same writes are
automatically scheduled for the other "hard"-linked inodes and dedup
is enforced. This way such "hard"-linked files wouldn't use extra space,
and they would change atomically along all logical paths. This would
add some overhead due to dedup, but not as much as if dedup were enabled
pool-wide/dataset-wide (which would also add unique single blocks that
pollute the DDT); also, the relevant metadata which influences the
transaction write would be cached and quickly applied to the many
instances of the logically different but physically identical blocks -
again, more efficient than a typical dedup of a singular random incoming
block. Hardlinks within an FS dataset would work the same as today,
except that when you try to link a file to a file in another dataset,
the new translation layer would check whether there is already an inode
in the "new filename's" dataset and, if so, classically hardlink to it.

Being plain dedup as far as ZFS itself is concerned, this would work
around issues like ACLs (each file-access path is subject to its own
dataset's rules). There may still be some confusion around clones and
replications though - do they or do they not inherit the hard-linkage?
Would there be a tool to optionally unlink remote-hardlinked files and
remove or retain them as unique files in this dataset (perhaps with a
number of hardlinks within the dataset), either during cloning/receiving
or as a post-operation?

HTH,
//Jim Klimov



Schlacta, Christ
2013-08-02 21:12:35 UTC
Permalink
I think this is a use case for cp --reflink support. By creating a COW
file that behaves exactly like a file in a clone would, this solves at
least one use case. Having two files each have their own inode with
different access permissions and properties, but pointing to exactly the
same data on disk, is perfectly acceptable. Simply note in the manual that
this may result in security issues that may be difficult to predict or
prevent, and allow the admin to risk shooting themselves in the foot if
they so please.
Nico Williams
2013-08-02 21:34:22 UTC
Permalink
Post by Schlacta, Christ
I think this is a use case for cp --reflink support. By creating a COW
file that behaves exactly like a file in a clone would, this solves at
least one use case. Having two files each have their own inode with
different access permissions and properties, but pointing to exactly the
same data on disk, is perfectly acceptable. Simply note in the manual that
this may result in security issues that may be difficult to predict or
prevent, and allow the admin to risk shooting themselves in the foot if
they so please.
Why note that anyone who can read the thing can make a copy? It makes
no difference, security-wise, that a copy is "copy-on-write" vs. "just
a plain copy".
Schlacta, Christ
2013-08-02 21:36:48 UTC
Permalink
Case a) copy on write
Case b) same data and only one working copy.
Post by Nico Williams
Post by Schlacta, Christ
I think this is a use case for cp --reflink support. By creating a COW
file that behaves exactly like a file in a clone would, this solves at
least one use case. Having two files each have their own inode with
different access permissions and properties, but pointing to exactly the
same data on disk, is perfectly acceptable. Simply note in the manual that
this may result in security issues that may be difficult to predict or
prevent, and allow the admin to risk shooting themselves in the foot if
they so please.
Why note that anyone who can read the thing can make a copy? It makes
no difference, security-wise, that a copy is "copy-on-write" vs. "just
a plain copy".
Simon Casady
2013-08-02 15:28:35 UTC
Permalink
Compression is not much of a problem since file systems can already deal
with mixed compression. However, what do you do if one file system is
unmounted? To be a hard link, the data is only in one place, and if that
place is not mounted then it's gone, even though there may be a "link" in a
still-mounted file system. If the data is in the mounted file system then
the metadata in the unmounted system will be out of date and inconsistent
if the link is used. Otherwise it's a symlink.
Post by Eric Sproul
Post by Simon Toedt
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
I believe this is unfeasible because different filesystems may have
different properties, such as different compression algorithms or no
compression. Those transforms happen below the POSIX layer, so it
would be impractical, if not impossible, to meet the potentially
divergent requirements of those filesystems while maintaining a single
copy of a block that will work for multiple consuming filesystems.
Eric
Jim Klimov
2013-08-02 15:58:55 UTC
Permalink
Post by Simon Casady
Compression is not much of a problem since file systems can already deal
with mixed compression. However, what do you do if one file system is
unmounted? To be a hard link, the data is only in one place, and if
that place is not mounted then it's gone, even though there may be a
"link" in a still-mounted file system. If the data is in the mounted
file system then the metadata in the unmounted system will be out of
date and inconsistent if the link is used. Otherwise it's a symlink.
I think the "mountedness" of an FS dataset is also not quite a problem,
at least if the solution is similar to an idea that I've outlined in
another post. Even if user-space processes can't access a dataset, it
is still there in the pool and available for the kernel to update.
For example, while you're zfs-receiving a dataset, the new temporary
dataset is not available for userspace programs, but it is written
into by the kernel quite well; likewise, you can zfs-send an unmounted
dataset. So this in particular is not a problem either, IMHO.
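
For instance (names are hypothetical), both of these work regardless of
whether the source dataset is mounted:

zfs unmount tank/src                          # no longer visible to userland
zfs snapshot tank/src@now                     # but the kernel still operates on it
zfs send tank/src@now | zfs receive tank/dst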

Though this does bring up the question of what we should do about read-
only datasets, and/or rollbacks to earlier snapshots of one/some (but
not all) of the datasets which contain a "distributed hardlinked" file?

The user's expectation of a rollback would be to receive the state
of the FS at that past point in time. Appending to that smaller older
file, or changing it mid-way, would indeed likely corrupt data for
newer instances. Should distributed hardlinks be unlinked during
rollback? (My other post speculated about a tool for this, as well)

A similar case would be cloning of a new "live" dataset from an old
snapshot which includes hardlinks to other filesystems...
Simon Casady
2013-08-02 16:16:52 UTC
Permalink
I mostly agree, as ZFS already blurs the definition of a file system
somewhat, and since it is software you can make it do whatever. One of the
expectations of unmounting an FS is that it won't change while unmounted;
rollbacks and other magic notwithstanding, this would be violated. To
really answer the question one will need to define exactly what an FS is and
what (un)mounting and hard links do and don't do. Too much work for me.
Post by Jim Klimov
Post by Simon Casady
Compression is not much of a problem since file systems can already deal
with mixed compression. However, what do you do if one file system is
unmounted? To be a hard link, the data is only in one place, and if
that place is not mounted then it's gone, even though there may be a
"link" in a still-mounted file system. If the data is in the mounted
file system then the metadata in the unmounted system will be out of
date and inconsistent if the link is used. Otherwise it's a symlink.
I think the "mountedness" of an FS dataset is also not quite a problem,
at least if the solution is similar to an idea that I've outlined in
another post. Even if user-space processes can't access a dataset, it
is still there in the pool and available for the kernel to update.
For example, while you're zfs-receiving a dataset, the new temporary
dataset is not available for userspace programs, but it is written
into by the kernel quite well; likewise, you can zfs-send an unmounted
dataset. So this in particular is not a problem either, IMHO.
Though this does bring up a question of what should we do about read-
only datasets, and/or rollbacks to earlier snapshots of one/some (but
not all) of datasets which contain a "distributed hardlinked" file?
The user's expectation of a rollback would be to receive the state
of the FS at that past point in time. Appending to that smaller older
file, or changing it mid-way, would indeed likely corrupt data for
newer instances. Should distributed hardlinks be unlinked during
rollback? (My other post speculated about a tool for this, as well)
A similar case would be cloning of a new "live" dataset from an old
snapshot which includes hardlinks to other filesystems...
Garrett D'Amore
2013-08-03 16:15:00 UTC
Permalink
Post by Simon Toedt
Post by Freddie Cash
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)

zfs set dedup=on filesystemA
zfs set dedup=on filesystemB

rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1

rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1

Probably not what you wanted to hear, though.

Since hardlinks work by setting inode numbers, and inode numbers are filesystem-specific (or filesystem-independent, however you want to see it), I don't see how this would work. Unless you want a more ZFS-specific version of a hard-link acting below the POSIX layer.
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?


ZFS doesn't have "inodes" per se (it has "znodes"), but when they are exposed as inodes to applications like "find", the numbers are limited to 32-bits. Of course, even in a single pool, ZFS can have more than 4 billion files, which means you can indeed have "inode collisions".

Applications that rely on inodes need to take extra caution to make sure that the entire inode matches, not just the number.

I *think* that znode numbers are a system-wide resource, not allocated per filesystem. They are allocated out of the DMU / dnode layer, as far as I can tell, which is well beneath the filesystem/object layer. These are 64-bit quantities, so again, you have the considerations about 32-bit space collisions.

- Garrett


Simon Toedt
2013-08-03 16:22:45 UTC
Permalink
Post by Garrett D'Amore
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)
zfs set dedup=on filesystemA
zfs set dedup=on filesystemB
rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1
rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1
Probably not what you wanted to hear, though.
Since hardlinks work by setting inode numbers, and inode numbers are
filesystem-specific (or filesystem-independent, however you want to see it),
I don't see how this would work. Unless you want a more ZFS-specific version
of a hard-link acting below the POSIX layer.
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
ZFS doesn't have "inodes" per se (it has "znodes"), but the way they are
exposed as inodes to applications like "find", the numbers are limited to
32-bits. Of course, even in a single pool, ZFS can have more than 4 billion
files, which means you can indeed have "inode collisions".
But the device number is then different, right? Otherwise you'd
break NFSv2/3 and almost every ftw() implementation under this sun...
Post by Garrett D'Amore
Applications that rely on inodes need to take extra caution to make sure
that the entire inode matches, not just the number.
How should the application do this?
Post by Garrett D'Amore
I *think* that znode numbers are a system wide resource,
System-wide or zpool-wide?
Post by Garrett D'Amore
not allocated per
filesystem. They are allocated out of the DMU / dnode layer, as far as I
can tell, which is well beneath the filesystem/object layer. These are
64-bit quantities, so again, you have the considerations about 32-bit space
collisions.
Isn't ino_t really an ino64_t for a 64-bit process?

Simon


Joerg Schilling
2013-08-03 16:43:07 UTC
Permalink
Post by Simon Toedt
Post by Garrett D'Amore
exposed as inodes to applications like "find", the numbers are limited to
32-bits. Of course, even in a single pool, ZFS can have more than 4 billion
files, which means you can indeed have "inode collisions".
But the device number is then different, right? Otherwise you'd
break NFSv2/3 and almost every ftw() implementation under this sun...
If he were right, then ZFS would be broken...
Post by Simon Toedt
Post by Garrett D'Amore
not allocated per
filesystem. They are allocated out of the DMU / dnode layer, as far as I
can tell, which is well beneath the filesystem/object layer. These are
64-bit quantities, so again, you have the considerations about 32-bit space
collisions.
Isn't ino_t really an ino64_t for a 64-bit process?
I remember that ZFS uses 32-bit inode numbers as long as there are fewer
than 4 billion files. Once there are 64-bit inode numbers, a 32-bit process
cannot open related files.

Jörg
--
EMail:***@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
***@cs.tu-berlin.de (uni)
***@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Matthew Ahrens
2013-08-04 19:20:33 UTC
Permalink
On Sat, Aug 3, 2013 at 9:43 AM, Joerg Schilling wrote:
Post by Joerg Schilling
Post by Simon Toedt
Post by Garrett D'Amore
exposed as inodes to applications like "find", the numbers are limited to
32-bits. Of course, even in a single pool, ZFS can have more than 4 billion
files, which means you can indeed have "inode collisions".
But the device number is then different, right? Otherwise you'd
break NFSv2/3 and almost every ftw() implementation under this sun...
If he were right, then ZFS would be broken...
Post by Simon Toedt
Post by Garrett D'Amore
not allocated per filesystem. They are allocated out of the DMU / dnode
layer, as far as I can tell, which is well beneath the filesystem/object
layer. These are 64-bit quantities, so again, you have the considerations
about 32-bit space collisions.
Isn't ino_t really an ino64_t for a 64-bit process?
I remember that ZFS uses 32-bit inode numbers as long as there are fewer
than 4 billion files. Once there are 64-bit inode numbers, a 32-bit
process cannot open related files.
I'm pretty sure that 32-bit processes can open files with large (>32-bit)
inode numbers; they just can't stat() them.

--matt



Matthew Ahrens
2013-08-04 19:15:55 UTC
Permalink
Post by Garrett D'Amore
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)
zfs set dedup=on filesystemA
zfs set dedup=on filesystemB
rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1
rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1
Probably not what you wanted to hear, though.
Since hardlinks work by setting inode numbers, and inode numbers are
filesystem-specific (or filesystem-independent, however you want to see
it), I don't see how this would work. Unless you want a more ZFS-specific
version of a hard-link acting below the POSIX layer.
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
ZFS doesn't have "inodes" per se (it has "znodes"), but the way they are
exposed as inodes to applications like "find", the numbers are limited to
32-bits.
That is not correct. 64-bit applications, or 32-bit applications using
stat64, see 64-bit inode numbers. See the definitions of struct stat and
struct stat64.
Post by Garrett D'Amore
Of course, even in a single pool, ZFS can have more than 4 billion files,
which means you can indeed have "inode collisions".
That is not correct. If you have more than 4 billion files, then inode
numbers will exceed 32-bits, so 32-bit applications calling stat() will get
EOVERFLOW, as documented in the manpage:

EOVERFLOW
            The file size in bytes or the number of blocks
            allocated to the file or the file serial number
            cannot be represented correctly in the structure
            pointed to by buf.
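
For completeness: the usual way for a 32-bit application to cope is to
build with the large-file compilation environment, which maps it onto
stat64() and a 64-bit ino_t (the flags are whatever getconf reports on
your machine; "myapp.c" is just a stand-in):

getconf LFS_CFLAGS          # typically -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
cc $(getconf LFS_CFLAGS) -o myapp myapp.c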
Post by Garrett D'Amore
Applications that rely on inodes need to take extra caution to make sure
that the entire inode matches, not just the number.
I don't know what this means. Applications do not have access to the
"entire inode". They should check that the inode number and filesystem
match, as it has been for the past 20+ years.
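
A crude way to check both halves of that identity from the shell (the
paths are hypothetical):

ls -i /tank/fsA/file /tank/fsB/file   # inode numbers alone can collide across datasets
df /tank/fsA/file /tank/fsB/file      # shows which filesystem (st_dev) each path lives on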
Post by Garrett D'Amore
I *think* that znode numbers are a system wide resource, not allocated per
filesystem.
That is not correct. Inode (and znode) numbers are specific to each
filesystem (as they have been on UFS, etc.)
Post by Garrett D'Amore
They are allocated out of the DMU / dnode layer, as far as I can tell,
which is well beneath the filesystem/object layer.
The DMU *is* the object layer, which is just below the filesystem (ZPL).

--matt



Darren Reed
2013-08-05 21:33:48 UTC
Permalink
Post by Garrett D'Amore
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
Sounds like dedupe to me. ;)
zfs set dedup=on filesystemA
zfs set dedup=on filesystemB
rm filesystemB/file1
cp filesystemA/file1 filesystemB/file1
rm filesystemA/file1
cp filesystemB/file1 filesystemA/file1
Probably not what you wanted to hear, though.
Since hardlinks work by setting inode numbers, and inode numbers are filesystem-specific (or filesystem-independent, however you want to see it), I don't see how this would work. Unless you want a more ZFS-specific version of a hard-link acting below the POSIX layer.
No, the point is: can files in a single zpool share a single
inode number? Like having a range of inode numbers reserved for
hardlinks across filesystems of the same zpool?
ZFS doesn't have "inodes" per se (it has "znodes"), but the way they are exposed as inodes to applications like "find", the numbers are limited to 32-bits. Of course, even in a single pool, ZFS can have more than 4 billion files, which means you can indeed have "inode collisions".
Applications that rely on inodes need to take extra caution to make sure that the entire inode matches, not just the number.
The inode is per filesystem and I don't think that matching inodes could ever be taken as an indication that two files are the same. Thinking back to tools like tripwire, a file is always identified by (devid, inode).

The goal here is to be able to do:

$ ln /zpool/zfs1/file /zpool/zfs2/file

or

$ mv /zpool/zfs1/file /zpool/zfs2/file

an operation that takes the amount of time required to write a new znode rather than the time it takes to copy however many GB of data is in the file (as would be required with cp).

I suppose the question is, can two znodes refer to the same object in a zpool if each znode is in a different filesystem?

Darren




Nico Williams
2013-08-05 22:16:47 UTC
Permalink
Post by Darren Reed
The inode is per filesystem and I don't think that matching inodes could
ever be taken as an indication that two files are the same. Thinking back to
tools like tripwire, a file is always identified by (devid, inode).
Backup software, things like rdist, rsync, git, and so on, almost
certainly compare {path, st_dev, st_ino, st_mtime, st_ctime}.
Post by Darren Reed
$ ln /zpool/zfs1/file /zpool/zfs2/file
or
$ mv /zpool/zfs1/file /zpool/zfs2/file
an operation that takes the amount of time required to write a new znode
rather than the time it takes to copy however many GB of data is in the file
(as would be required with cp).
To stick to POSIX it'd have to be "cp is really fast", not
"cross-filesystem hardlinks" nor "cross-filesystem renames". But
perhaps you can interpret POSIX so as to allow it (if it's missing text
about {st_dev, st_ino} uniqueness? or if the file's inode # is not in
use in the target filesystem? or if the link(2)/rename(2) changes the
inode number at the target?!).
Post by Darren Reed
I suppose the question is, can two znodes refer to the same object in a
zpool if each znode is in a different filesystem?
The answer today is "no". They can refer to the same *data*, via
dedup (or snapshots/clones), but that's it today.
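
The clone case, for instance (names hypothetical):

zfs snapshot tank/fsA@base
zfs clone tank/fsA@base tank/fsA-clone
# files under /tank/fsA and /tank/fsA-clone now share the same on-disk
# blocks until either side is modified, but each has its own znode/inode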

Nico
--

Nico Williams
2013-08-02 19:42:28 UTC
Permalink
Post by Simon Toedt
Subject says it all. Would it be possible implementation-wise to
support hardlinks between different filesystems in the same zpool?
This often comes up in the guise of supporting rename(2) across
datasets in the same pool.

The answer is that it'd be really difficult for various reasons, such
as the fact that each dataset has its own dnode number namespace, and
the way accounting is done.

With dedup a cross-dataset copy would work just fine though. But
link(2)'s semantics aren't "copy" semantics, so you'd need a new
system call, and for it to get wide usage quickly you'd need cp(1) to
use it where possible.
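
Something along the lines of the GNU coreutils interface (not available
in illumos cp today; shown only to illustrate the kind of knob cp(1)
would need):

cp --reflink=always srcfile dstfile   # share data blocks instead of copying,
                                      # where the filesystem supports it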

Nico
--
Nico Williams
2013-08-02 19:55:26 UTC
Permalink
FYI, each dataset has its own dnode namespace and its own FUID table.
The latter is needed because different datasets might be in different
zones that have different ID mapping schemes and might even have
conflicting SIDs for all we care.

You can at best hope for copy-sharing of file indirect and data
blocks, using dedup for accounting for the extra references. Doesn't
mean you'd have to have dedup enabled, but such a copy operation must
update the DDT.
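
(For reference, the DDT such an operation would have to update can be
inspected today with zdb; the pool name is hypothetical:)

zdb -DD tank    # dedup table statistics and histogram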
Nico Williams
2013-08-02 20:07:25 UTC
Permalink
Although, to be fair, even in the face of SID conflicts, a single
pool-wide FUID table would work, and it'd have been better if it'd
been that way.