Discussion:
restore_object vs dnode_sync
Andriy Gapon
2013-04-01 21:37:42 UTC
Permalink
It appears that restore_object restores an object in two transactions: one for a
dnode and another for a bonus data:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dmu_send.c#1084

dmu_object_claim is explicitly called in a separate transaction and
dmu_object_reclaim starts and commits a transaction internally.

I think that it is possible that the transactions may get assigned to two
different transaction groups and thus a dnode with un-initialized bonus data may
be seen by dnode_sync. That could trip over an assert about a type of the bonus.

I am not confident, but I suspect that the above could explain the following:
http://thread.gmane.org/gmane.os.freebsd.current/148819/focus=148822

What do you think?

P.S. I also presume that when a new dnode is allocated its dn_phys / on-disk
data may initially contain some garbage.
--
Andriy Gapon
Matthew Ahrens
2013-04-02 01:58:18 UTC
Permalink
Post by Andriy Gapon
It appears that restore_object restores an object in two transactions: one for a
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dmu_send.c#1084
dmu_object_claim is explicitly called in a separate transaction and
dmu_object_reclaim starts and commits a transaction internally.
I think that it is possible that the transactions may get assigned to two
different transaction groups
Yes.
Post by Andriy Gapon
and thus a dnode with un-initialized bonus data may
be seen by dnode_sync.
I don't see how it could be uninitialized. However, I think it's possible,
in the dnode_reallocate() case, that the bonus data is stale -- it has the
contents that it had in its previous incarnation.
Post by Andriy Gapon
That could trip over an assert about a type of the bonus.
http://thread.gmane.org/gmane.os.freebsd.current/148819/focus=148822
What do you think?
It seems plausible, and easy enough to attempt to reproduce -- just put a
delay (or txg_wait_open()!) between those two transactions. Even if this
particular crash is not caused by this bug, It does seem like given the way
the user space accounting works, the object needs to be allocated and have
its bonus buffer filled in as part of the same transaction.
Post by Andriy Gapon
P.S. I also presume that when a new dnode is allocated its dn_phys / on-disk
data may initially contain some garbage.
No, the unused fields of a newly-allocated dnode_phys_t should be all zero.

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Andriy Gapon
2013-04-02 11:10:46 UTC
Permalink
Post by Andriy Gapon
It appears that restore_object restores an object in two transactions: one for a
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dmu_send.c#1084
dmu_object_claim is explicitly called in a separate transaction and
dmu_object_reclaim starts and commits a transaction internally.
I think that it is possible that the transactions may get assigned to two
different transaction groups
Yes.
and thus a dnode with un-initialized bonus data may
be seen by dnode_sync.
I don't see how it could be uninitialized. However, I think it's possible, in the
dnode_reallocate() case, that the bonus data is stale -- it has the contents that
it had in its previous incarnation.
I think that this also means that in dnode_allocate case the bonus data may be all
zeros. And that could be invalid e.g. with respect to sa magic validation.
Post by Andriy Gapon
That could trip over an assert about a type of the bonus.
http://thread.gmane.org/gmane.os.freebsd.current/148819/focus=148822
What do you think?
It seems plausible, and easy enough to attempt to reproduce -- just put a delay
(or txg_wait_open()!) between those two transactions. Even if this particular
crash is not caused by this bug, It does seem like given the way the user space
accounting works, the object needs to be allocated and have its bonus buffer
filled in as part of the same transaction.
Additionally, I think that if a crash happens such that the first transaction
group makes it to permanent storage but the second transaction group is lost, then
we could end up with an invalid ("partially valid") on-disk dnode.
Very low probability, of course.
Post by Andriy Gapon
P.S. I also presume that when a new dnode is allocated its dn_phys / on-disk
data may initially contain some garbage.
No, the unused fields of a newly-allocated dnode_phys_t should be all zero.
Ah, right! There is even an assertion for that.

Thank you very much for the information and suggestions!
--
Andriy Gapon
Andriy Gapon
2013-04-06 17:02:51 UTC
Permalink
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing the issue and
filing the bug report! Will you be working to get the patch integrated into
illumos? I believe the next step would be to post the patch or webrev on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on illumos. I can help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development environment.
So I will certainly appreciate the help.
In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very large. I
think we'll need to do the dmu_free_long_range() from restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range as I expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses dnode_free_range and it
is called from lots of places, so I decided so it was not that horrible. But
certainly dmu_free_long_range is preferable for performance.

I've just looked again at the dmu_object_reclaim code and it seems that
splitting off of the dmu_free_long_range code requires either inlining a good
chunk of dmu_object_reclaim into restore_object or creating a new function that
would duplicate a bit of code from dmu_object_reclaim. I mean the calculations
and checks that precede dmu_free_long_range call.

What do you think?
--
Andriy Gapon
Matthew Ahrens
2013-04-08 19:04:04 UTC
Permalink
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing the issue
and
filing the bug report! Will you be working to get the patch integrated
into
illumos? I believe the next step would be to post the patch or webrev
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on illumos. I can
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development environment.
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just saw one issue
In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very
large. I
think we'll need to do the dmu_free_long_range() from restore_object(),
before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range as I expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses dnode_free_range and
it
is called from lots of places, so I decided so it was not that horrible.
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small objects
from the sync thread (e.g. bpobj_free()), where we don't really have a
choice anyway; or small objects from the ZPL (e.g. zfs_znode_delete, where
it should have already been free_long_range()'d). The one exception would
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.

restore_object() could easily be overwriting an arbitrarily large object,
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily long.
I've just looked again at the dmu_object_reclaim code and it seems that
splitting off of the dmu_free_long_range code requires either inlining a good
chunk of dmu_object_reclaim into restore_object or creating a new function that
would duplicate a bit of code from dmu_object_reclaim. I mean the calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the reclaim() is
really needed, based if the types, block sizes, bonus type and bonus
lengths all match. And we can get this from dmu_object_info().
Essentially, restore_object() will figure out if this is indeed a
different generation of this object #, and if so call dmu_free_long_range()
and then dmu_object_reclaim().

In other words, you need to handle this case in restore_object:
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}

But you don't need to to the other stuff, like the nblkptr ||datablksz
check, which is just an unnecessary optimization. The end result might
actually be less code as dmu_object_reclaim() won't need to compute nblkptr
and do the free_long_range() anymore.

--matt



-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Matthew Ahrens
2013-11-22 20:14:42 UTC
Permalink
Andriy, what's the status of this? Were you able to look into the
free_long_range issue?

--matt
Post by Andriy Gapon
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing the
issue and
filing the bug report! Will you be working to get the patch integrated
into
illumos? I believe the next step would be to post the patch or webrev
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on illumos. I can
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development environment.
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just saw one
In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very
large. I
think we'll need to do the dmu_free_long_range() from restore_object(),
before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range as I expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses dnode_free_range
and it
is called from lots of places, so I decided so it was not that horrible.
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small objects
from the sync thread (e.g. bpobj_free()), where we don't really have a
choice anyway; or small objects from the ZPL (e.g. zfs_znode_delete, where
it should have already been free_long_range()'d). The one exception would
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large object,
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily long.
I've just looked again at the dmu_object_reclaim code and it seems that
splitting off of the dmu_free_long_range code requires either inlining a good
chunk of dmu_object_reclaim into restore_object or creating a new function that
would duplicate a bit of code from dmu_object_reclaim. I mean the calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the reclaim() is
really needed, based if the types, block sizes, bonus type and bonus
lengths all match. And we can get this from dmu_object_info().
Essentially, restore_object() will figure out if this is indeed a
different generation of this object #, and if so call dmu_free_long_range()
and then dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr ||datablksz
check, which is just an unnecessary optimization. The end result might
actually be less code as dmu_object_reclaim() won't need to compute nblkptr
and do the free_long_range() anymore.
--matt
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Andriy Gapon
2013-11-30 10:20:04 UTC
Permalink
Post by Matthew Ahrens
Andriy, what's the status of this? Were you able to look into the
free_long_range issue?
I have been procrastinating about this simple issue for far too long.
Thanks for giving me a nudge ;-)

Could you please check if the following change really follows what you suggested:
http://people.freebsd.org/~avg/restore_object.v2.diff

The change reduces dmu_object_reclaim() to a very trivial wrapper around
dnode_reallocate(). Perhaps it should be reduced even further to assert, rather
than check, that it is called only when at least one of the essential attributes
is mutating.

dmu_free_long_range() is now called from restore_object().
dn_nblkptr check has been dropped, so dmu_free_long_range() is called in more
cases than before.

dmu_object_claim()/dmu_object_reclaim() are now called from the same transaction
as the bonus manipulations, which is the primary goal of the change.
Post by Matthew Ahrens
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing the
issue and
filing the bug report! Will you be working to get the patch
integrated into
illumos? I believe the next step would be to post the patch or webrev
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I
haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on illumos. I can
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development environment.
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just saw one
In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very
large. I
think we'll need to do the dmu_free_long_range() from
restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range as I
expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses dnode_free_range
and it
is called from lots of places, so I decided so it was not that horrible.
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small objects
from the sync thread (e.g. bpobj_free()), where we don't really have a
choice anyway; or small objects from the ZPL (e.g. zfs_znode_delete, where
it should have already been free_long_range()'d). The one exception would
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large object,
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily long.
I've just looked again at the dmu_object_reclaim code and it seems that
splitting off of the dmu_free_long_range code requires either inlining a
good
chunk of dmu_object_reclaim into restore_object or creating a new
function that
would duplicate a bit of code from dmu_object_reclaim. I mean the
calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the reclaim() is
really needed, based if the types, block sizes, bonus type and bonus lengths
all match. And we can get this from dmu_object_info(). Essentially,
restore_object() will figure out if this is indeed a different generation of
this object #, and if so call dmu_free_long_range() and then
dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr ||datablksz
check, which is just an unnecessary optimization. The end result might
actually be less code as dmu_object_reclaim() won't need to compute nblkptr
and do the free_long_range() anymore.
--matt
--
Andriy Gapon
Matthew Ahrens
2013-12-04 00:25:45 UTC
Permalink
Yes, that diff is basically what I had in mind.

--matt
Post by Andriy Gapon
Post by Matthew Ahrens
Andriy, what's the status of this? Were you able to look into the
free_long_range issue?
I have been procrastinating about this simple issue for far too long.
Thanks for giving me a nudge ;-)
http://people.freebsd.org/~avg/restore_object.v2.diff
The change reduces dmu_object_reclaim() to a very trivial wrapper around
dnode_reallocate(). Perhaps it should be reduced even further to assert, rather
than check, that it is called only when at least one of the essential attributes
is mutating.
dmu_free_long_range() is now called from restore_object().
dn_nblkptr check has been dropped, so dmu_free_long_range() is called in more
cases than before.
dmu_object_claim()/dmu_object_reclaim() are now called from the same transaction
as the bonus manipulations, which is the primary goal of the change.
Post by Matthew Ahrens
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing
the
Post by Matthew Ahrens
issue and
filing the bug report! Will you be working to get the patch
integrated into
illumos? I believe the next step would be to post the patch
or webrev
Post by Matthew Ahrens
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I
haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on
illumos. I can
Post by Matthew Ahrens
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development
environment.
Post by Matthew Ahrens
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just saw
one
Post by Matthew Ahrens
In dmu_object_reclaim(), you've changed the
dmu_free_long_range() to a
Post by Matthew Ahrens
dnode_free_range(). This will cause a problem is the object
is very
Post by Matthew Ahrens
large. I
think we'll need to do the dmu_free_long_range() from
restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range
as I
Post by Matthew Ahrens
expected
that there was a good reason to use the former in the first
place.
Post by Matthew Ahrens
On the other hand, I noticed that dmu_object_free uses
dnode_free_range
Post by Matthew Ahrens
and it
is called from lots of places, so I decided so it was not that
horrible.
Post by Matthew Ahrens
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small
objects
Post by Matthew Ahrens
from the sync thread (e.g. bpobj_free()), where we don't really have
a
Post by Matthew Ahrens
choice anyway; or small objects from the ZPL (e.g. zfs_znode_delete,
where
Post by Matthew Ahrens
it should have already been free_long_range()'d). The one exception
would
Post by Matthew Ahrens
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large
object,
Post by Matthew Ahrens
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily
long.
Post by Matthew Ahrens
I've just looked again at the dmu_object_reclaim code and it
seems that
Post by Matthew Ahrens
splitting off of the dmu_free_long_range code requires either
inlining a
Post by Matthew Ahrens
good
chunk of dmu_object_reclaim into restore_object or creating a new
function that
would duplicate a bit of code from dmu_object_reclaim. I mean
the
Post by Matthew Ahrens
calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the reclaim()
is
Post by Matthew Ahrens
really needed, based if the types, block sizes, bonus type and bonus
lengths
Post by Matthew Ahrens
all match. And we can get this from dmu_object_info(). Essentially,
restore_object() will figure out if this is indeed a different
generation of
Post by Matthew Ahrens
this object #, and if so call dmu_free_long_range() and then
dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr
||datablksz
Post by Matthew Ahrens
check, which is just an unnecessary optimization. The end result
might
Post by Matthew Ahrens
actually be less code as dmu_object_reclaim() won't need to compute
nblkptr
Post by Matthew Ahrens
and do the free_long_range() anymore.
--matt
--
Andriy Gapon
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Matthew Ahrens
2014-01-17 23:23:19 UTC
Permalink
Hey Andriy, have you had a chance to finish up this work? Has it been
integrated into FreeBSD? Do you need help getting it into illumos?

--matt
Post by Matthew Ahrens
Yes, that diff is basically what I had in mind.
--matt
Post by Andriy Gapon
Post by Matthew Ahrens
Andriy, what's the status of this? Were you able to look into the
free_long_range issue?
I have been procrastinating about this simple issue for far too long.
Thanks for giving me a nudge ;-)
http://people.freebsd.org/~avg/restore_object.v2.diff
The change reduces dmu_object_reclaim() to a very trivial wrapper around
dnode_reallocate(). Perhaps it should be reduced even further to assert, rather
than check, that it is called only when at least one of the essential attributes
is mutating.
dmu_free_long_range() is now called from restore_object().
dn_nblkptr check has been dropped, so dmu_free_long_range() is called in more
cases than before.
dmu_object_claim()/dmu_object_reclaim() are now called from the same transaction
as the bonus manipulations, which is the primary goal of the change.
Post by Matthew Ahrens
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for
diagnosing the
Post by Matthew Ahrens
issue and
filing the bug report! Will you be working to get the patch
integrated into
illumos? I believe the next step would be to post the patch
or webrev
Post by Matthew Ahrens
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because
I
Post by Matthew Ahrens
haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on
illumos. I can
Post by Matthew Ahrens
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development
environment.
Post by Matthew Ahrens
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just
saw one
Post by Matthew Ahrens
In dmu_object_reclaim(), you've changed the
dmu_free_long_range() to a
Post by Matthew Ahrens
dnode_free_range(). This will cause a problem is the object
is very
Post by Matthew Ahrens
large. I
think we'll need to do the dmu_free_long_range() from
restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range
as I
Post by Matthew Ahrens
expected
that there was a good reason to use the former in the first
place.
Post by Matthew Ahrens
On the other hand, I noticed that dmu_object_free uses
dnode_free_range
Post by Matthew Ahrens
and it
is called from lots of places, so I decided so it was not that
horrible.
Post by Matthew Ahrens
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small
objects
Post by Matthew Ahrens
from the sync thread (e.g. bpobj_free()), where we don't really
have a
Post by Matthew Ahrens
choice anyway; or small objects from the ZPL (e.g.
zfs_znode_delete, where
Post by Matthew Ahrens
it should have already been free_long_range()'d). The one
exception would
Post by Matthew Ahrens
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large
object,
Post by Matthew Ahrens
which would cause substantial performance problems without the use
of
Post by Matthew Ahrens
dmu_free_long_range(). e.g. the txg sync could take arbitrarily
long.
Post by Matthew Ahrens
I've just looked again at the dmu_object_reclaim code and it
seems that
Post by Matthew Ahrens
splitting off of the dmu_free_long_range code requires either
inlining a
Post by Matthew Ahrens
good
chunk of dmu_object_reclaim into restore_object or creating a
new
Post by Matthew Ahrens
function that
would duplicate a bit of code from dmu_object_reclaim. I mean
the
Post by Matthew Ahrens
calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the
reclaim() is
Post by Matthew Ahrens
really needed, based if the types, block sizes, bonus type and
bonus lengths
Post by Matthew Ahrens
all match. And we can get this from dmu_object_info().
Essentially,
Post by Matthew Ahrens
restore_object() will figure out if this is indeed a different
generation of
Post by Matthew Ahrens
this object #, and if so call dmu_free_long_range() and then
dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr
||datablksz
Post by Matthew Ahrens
check, which is just an unnecessary optimization. The end result
might
Post by Matthew Ahrens
actually be less code as dmu_object_reclaim() won't need to compute
nblkptr
Post by Matthew Ahrens
and do the free_long_range() anymore.
--matt
--
Andriy Gapon
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Andriy Gapon
2014-01-21 16:06:24 UTC
Permalink
Post by Matthew Ahrens
Hey Andriy, have you had a chance to finish up this work? Has it been
integrated into FreeBSD? Do you need help getting it into illumos?
Matt,

I have not integrated this change into FreeBSD yet.
However it is committed to HybridCluster tree and seems to hold well in our
usage. The committed change is exactly the same as I posted here.
I would love to see the change in illumos if you consider it to be good and worthy.

Thanks!
Post by Matthew Ahrens
Yes, that diff is basically what I had in mind.
--matt
Post by Matthew Ahrens
Andriy, what's the status of this? Were you able to look into the
free_long_range issue?
I have been procrastinating about this simple issue for far too long.
Thanks for giving me a nudge ;-)
Could you please check if the following change really follows what you
http://people.freebsd.org/~avg/restore_object.v2.diff
The change reduces dmu_object_reclaim() to a very trivial wrapper around
dnode_reallocate(). Perhaps it should be reduced even further to
assert, rather
than check, that it is called only when at least one of the essential
attributes
is mutating.
dmu_free_long_range() is now called from restore_object().
dn_nblkptr check has been dropped, so dmu_free_long_range() is called in
more
cases than before.
dmu_object_claim()/dmu_object_reclaim() are now called from the same
transaction
as the bonus manipulations, which is the primary goal of the change.
Post by Matthew Ahrens
Andriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for
diagnosing the
Post by Matthew Ahrens
issue and
filing the bug report! Will you be working to get the patch
integrated into
illumos? I believe the next step would be to post the patch
or webrev
Post by Matthew Ahrens
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I
haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on
illumos. I can
Post by Matthew Ahrens
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development
environment.
Post by Matthew Ahrens
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just
saw one
Post by Matthew Ahrens
In dmu_object_reclaim(), you've changed the
dmu_free_long_range() to a
Post by Matthew Ahrens
dnode_free_range(). This will cause a problem is the object
is very
Post by Matthew Ahrens
large. I
think we'll need to do the dmu_free_long_range() from
restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs
dnode_free_range as I
Post by Matthew Ahrens
expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses
dnode_free_range
Post by Matthew Ahrens
and it
is called from lots of places, so I decided so it was not that
horrible.
Post by Matthew Ahrens
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small
objects
Post by Matthew Ahrens
from the sync thread (e.g. bpobj_free()), where we don't really have a
choice anyway; or small objects from the ZPL (e.g.
zfs_znode_delete, where
Post by Matthew Ahrens
it should have already been free_long_range()'d). The one
exception would
Post by Matthew Ahrens
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large
object,
Post by Matthew Ahrens
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily long.
I've just looked again at the dmu_object_reclaim code and it
seems that
Post by Matthew Ahrens
splitting off of the dmu_free_long_range code requires either
inlining a
Post by Matthew Ahrens
good
chunk of dmu_object_reclaim into restore_object or creating a new
function that
would duplicate a bit of code from dmu_object_reclaim. I mean the
calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the
reclaim() is
Post by Matthew Ahrens
really needed, based if the types, block sizes, bonus type and
bonus lengths
Post by Matthew Ahrens
all match. And we can get this from dmu_object_info(). Essentially,
restore_object() will figure out if this is indeed a different
generation of
Post by Matthew Ahrens
this object #, and if so call dmu_free_long_range() and then
dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr ||datablksz
check, which is just an unnecessary optimization. The end result
might
Post by Matthew Ahrens
actually be less code as dmu_object_reclaim() won't need to
compute nblkptr
Post by Matthew Ahrens
and do the free_long_range() anymore.
--matt
--
Andriy Gapon
--
Andriy Gapon
Matthew Ahrens
2013-04-05 19:00:00 UTC
Permalink
Andriy, I just saw the bug that you filed, with the patch (
https://www.illumos.org/issues/3693). Thanks for diagnosing the issue and
filing the bug report! Will you be working to get the patch integrated
into illumos? I believe the next step would be to post the patch or webrev
on this mailing list to solicit code reviews.

We will also need to do at least some basic testing on illumos. I can help
with that if you haven't already done it.

I took a quick look at the patch and it looks good, I just saw one issue
with it:

In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very large.
I think we'll need to do the dmu_free_long_range() from restore_object(),
before creating the transaction.

--matt
Post by Andriy Gapon
Post by Andriy Gapon
It appears that restore_object restores an object in two
transactions: one for a
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dmu_send.c#1084
Post by Andriy Gapon
dmu_object_claim is explicitly called in a separate transaction and
dmu_object_reclaim starts and commits a transaction internally.
I think that it is possible that the transactions may get assigned
to two
Post by Andriy Gapon
different transaction groups
Yes.
and thus a dnode with un-initialized bonus data may
be seen by dnode_sync.
I don't see how it could be uninitialized. However, I think it's
possible, in the
Post by Andriy Gapon
dnode_reallocate() case, that the bonus data is stale -- it has the
contents that
Post by Andriy Gapon
it had in its previous incarnation.
I think that this also means that in dnode_allocate case the bonus data may be all
zeros. And that could be invalid e.g. with respect to sa magic validation.
Post by Andriy Gapon
That could trip over an assert about a type of the bonus.
I am not confident, but I suspect that the above could explain the
http://thread.gmane.org/gmane.os.freebsd.current/148819/focus=148822
What do you think?
It seems plausible, and easy enough to attempt to reproduce -- just put
a delay
Post by Andriy Gapon
(or txg_wait_open()!) between those two transactions. Even if this
particular
Post by Andriy Gapon
crash is not caused by this bug, It does seem like given the way the
user space
Post by Andriy Gapon
accounting works, the object needs to be allocated and have its bonus
buffer
Post by Andriy Gapon
filled in as part of the same transaction.
Additionally, I think that if a crash happens such that the first transaction
group makes it to permanent storage but the second transaction group is lost, then
we could end up with an invalid ("partially valid") on-disk dnode.
Very low probability, of course.
Post by Andriy Gapon
P.S. I also presume that when a new dnode is allocated its dn_phys /
on-disk
Post by Andriy Gapon
data may initially contain some garbage.
No, the unused fields of a newly-allocated dnode_phys_t should be all
zero.
Ah, right! There is even an assertion for that.
Thank you very much for the information and suggestions!
--
Andriy Gapon
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/23047029-187a0c8d
Modify Your Subscription: https://www.listbox.com/member/?member_id=23047029&id_secret=23047029-2e85923f
Powered by Listbox: http://www.listbox.com
Loading...