I have been procrastinating about this simple issue for far too long.
Could you please check if the following change really follows what you suggested:
dnode_reallocate(). Perhaps it should be reduced even further to assert, rather
is mutating.
dmu_free_long_range() is now called from restore_object().
cases than before.
as the bonus manipulations, which is the primary goal of the change.
Post by Matthew AhrensAndriy, I just saw the bug that you filed, with the patch
(https://www.illumos.org/issues/3693). Thanks for diagnosing the
issue and
filing the bug report! Will you be working to get the patch
integrated into
illumos? I believe the next step would be to post the patch or webrev
on this
mailing list to solicit code reviews.
Yes, I would like to get this change into illumos.
Once we decide that the patch is good I will post it here.
I can also try to use webrev, if that's preferable, but because I
haven't used
it before I may need some help with it.
We will also need to do at least some basic testing on illumos. I can
help with
that if you haven't already done it.
No, unfortunately I do not have an illumos development environment.
So I will certainly appreciate the help.
I took a quick look at the patch and it looks good, I just saw one
In dmu_object_reclaim(), you've changed the dmu_free_long_range() to a
dnode_free_range(). This will cause a problem is the object is very
large. I
think we'll need to do the dmu_free_long_range() from
restore_object(), before
creating the transaction.
I was also unsure about dmu_free_long_range vs dnode_free_range as I
expected
that there was a good reason to use the former in the first place.
On the other hand, I noticed that dmu_object_free uses dnode_free_range
and it
is called from lots of places, so I decided so it was not that horrible.
But
certainly dmu_free_long_range is preferable for performance.
The cases that use dmu_object_free() are either relatively small objects
from the sync thread (e.g. bpobj_free()), where we don't really have a
choice anyway; or small objects from the ZPL (e.g. zfs_znode_delete, where
it should have already been free_long_range()'d). The one exception would
be zap_destroy() -- if you had a directory with millions of entries,
dmu_object_free() would be called with a large object.
restore_object() could easily be overwriting an arbitrarily large object,
which would cause substantial performance problems without the use of
dmu_free_long_range(). e.g. the txg sync could take arbitrarily long.
I've just looked again at the dmu_object_reclaim code and it seems that
splitting off of the dmu_free_long_range code requires either inlining a
good
chunk of dmu_object_reclaim into restore_object or creating a new
function that
would duplicate a bit of code from dmu_object_reclaim. I mean the
calculations
and checks that precede dmu_free_long_range call.
I don't think you'd have to replicate all 30 lines of code from
dmu_object_reclaim(). You would just need to check if the reclaim() is
really needed, based if the types, block sizes, bonus type and bonus lengths
all match. And we can get this from dmu_object_info(). Essentially,
restore_object() will figure out if this is indeed a different generation of
this object #, and if so call dmu_free_long_range() and then
dmu_object_reclaim().
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
But you don't need to to the other stuff, like the nblkptr ||datablksz
check, which is just an unnecessary optimization. The end result might
actually be less code as dmu_object_reclaim() won't need to compute nblkptr
and do the free_long_range() anymore.
--matt