A zfs receive that failed on SmartOS-20130822 succeeded on SmartOS-20130808

Discussion:

Nahum Shalman

2013-08-27 16:34:05 UTC

With apologies that I don't have time to hunt this down myself, here's
what I've got.

"zfs send -R <pool>@<snap>" sent from an old Nexenta Core system.
"zfs receive -F <pool>" on the receiving SmartOS machine (running
20130822) would error out with
"internal error: File too large"

(Rolling the machine back to the 20130808 release made the problem go away.)

I managed to isolate it down to a particular snapshot that would trigger
the error. I could clear up the receiving side and resend that single
snapshot to be received and trigger the error reliably.

from 'dtrace -n sdt:zfs:: -p `pgrep zfs`' the last 3 errors were

0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9

0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold_level+0x2a
zfs`dmu_tx_check_ioerr+0x56
zfs`dmu_tx_count_write+0x395
zfs`dmu_tx_hold_write+0x5a
zfs`zfs_write+0x3bb
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9

0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9

At Rich Lowe's suggestion I got this stack trace as well:

# dtrace -n 'sdt:::set-error/ arg0 == EFBIG / { stack(); }' -p `pgrep zfs`
dtrace: description 'sdt:::set-error' matched 388 probes
CPU ID FUNCTION:NAME
2 4277 dmu_tx_count_write:set-error
zfs`dmu_tx_hold_free+0xbc
zfs`dmu_free_long_range_impl+0x92
zfs`dmu_free_long_range+0x53
zfs`restore_free+0x73
zfs`dmu_recv_stream+0x54f
zfs`zfs_ioc_recv+0x1e7
zfs`zfsdev_ioctl+0x4a7
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x149

dtrace: pid 5885 has exited

The snapshot is of home directory data, so I can't publish it.
I also have the core files dropped by the "zfs receive" command.
I can try to find a way to provide access to those files to individuals
if necessary.

-Nahum

Richard Lowe

2013-08-27 16:42:03 UTC

Permalink

While SmartOS is missing a tag for the working side of this, the zfs
changes in this range look to be:

713d6c2 4047 panic from dbuf_free_range() from dmu_free_object()
while doing zfs
a7027df 3996 want a libzfs_core API to rollback to latest snapshot
18e6497 4039 zfs_rename()/zfs_link() needs stronger test for XDEV

And possibly:

b4952e1 3956 ::vdev -r should work with pipelines 3957 ztest should
update the c
be9000c 3955 ztest failure: assertion
refcount_count(&tx->tx_space_written) + de
2c1e2b4 3949 ztest fault injection should avoid resilvering devices
3950 ztest:

Christopher Siden

2013-08-27 18:01:25 UTC

Permalink

We hit something similar at Delphix, I filed Matt's analysis as
https://www.illumos.org/issues/4082 which was a regression caused by
https://www.illumos.org/issues/4047. We had a few more changes in this
area in our repo between 4047 and 4082 so I'll need to make sure the
fix still applies, but if it does I'll put out a review.

Chris

Post by Richard Lowe
While SmartOS is missing a tag for the working side of this, the zfs
713d6c2 4047 panic from dbuf_free_range() from dmu_free_object()
while doing zfs
a7027df 3996 want a libzfs_core API to rollback to latest snapshot
18e6497 4039 zfs_rename()/zfs_link() needs stronger test for XDEV
b4952e1 3956 ::vdev -r should work with pipelines 3957 ztest should
update the c
be9000c 3955 ztest failure: assertion
refcount_count(&tx->tx_space_written) + de
2c1e2b4 3949 ztest fault injection should avoid resilvering devices
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182191/21639088-b97104e3
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

George Wilson

2013-08-27 18:14:41 UTC

Permalink

We have a fix for this and I will work with Chris to get this pushed
upstream.

Thanks,
George

Post by Nahum Shalman
With apologies that I don't have time to hunt this down myself, here's
what I've got.
"zfs receive -F <pool>" on the receiving SmartOS machine (running
20130822) would error out with
"internal error: File too large"
(Rolling the machine back to the 20130808 release made the problem go away.)
I managed to isolate it down to a particular snapshot that would
trigger the error. I could clear up the receiving side and resend that
single snapshot to be received and trigger the error reliably.
from 'dtrace -n sdt:zfs:: -p `pgrep zfs`' the last 3 errors were
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold_level+0x2a
zfs`dmu_tx_check_ioerr+0x56
zfs`dmu_tx_count_write+0x395
zfs`dmu_tx_hold_write+0x5a
zfs`zfs_write+0x3bb
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
# dtrace -n 'sdt:::set-error/ arg0 == EFBIG / { stack(); }' -p `pgrep zfs`
dtrace: description 'sdt:::set-error' matched 388 probes
CPU ID FUNCTION:NAME
2 4277 dmu_tx_count_write:set-error
zfs`dmu_tx_hold_free+0xbc
zfs`dmu_free_long_range_impl+0x92
zfs`dmu_free_long_range+0x53
zfs`restore_free+0x73
zfs`dmu_recv_stream+0x54f
zfs`zfs_ioc_recv+0x1e7
zfs`zfsdev_ioctl+0x4a7
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x149
dtrace: pid 5885 has exited
The snapshot is of home directory data, so I can't publish it.
I also have the core files dropped by the "zfs receive" command.
I can try to find a way to provide access to those files to
individuals if necessary.
-Nahum
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/22008002-303f2ff4
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

Christopher Siden

2013-08-30 19:10:00 UTC

Permalink

This should be fixed by:

commit 5253393b09789ec67bec153b866d7285a1cf1645
Author: Matthew Ahrens <***@delphix.com>
Date: Fri Aug 30 02:19:35 2013

4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <***@delphix.com>
Reviewed by: Christopher Siden <***@delphix.com>
Reviewed by: George Wilson <***@delphix.com>
Approved by: Richard Lowe <***@richlowe.net>

Chris

On Tue, Aug 27, 2013 at 11:14 AM, George Wilson

Post by George Wilson
We have a fix for this and I will work with Chris to get this pushed
upstream.
Thanks,
George

Post by Nahum Shalman
With apologies that I don't have time to hunt this down myself, here's
what I've got.
"zfs receive -F <pool>" on the receiving SmartOS machine (running
20130822) would error out with
"internal error: File too large"
(Rolling the machine back to the 20130808 release made the problem go away.)
I managed to isolate it down to a particular snapshot that would trigger
the error. I could clear up the receiving side and resend that single
snapshot to be received and trigger the error reliably.
from 'dtrace -n sdt:zfs:: -p `pgrep zfs`' the last 3 errors were
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold_level+0x2a
zfs`dmu_tx_check_ioerr+0x56
zfs`dmu_tx_count_write+0x395
zfs`dmu_tx_hold_write+0x5a
zfs`zfs_write+0x3bb
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
0 4326 dbuf_findbp:set-error
zfs`dbuf_hold_impl+0xe1
zfs`dbuf_hold+0x27
zfs`dmu_buf_hold_array_by_dnode+0x15c
zfs`dmu_write_uio_dnode+0x5f
zfs`dmu_write_uio_dbuf+0x54
zfs`zfs_write+0xc54
genunix`fop_write+0x8b
genunix`vn_rdwr+0x27a
genunix`core_write+0xaf
elfexec`copy_scn+0xac
elfexec`process_scns+0x941
elfexec`elf32core+0x7cc
genunix`do_core+0x18a
genunix`dump_one_core+0x139
genunix`core+0x360
genunix`psig+0x594
genunix`post_syscall+0x82d
genunix`syscall_exit+0x68
unix`0xfffffffffb800ed9
# dtrace -n 'sdt:::set-error/ arg0 == EFBIG / { stack(); }' -p `pgrep zfs`
dtrace: description 'sdt:::set-error' matched 388 probes
CPU ID FUNCTION:NAME
2 4277 dmu_tx_count_write:set-error
zfs`dmu_tx_hold_free+0xbc
zfs`dmu_free_long_range_impl+0x92
zfs`dmu_free_long_range+0x53
zfs`restore_free+0x73
zfs`dmu_recv_stream+0x54f
zfs`zfs_ioc_recv+0x1e7
zfs`zfsdev_ioctl+0x4a7
genunix`cdev_ioctl+0x39
specfs`spec_ioctl+0x60
genunix`fop_ioctl+0x55
genunix`ioctl+0x9b
unix`_sys_sysenter_post_swapgs+0x149
dtrace: pid 5885 has exited
The snapshot is of home directory data, so I can't publish it.
I also have the core files dropped by the "zfs receive" command.
I can try to find a way to provide access to those files to individuals if
necessary.
-Nahum
-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/22008002-303f2ff4
Modify Your Subscription: https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com

-------------------------------------------
illumos-zfs
Archives: https://www.listbox.com/member/archive/182191/=now
https://www.listbox.com/member/archive/rss/182191/21639088-b97104e3
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com