Discussion:
zfs lock contention - total system melt down - help?
jason matthews via illumos-zfs
2014-05-08 19:17:01 UTC
Permalink
Yesterday I experienced a meltdown on six systems. Kernel CPU utilization shot up to 99%. The hotkernel DTrace script pointed its finger at unix`mutex_delay_default` at 69.9%:

unix`atomic_add_64 137446 1.7%
unix`mutex_vector_enter 204722 2.5%
unix`mutex_enter 219500 2.6%
unix`default_lock_delay 1167331 14.0%
unix`mutex_delay_default 5804652 69.9%

That prompted me to run lockstat; its output is below.
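For reference, the profile and the lockstat data were gathered roughly along these lines (a sketch; the actual hotkernel script from the DTraceToolkit does the symbol aggregation and percentages):

# on-CPU kernel profile, roughly what hotkernel reports
dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); } tick-30s { trunc(@, 10); exit(0); }'

# lock contention summary, top 20 entries per event class
lockstat -D 20 sleep 10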


The systems are Dell R820s with 4x 4650L CPUs and 576 GB of RAM. The underlying storage subsystem comprises two LSI 9207-8i cards, each connected to four DC S3700 800GB drives. The drives are mirrored across the controllers on a passive midplane (direct attach) to form one zpool of four mirrors. Each system is on OI151a9.

Disk utilization on the pool, in terms of GB, was about 50% according to 'zfs list data', and there were a few hundred snapshots consuming about 300-400GB.
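For the record, the space numbers come from commands along these lines (a sketch):

zpool list data                                                 # pool capacity
zfs list -o space data                                          # dataset vs. snapshot usage breakdown
zfs list -t snapshot -o name,used -s used -r data | tail -20    # largest snapshots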

Each physical system runs four zones whose primary job is running postgres and nginx. Nginx simply logs requests to a log file, which is picked up every minute by a process that feeds the data into postgres. There are ancillary services that transfer bzip2'd versions of the log file data to Hadoop, which resides elsewhere.

We reached a tipping point yesterday when a large batch import, about 30 million records, was started on each of the four instances on all sixteen machines. Five of the sixteen physical servers virtually stopped responding for minutes at a time. A sixth server was also intermittently non-responsive but managed to at least perform some work. The other ten operated as expected.

A week ago I pulled a physical system out of service for intermittent lockups where even the console wouldn't respond. On one occasion it corrected itself, and on another I had to power-cycle it after an hour of unresponsiveness. At the time I thought it was a hardware issue, but now I think it is a lock contention issue.

In yesterday's event, when a system would respond, it showed no disk activity - not just a little, actually none: iostat -nMxCz 1 printed no data under the output headers. This absence of I/O could go on for up to three minutes at a time, followed by giant bursts of 1.4-1.6GB/s for several seconds in a row.

What seemed (anecdotally) to help the problem was removing snapshots and lowering the database insertion load on the system. It took about four hours to talk the systems off the ledge.

Is there someone on the team who can work with me to help analyze and correct this problem? Although I haven't tried, I suspect I can recreate the lock contention.

Anyone? :-)

j.


And as promised, here is the lockstat output:


Adaptive mutex spin: 5019081 events in 10.055 seconds (499161 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2215745 44% 44% 0.00 123925 0xffffff8e91253518 dsl_pool_need_dirty_delay+0x4d
1061877 21% 65% 0.00 43987 0xffffff8e94025a30 spa_config_enter+0x57
349523 7% 72% 0.00 37842 0xffffff8e94025a30 spa_config_exit+0x46
245812 5% 77% 0.00 7968 ARC_mfu+0x58 remove_reference+0x5d
222976 4% 82% 0.00 8744 ARC_mfu+0x58 add_reference+0x79
207838 4% 86% 0.00 131256 0xffffff8e91253518 dmu_tx_wait+0xef
56852 1% 87% 0.00 33106 0xffffff8e91253518 dsl_pool_undirty_space+0x3d
36302 1% 88% 0.00 5485 0xffffff8e94025930 vdev_queue_io_add+0x45
36176 1% 88% 0.00 5630 0xffffff8e94025930 vdev_queue_io_remove+0x45
23650 0% 89% 0.00 293578 0xffffffffba0a8a00 metaslab_group_alloc+0x220
18781 0% 89% 0.00 8441 0xfffffff944d46850 dsl_dir_willuse_space+0x35
18041 0% 90% 0.00 172577 0xffffffffba09cb00 metaslab_group_alloc+0x220
17297 0% 90% 0.00 4377 0xffffff8e91253640 txg_list_add+0x34
16301 0% 90% 0.00 12313 0xfffffff944d46850 dsl_dir_diduse_space+0x69
14572 0% 90% 0.00 5333 0xffffff8e94025930 vdev_queue_pending_add+0x4c
13957 0% 91% 0.00 226460 0xffffffffba0eac80 metaslab_group_alloc+0x220
13790 0% 91% 0.00 340196 0xffffff8e9394cb80 metaslab_group_alloc+0x220
13530 0% 91% 0.00 6607 0xffffff8e7e919850 dsl_dir_willuse_space+0x35
12164 0% 92% 0.00 85681 0xfffffff41d11b8b0 dmu_zfetch_find+0x17a
11587 0% 92% 0.00 8122 0xffffff8e7e919850 dsl_dir_diduse_space+0x69
-------------------------------------------------------------------------------

Adaptive mutex block: 12210 events in 10.055 seconds (1214 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
1959 16% 16% 0.00 285494409 0xffffff8e91253518 dsl_pool_need_dirty_delay+0x4d
1652 14% 30% 0.00 191089940 0xffffff8e94025a30 spa_config_enter+0x57
631 5% 35% 0.00 5477598 ARC_mfu+0x58 add_reference+0x79
604 5% 40% 0.00 39949693 ARC_mfu+0x58 remove_reference+0x5d
376 3% 43% 0.00 14685155 0xfffffff41d11b8b0 dmu_zfetch_find+0x17a
349 3% 46% 0.00 4636179 0xffffffffba0a8a00 metaslab_group_alloc+0x220
316 3% 48% 0.00 1145019 0xffffff8e91253518 dsl_pool_undirty_space+0x3d
312 3% 51% 0.00 8863420 0xffffff8e8292e400 metaslab_group_alloc+0x220
296 2% 53% 0.00 2680136 0xffffff8e4068ee30 vdev_queue_io+0x8a
256 2% 55% 0.00 2483097 0xffffff8e9394cb80 metaslab_group_alloc+0x220
249 2% 57% 0.00 2677796 0xffffff8e8425f630 vdev_queue_io+0x8a
222 2% 59% 0.00 229745532 0xffffff8e94025a30 spa_config_exit+0x46
205 2% 61% 0.00 2102559 0xffffffffba106200 metaslab_group_alloc+0x220
194 2% 62% 0.00 1933110 0xffffff8e40686630 vdev_queue_io+0x8a
189 2% 64% 0.00 2328081 0xffffffffba0eac80 metaslab_group_alloc+0x220
170 1% 65% 0.00 1902491 0xffffffffba09cb00 metaslab_group_alloc+0x220
150 1% 67% 0.00 1666618 0xffffff8e404d2630 vdev_queue_io+0x8a
144 1% 68% 0.00 1409986 0xffffff8e898fde30 vdev_queue_io+0x8a
142 1% 69% 0.00 915900 0xffffff8e7e919850 dsl_dir_diduse_space+0x69
141 1% 70% 0.00 1033689 0xffffff8e7e919850 dsl_dir_tempreserve_impl+0x72
-------------------------------------------------------------------------------

Spin lock spin: 52068 events in 10.055 seconds (5178 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
14204 27% 27% 0.00 29508 turnstile_table+0xe08 disp_lock_enter+0x2e
7483 14% 42% 0.00 13439 turnstile_table+0xb98 disp_lock_enter+0x2e
1355 3% 44% 0.00 4493 turnstile_table+0x8f8 disp_lock_enter+0x2e
1186 2% 47% 0.00 23117 turnstile_table+0x818 disp_lock_enter+0x2e
1024 2% 48% 0.00 2650 0xffffff8e3e6007c8 disp_lock_enter+0x2e
996 2% 50% 0.00 7511 turnstile_table+0xb78 disp_lock_enter+0x2e
937 2% 52% 0.00 95362 turnstile_table+0xc58 disp_lock_enter+0x2e
854 2% 54% 0.00 112245 turnstile_table+0xf28 disp_lock_enter+0x2e
816 2% 55% 0.00 3426 0xffffff8e3e5ce750 disp_lock_enter+0x2e
716 1% 57% 0.00 4600 0xffffff8e3e5cef30 disp_lock_enter+0x2e
663 1% 58% 0.00 3041 0xffffff8e3e600f38 disp_lock_enter+0x2e
641 1% 59% 0.00 12592 turnstile_table+0xfb8 disp_lock_enter+0x2e
593 1% 60% 0.00 4950 0xffffff8e3e6e9098 disp_lock_enter+0x2e
589 1% 62% 0.00 3170 0xffffff8e3e73d148 disp_lock_enter+0x2e
525 1% 63% 0.00 3677 0xffffff8e3e437f08 disp_lock_enter+0x2e
510 1% 64% 0.00 37573 turnstile_table+0xc88 disp_lock_enter+0x2e
478 1% 64% 0.00 3558 0xffffff8e3e73d8b8 disp_lock_enter+0x2e
450 1% 65% 0.00 5147 0xffffff8e3e5ceb40 disp_lock_enter+0x2e
450 1% 66% 0.00 4708 0xffffff8e3e5ce360 disp_lock_enter+0x2e
448 1% 67% 0.00 3556 0xffffff8e3e8029e0 disp_lock_enter+0x2e
-------------------------------------------------------------------------------

Thread lock spin: 521 events in 10.055 seconds (52 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
41 8% 8% 0.00 3252 sleepq_head+0x4a88 sysdc_update+0x123
34 7% 14% 0.00 24940 transition_lock ts_tick+0x29
30 6% 20% 0.00 4276 turnstile_table+0xe08 ts_tick+0x29
27 5% 25% 0.00 4115 sleepq_head+0x45a8 sysdc_update+0x123
18 3% 29% 0.00 4308 turnstile_table+0xc58 sysdc_update+0x123
16 3% 32% 0.00 4988 transition_lock sysdc_tick+0x1c
6 1% 33% 0.00 16062 turnstile_table+0xf68 sysdc_update+0x123
6 1% 34% 0.00 463 shuttle_lock ts_update_list+0x54
6 1% 35% 0.00 912 0xffffff8e3e7b0930 lwp_continue+0x35
6 1% 36% 0.00 1194 0xffffff8e3e6007c8 lwp_continue+0x35
6 1% 38% 0.00 1966 0xffffff8e3e5ce750 lwp_continue+0x35
5 1% 39% 0.00 3903 turnstile_table+0xf28 sysdc_update+0x123
4 1% 39% 0.00 3337 turnstile_table+0xfb8 turnstile_block+0x354
4 1% 40% 0.00 5077 turnstile_table+0xfb8 sysdc_update+0x123
4 1% 41% 0.00 5732 turnstile_table+0xb98 ts_tick+0x29
4 1% 42% 0.00 9637 turnstile_table+0x9f8 sysdc_update+0x123
4 1% 42% 0.00 3384 turnstile_table+0x878 sysdc_tick+0x1c
4 1% 43% 0.00 1196 0xffffff8e3e6e9920 lwp_continue+0x35
4 1% 44% 0.00 952 0xffffff8e3e4ec3c0 lwp_continue+0x35
3 1% 45% 0.00 3389 sleepq_head+0x45a8 sysdc_tick+0x1c
-------------------------------------------------------------------------------

R/W writer blocked by writer: 2 events in 10.055 seconds (0 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 100% 100% 0.00 1433015 0xffffffd6ab27b3d0 dce_lookup_and_add_v4+0x80
-------------------------------------------------------------------------------

R/W writer blocked by readers: 3 events in 10.055 seconds (0 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 67% 67% 0.00 418666 0xffffffd72d753fc0 dmu_zfetch+0x1eb
1 33% 100% 0.00 20446 0xfffffffad83332f0 dmu_zfetch+0x1eb
-------------------------------------------------------------------------------

R/W reader blocked by writer: 3022 events in 10.055 seconds (301 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
1323 44% 44% 0.00 5210993 0xffffffd72d753fc0 dmu_zfetch_find+0x4b
1117 37% 81% 0.00 4044542 0xffffffcd3d7f8fc0 dmu_zfetch_find+0x4b
517 17% 98% 0.00 3823541 0xfffffffda8ea9eb0 dmu_zfetch_find+0x4b
36 1% 99% 0.00 186619 0xffffffff6f78a640 dmu_zfetch_find+0x4b
10 0% 99% 0.00 8816 0xffffffcd1c997e70 dmu_zfetch_find+0x4b
9 0% 100% 0.00 47399 0xfffffffad83332f0 dmu_zfetch_find+0x4b
9 0% 100% 0.00 23308 0xffffffd6d9689f38 dmu_zfetch_find+0x4b
1 0% 100% 0.00 35777 0xffffffd719dd9c08 dmu_zfetch_find+0x4b
-------------------------------------------------------------------------------

R/W reader blocked by write wanted: 3 events in 10.055 seconds (0 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 67% 67% 0.00 9466038 0xffffffd72d753fc0 dmu_zfetch_find+0x4b
1 33% 100% 0.00 29600 0xfffffffad83332f0 dmu_zfetch_find+0x4b
-------------------------------------------------------------------------------


This is typical of what our write pattern looks like during and after the five-second flush:

extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
369.0 7224.3 2.2 484.1 0.1 2.1 0.0 0.3 0 204 c3
54.0 835.9 0.2 57.4 0.0 0.3 0.0 0.3 1 25 c3t50015178F36592C1d0
47.0 711.9 0.2 54.1 0.0 0.2 0.0 0.3 1 24 c3t50015178F3659379d0
34.0 1357.9 0.3 74.5 0.0 0.3 0.0 0.2 2 33 c3t55CD2E404B43F086d0
32.0 729.9 0.2 56.4 0.0 0.2 0.0 0.3 1 23 c3t55CD2E404B43EFB2d0
73.0 826.9 0.4 57.4 0.0 0.3 0.0 0.3 1 27 c3t55CD2E404B479EFCd0
48.0 724.9 0.5 54.1 0.0 0.2 0.0 0.3 1 22 c3t55CD2E404B4833ABd0
45.0 1311.9 0.2 72.9 0.0 0.3 0.0 0.2 2 27 c3t55CD2E404B4833CFd0
36.0 724.9 0.2 57.3 0.0 0.3 0.0 0.3 1 24 c3t55CD2E404B48342Ad0
369.0 7162.3 2.2 484.3 1902.2 2.6 252.6 0.3 45 49 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
546.9 136.0 2.2 1.2 0.0 0.1 0.0 0.1 0 8 c3
69.0 18.0 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
44.0 16.0 0.2 0.1 0.0 0.0 0.0 0.2 0 1 c3t50015178F3659379d0
56.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43F086d0
39.0 18.0 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
103.0 18.0 0.5 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
142.0 16.0 0.3 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
62.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
32.0 18.0 0.1 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B48342Ad0
546.9 68.0 2.2 1.2 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
648.2 148.0 2.6 1.5 0.0 0.1 0.0 0.1 0 9 c3
58.0 18.0 0.2 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
53.0 20.0 0.3 0.2 0.0 0.0 0.0 0.2 0 1 c3t50015178F3659379d0
103.0 20.0 0.3 0.2 0.0 0.0 0.0 0.2 0 1 c3t55CD2E404B43F086d0
53.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
89.0 18.0 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
137.0 20.0 0.6 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B4833ABd0
113.0 20.0 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
42.0 16.0 0.3 0.1 0.0 0.0 0.0 0.2 0 1 c3t55CD2E404B48342Ad0
648.2 74.0 2.6 1.5 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
488.0 2186.1 1.9 128.3 0.0 0.4 0.0 0.2 0 41 c3
54.0 272.0 0.3 16.2 0.0 0.1 0.0 0.2 0 5 c3t50015178F36592C1d0
28.0 271.0 0.1 16.5 0.0 0.0 0.0 0.2 0 5 c3t50015178F3659379d0
52.0 272.0 0.3 15.8 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B43F086d0
45.0 278.0 0.2 15.6 0.0 0.0 0.0 0.1 0 5 c3t55CD2E404B43EFB2d0
54.0 272.0 0.3 16.2 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B479EFCd0
155.0 271.0 0.3 16.5 0.0 0.1 0.0 0.1 0 6 c3t55CD2E404B4833ABd0
46.0 272.0 0.2 15.8 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B4833CFd0
54.0 278.0 0.2 15.6 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B48342Ad0
488.0 1378.1 1.9 128.3 0.0 0.5 0.0 0.3 0 22 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
841.6 127.3 4.5 1.4 0.0 0.1 0.0 0.1 0 12 c3
89.5 17.9 0.6 0.2 0.0 0.0 0.0 0.2 0 2 c3t50015178F36592C1d0
91.5 17.9 0.7 0.2 0.0 0.0 0.0 0.2 0 2 c3t50015178F3659379d0
132.3 15.9 0.5 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B43F086d0
107.4 11.9 0.6 0.1 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B43EFB2d0
104.5 17.9 0.5 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B479EFCd0
114.4 17.9 0.7 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
123.4 15.9 0.6 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
78.6 11.9 0.5 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B48342Ad0
841.6 63.7 4.5 1.4 0.0 0.2 0.0 0.2 0 11 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
559.9 128.7 2.8 1.5 0.0 0.1 0.0 0.1 0 9 c3
67.4 16.1 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
41.2 12.1 0.1 0.1 0.0 0.0 0.0 0.1 0 1 c3t50015178F3659379d0
110.6 16.1 0.3 0.2 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B43F086d0
67.4 20.1 0.5 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
74.4 16.1 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
37.2 12.1 0.3 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
46.2 16.1 0.2 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
115.6 20.1 0.7 0.2 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B48342Ad0
559.9 64.3 2.8 1.5 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
1834.4 16207.9 14.7 782.2 0.2 4.3 0.0 0.2 0 346 c3
251.8 1591.6 2.2 94.5 0.0 0.5 0.0 0.3 2 39 c3t50015178F36592C1d0
208.8 1372.8 1.7 94.2 0.0 0.5 0.0 0.3 2 44 c3t50015178F3659379d0
258.8 2668.7 2.7 108.1 0.0 0.6 0.0 0.2 3 48 c3t55CD2E404B43F086d0
232.8 2453.9 1.7 94.4 0.0 0.6 0.0 0.2 3 46 c3t55CD2E404B43EFB2d0
220.8 1608.6 1.6 94.5 0.0 0.5 0.0 0.3 2 39 c3t55CD2E404B479EFCd0
221.8 1378.8 1.5 94.2 0.0 0.5 0.0 0.3 2 39 c3t55CD2E404B4833ABd0
198.8 2685.7 1.6 108.1 0.0 0.7 0.0 0.2 3 50 c3t55CD2E404B4833CFd0
240.8 2447.9 1.6 94.4 0.0 0.5 0.0 0.2 3 41 c3t55CD2E404B48342Ad0
1834.4 15620.4 14.7 782.2 7044.4 5.2 403.6 0.3 67 77 data
Richard Elling via illumos-zfs
2014-05-09 01:09:40 UTC
Permalink
Post by jason matthews via illumos-zfs
Yesterday I experienced a meltdown on six systems. Kernel CPU utilization shot up to 99%. The hotkernel DTrace script pointed its finger at unix`mutex_delay_default` at 69.9%:
unix`atomic_add_64 137446 1.7%
unix`mutex_vector_enter 204722 2.5%
unix`mutex_enter 219500 2.6%
unix`default_lock_delay 1167331 14.0%
unix`mutex_delay_default 5804652 69.9%
That prompted me to run lockstat; its output is below.
The systems are Dell R820s with 4x 4650L CPUs and 576 GB of RAM. The underlying storage subsystem comprises two LSI 9207-8i cards, each connected to four DC S3700 800GB drives. The drives are mirrored across the controllers on a passive midplane (direct attach) to form one zpool of four mirrors. Each system is on OI151a9.
Disk utilization on the pool, in terms of GB, was about 50% according to 'zfs list data', and there were a few hundred snapshots consuming about 300-400GB.
Each physical system runs four zones whose primary job is running postgres and nginx. Nginx simply logs requests to a log file, which is picked up every minute by a process that feeds the data into postgres. There are ancillary services that transfer bzip2'd versions of the log file data to Hadoop, which resides elsewhere.
We reached a tipping point yesterday when a large batch import, about 30 million records, was started on each of the four instances on all sixteen machines. Five of the sixteen physical servers virtually stopped responding for minutes at a time. A sixth server was also intermittently non-responsive but managed to at least perform some work. The other ten operated as expected.
A week ago I pulled a physical system out of service for intermittent lockups where even the console wouldn't respond. On one occasion it corrected itself, and on another I had to power-cycle it after an hour of unresponsiveness. At the time I thought it was a hardware issue, but now I think it is a lock contention issue.
In yesterday's event, when a system would respond, it showed no disk activity - not just a little, actually none: iostat -nMxCz 1 printed no data under the output headers. This absence of I/O could go on for up to three minutes at a time, followed by giant bursts of 1.4-1.6GB/s for several seconds in a row.
What seemed (anecdotally) to help the problem was removing snapshots and lowering the database insertion load on the system. It took about four hours to talk the systems off the ledge.
Shot in the dark, but I've seen very bizarre behaviour when free space gets really low (100% "full").
Some of the behaviours could be similar, such as frequent txg commits (many per second).
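If you want to see how often txgs are actually committing, something along these lines should show the cadence (a rough sketch, assuming the fbt entry probe for spa_sync and kernel CTF are available; it fires once per pool per txg, so rpool syncs will show up too):

dtrace -q -n 'fbt::spa_sync:entry { printf("%Y  txg %d\n", walltimestamp, args[1]); }'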
-- richard
--

***@RichardElling.com
+1-760-896-4422
Matthew Ahrens via illumos-zfs
2014-05-08 22:31:04 UTC
Permalink
On Thu, May 8, 2014 at 12:17 PM, jason matthews via illumos-zfs wrote:
Yesterday I experienced a meltdown on six systems. Kernel CPU utilization shot up to 99%. The hotkernel DTrace script pointed its finger at unix`mutex_delay_default` at 69.9%:
unix`atomic_add_64 137446 1.7%
unix`mutex_vector_enter 204722 2.5%
unix`mutex_enter 219500 2.6%
unix`default_lock_delay 1167331 14.0%
unix`mutex_delay_default 5804652 69.9%
That's definitely bad. What is the workload on the machine? What recordsize are you using (the default is 128K), and how many MB/s of reads and writes is the application doing to each machine? And what is the request size or number of requests per second from the application?
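For reference, something along these lines on the pool's dataset would show the recordsize and compression settings:

zfs get recordsize,compression,compressratio data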
Post by jason matthews via illumos-zfs
That prompted me to run lockstat; its output is below.
2215745 44% 44% 0.00 123925 0xffffff8e91253518 dsl_pool_need_dirty_delay+0x4d
I have tested up to ~150,000 IOPS (8K writes) with 16 cores, and seen plenty of other lock contention but not this, which makes it somewhat curious. Your iostat data looks like the average throughput to disk is < 200MB/s. You are seeing > 200,000 lock spins per second from dsl_pool_need_dirty_delay(). This routine should be called once per transaction (i.e. write request from the application), so we are writing at most 1KB to disk for each transaction (< 200MB/s divided by > 200,000 transactions/sec). Are you using a very small blocksize and/or getting very good compression? Or is the application doing very small writes? If possible, you might look at increasing the write size from the application to at least 128KB. You can use this to examine write size:
dtrace -n 'dmu_tx_hold_write:entry{@=quantize(args[3])}'
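A related sketch (assuming the fbt entry probe for dmu_tx_assign is available) to see how many transactions per second are being assigned:

dtrace -n 'fbt::dmu_tx_assign:entry { @ = count(); } tick-1sec { printa("tx assigns/sec: %@d\n", @); clear(@); }'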

In my high iops testing, even when most CPU is spinning in
mutex_delay_default(), the system was not totally unresponsive -- e.g.
iostat -xnpcz 1 still printed an update every second.

You can drill down on the hold times of this mutex by using a command like:
lockstat -xaggsize=10m -l 0xffffff8e91253518 -s 20 -H sleep 10

I wonder if there isn't something going on in addition to the lock
contention. E.g. memory utilization? Can you get the output of ::stacks,
::memstat, and ::kmastat while the system is hung (using kmdb if necessary)?
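If the box is responsive enough for a shell, something along these lines captures them (a sketch; kmdb would be the fallback if it isn't):

echo ::stacks  | mdb -k > /var/tmp/stacks.out
echo ::memstat | mdb -k > /var/tmp/memstat.out
echo ::kmastat | mdb -k > /var/tmp/kmastat.out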
Post by jason matthews via illumos-zfs
The systems are Dell R820s with 4x 4650L CPUs and 576 GB of RAM. The underlying storage subsystem comprises two LSI 9207-8i cards, each connected to four DC S3700 800GB drives. The drives are mirrored across the controllers on a passive midplane (direct attach) to form one zpool of four mirrors. Each system is on OI151a9.
Disk utilization on the pool, in terms of GB, was about 50% according to 'zfs list data', and there were a few hundred snapshots consuming about 300-400GB.
Each physical system runs four zones whose primary job is running postgres and nginx. Nginx simply logs requests to a log file, which is picked up every minute by a process that feeds the data into postgres. There are ancillary services that transfer bzip2'd versions of the log file data to Hadoop, which resides elsewhere.
We reached a tipping point yesterday when a large batch import, about 30 million records, was started on each of the four instances on all sixteen machines. Five of the sixteen physical servers virtually stopped responding for minutes at a time. A sixth server was also intermittently non-responsive but managed to at least perform some work. The other ten operated as expected.
A week ago I pulled a physical system out of service for intermittent lockups where even the console wouldn't respond. On one occasion it corrected itself, and on another I had to power-cycle it after an hour of unresponsiveness. At the time I thought it was a hardware issue, but now I think it is a lock contention issue.
In yesterday's event, when a system would respond, it showed no disk activity - not just a little, actually none: iostat -nMxCz 1 printed no data under the output headers. This absence of I/O could go on for up to three minutes at a time, followed by giant bursts of 1.4-1.6GB/s for several seconds in a row.
What seemed (anecdotally) to help the problem was removing snapshots and lowering the database insertion load on the system. It took about four hours to talk the systems off the ledge.
I'm surprised that freeing up space by removing snapshots would have an impact on lock contention, especially given that you had 50% free space. I suspect there may be no actual correlation between snapshot removal and the performance improvement for this workload.

Reducing the workload could definitely help things.

--matt
Post by jason matthews via illumos-zfs
Is there someone on the team who can work with me to help analyze and
correct this problem? Although I haven't tried, I suspect I can recreate
the lock contention.
Anyone? :-)
j.
And as promised, here is the lock stat out put
Adaptive mutex spin: 5019081 events in 10.055 seconds (499161 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2215745 44% 44% 0.00 123925 0xffffff8e91253518
dsl_pool_need_dirty_delay+0x4d
1061877 21% 65% 0.00 43987 0xffffff8e94025a30 spa_config_enter+0x57
Post by jason matthews via illumos-zfs
349523 7% 72% 0.00 37842 0xffffff8e94025a30 spa_config_exit+0x46
245812 5% 77% 0.00 7968 ARC_mfu+0x58 remove_reference+0x5d
222976 4% 82% 0.00 8744 ARC_mfu+0x58 add_reference+0x79
207838 4% 86% 0.00 131256 0xffffff8e91253518 dmu_tx_wait+0xef
56852 1% 87% 0.00 33106 0xffffff8e91253518
dsl_pool_undirty_space+0x3d
36302 1% 88% 0.00 5485 0xffffff8e94025930 vdev_queue_io_add+0x45
36176 1% 88% 0.00 5630 0xffffff8e94025930
vdev_queue_io_remove+0x45
23650 0% 89% 0.00 293578 0xffffffffba0a8a00
metaslab_group_alloc+0x220
18781 0% 89% 0.00 8441 0xfffffff944d46850
dsl_dir_willuse_space+0x35
18041 0% 90% 0.00 172577 0xffffffffba09cb00
metaslab_group_alloc+0x220
17297 0% 90% 0.00 4377 0xffffff8e91253640 txg_list_add+0x34
16301 0% 90% 0.00 12313 0xfffffff944d46850
dsl_dir_diduse_space+0x69
14572 0% 90% 0.00 5333 0xffffff8e94025930
vdev_queue_pending_add+0x4c
13957 0% 91% 0.00 226460 0xffffffffba0eac80
metaslab_group_alloc+0x220
13790 0% 91% 0.00 340196 0xffffff8e9394cb80
metaslab_group_alloc+0x220
13530 0% 91% 0.00 6607 0xffffff8e7e919850
dsl_dir_willuse_space+0x35
12164 0% 92% 0.00 85681 0xfffffff41d11b8b0 dmu_zfetch_find+0x17a
11587 0% 92% 0.00 8122 0xffffff8e7e919850
dsl_dir_diduse_space+0x69
-------------------------------------------------------------------------------
Adaptive mutex block: 12210 events in 10.055 seconds (1214 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
1959 16% 16% 0.00 285494409 0xffffff8e91253518
dsl_pool_need_dirty_delay+0x4d
1652 14% 30% 0.00 191089940 0xffffff8e94025a30 spa_config_enter+0x57
631 5% 35% 0.00 5477598 ARC_mfu+0x58 add_reference+0x79
604 5% 40% 0.00 39949693 ARC_mfu+0x58 remove_reference+0x5d
376 3% 43% 0.00 14685155 0xfffffff41d11b8b0 dmu_zfetch_find+0x17a
349 3% 46% 0.00 4636179 0xffffffffba0a8a00
metaslab_group_alloc+0x220
316 3% 48% 0.00 1145019 0xffffff8e91253518
dsl_pool_undirty_space+0x3d
312 3% 51% 0.00 8863420 0xffffff8e8292e400
metaslab_group_alloc+0x220
296 2% 53% 0.00 2680136 0xffffff8e4068ee30 vdev_queue_io+0x8a
256 2% 55% 0.00 2483097 0xffffff8e9394cb80
metaslab_group_alloc+0x220
249 2% 57% 0.00 2677796 0xffffff8e8425f630 vdev_queue_io+0x8a
222 2% 59% 0.00 229745532 0xffffff8e94025a30 spa_config_exit+0x46
205 2% 61% 0.00 2102559 0xffffffffba106200
metaslab_group_alloc+0x220
194 2% 62% 0.00 1933110 0xffffff8e40686630 vdev_queue_io+0x8a
189 2% 64% 0.00 2328081 0xffffffffba0eac80
metaslab_group_alloc+0x220
170 1% 65% 0.00 1902491 0xffffffffba09cb00
metaslab_group_alloc+0x220
150 1% 67% 0.00 1666618 0xffffff8e404d2630 vdev_queue_io+0x8a
144 1% 68% 0.00 1409986 0xffffff8e898fde30 vdev_queue_io+0x8a
142 1% 69% 0.00 915900 0xffffff8e7e919850
dsl_dir_diduse_space+0x69
141 1% 70% 0.00 1033689 0xffffff8e7e919850
dsl_dir_tempreserve_impl+0x72
-------------------------------------------------------------------------------
Spin lock spin: 52068 events in 10.055 seconds (5178 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
14204 27% 27% 0.00 29508 turnstile_table+0xe08 disp_lock_enter+0x2e
7483 14% 42% 0.00 13439 turnstile_table+0xb98 disp_lock_enter+0x2e
1355 3% 44% 0.00 4493 turnstile_table+0x8f8 disp_lock_enter+0x2e
1186 2% 47% 0.00 23117 turnstile_table+0x818 disp_lock_enter+0x2e
1024 2% 48% 0.00 2650 0xffffff8e3e6007c8 disp_lock_enter+0x2e
996 2% 50% 0.00 7511 turnstile_table+0xb78 disp_lock_enter+0x2e
937 2% 52% 0.00 95362 turnstile_table+0xc58 disp_lock_enter+0x2e
854 2% 54% 0.00 112245 turnstile_table+0xf28 disp_lock_enter+0x2e
816 2% 55% 0.00 3426 0xffffff8e3e5ce750 disp_lock_enter+0x2e
716 1% 57% 0.00 4600 0xffffff8e3e5cef30 disp_lock_enter+0x2e
663 1% 58% 0.00 3041 0xffffff8e3e600f38 disp_lock_enter+0x2e
641 1% 59% 0.00 12592 turnstile_table+0xfb8 disp_lock_enter+0x2e
593 1% 60% 0.00 4950 0xffffff8e3e6e9098 disp_lock_enter+0x2e
589 1% 62% 0.00 3170 0xffffff8e3e73d148 disp_lock_enter+0x2e
525 1% 63% 0.00 3677 0xffffff8e3e437f08 disp_lock_enter+0x2e
510 1% 64% 0.00 37573 turnstile_table+0xc88 disp_lock_enter+0x2e
478 1% 64% 0.00 3558 0xffffff8e3e73d8b8 disp_lock_enter+0x2e
450 1% 65% 0.00 5147 0xffffff8e3e5ceb40 disp_lock_enter+0x2e
450 1% 66% 0.00 4708 0xffffff8e3e5ce360 disp_lock_enter+0x2e
448 1% 67% 0.00 3556 0xffffff8e3e8029e0 disp_lock_enter+0x2e
-------------------------------------------------------------------------------
Thread lock spin: 521 events in 10.055 seconds (52 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
41 8% 8% 0.00 3252 sleepq_head+0x4a88 sysdc_update+0x123
34 7% 14% 0.00 24940 transition_lock ts_tick+0x29
30 6% 20% 0.00 4276 turnstile_table+0xe08 ts_tick+0x29
27 5% 25% 0.00 4115 sleepq_head+0x45a8 sysdc_update+0x123
18 3% 29% 0.00 4308 turnstile_table+0xc58 sysdc_update+0x123
16 3% 32% 0.00 4988 transition_lock sysdc_tick+0x1c
6 1% 33% 0.00 16062 turnstile_table+0xf68 sysdc_update+0x123
6 1% 34% 0.00 463 shuttle_lock ts_update_list+0x54
6 1% 35% 0.00 912 0xffffff8e3e7b0930 lwp_continue+0x35
6 1% 36% 0.00 1194 0xffffff8e3e6007c8 lwp_continue+0x35
6 1% 38% 0.00 1966 0xffffff8e3e5ce750 lwp_continue+0x35
5 1% 39% 0.00 3903 turnstile_table+0xf28 sysdc_update+0x123
4 1% 39% 0.00 3337 turnstile_table+0xfb8 turnstile_block+0x354
4 1% 40% 0.00 5077 turnstile_table+0xfb8 sysdc_update+0x123
4 1% 41% 0.00 5732 turnstile_table+0xb98 ts_tick+0x29
4 1% 42% 0.00 9637 turnstile_table+0x9f8 sysdc_update+0x123
4 1% 42% 0.00 3384 turnstile_table+0x878 sysdc_tick+0x1c
4 1% 43% 0.00 1196 0xffffff8e3e6e9920 lwp_continue+0x35
4 1% 44% 0.00 952 0xffffff8e3e4ec3c0 lwp_continue+0x35
3 1% 45% 0.00 3389 sleepq_head+0x45a8 sysdc_tick+0x1c
-------------------------------------------------------------------------------
R/W writer blocked by writer: 2 events in 10.055 seconds (0 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 100% 100% 0.00 1433015 0xffffffd6ab27b3d0 dce_lookup_and_add_v4+0x80
-------------------------------------------------------------------------------
R/W writer blocked by readers: 3 events in 10.055 seconds (0 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 67% 67% 0.00 418666 0xffffffd72d753fc0 dmu_zfetch+0x1eb
1 33% 100% 0.00 20446 0xfffffffad83332f0 dmu_zfetch+0x1eb
-------------------------------------------------------------------------------
R/W reader blocked by writer: 3022 events in 10.055 seconds (301 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
1323 44% 44% 0.00 5210993 0xffffffd72d753fc0 dmu_zfetch_find+0x4b
1117 37% 81% 0.00 4044542 0xffffffcd3d7f8fc0 dmu_zfetch_find+0x4b
517 17% 98% 0.00 3823541 0xfffffffda8ea9eb0 dmu_zfetch_find+0x4b
36 1% 99% 0.00 186619 0xffffffff6f78a640 dmu_zfetch_find+0x4b
10 0% 99% 0.00 8816 0xffffffcd1c997e70 dmu_zfetch_find+0x4b
9 0% 100% 0.00 47399 0xfffffffad83332f0 dmu_zfetch_find+0x4b
9 0% 100% 0.00 23308 0xffffffd6d9689f38 dmu_zfetch_find+0x4b
1 0% 100% 0.00 35777 0xffffffd719dd9c08 dmu_zfetch_find+0x4b
-------------------------------------------------------------------------------
R/W reader blocked by write wanted: 3 events in 10.055 seconds (0 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
2 67% 67% 0.00 9466038 0xffffffd72d753fc0 dmu_zfetch_find+0x4b
1 33% 100% 0.00 29600 0xfffffffad83332f0 dmu_zfetch_find+0x4b
-------------------------------------------------------------------------------
This is typical of what our write pattern looks like during and after the five-second flush (a dtrace sketch for timing the flushes follows the iostat output):
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
369.0 7224.3 2.2 484.1 0.1 2.1 0.0 0.3 0 204 c3
54.0 835.9 0.2 57.4 0.0 0.3 0.0 0.3 1 25 c3t50015178F36592C1d0
47.0 711.9 0.2 54.1 0.0 0.2 0.0 0.3 1 24 c3t50015178F3659379d0
34.0 1357.9 0.3 74.5 0.0 0.3 0.0 0.2 2 33 c3t55CD2E404B43F086d0
32.0 729.9 0.2 56.4 0.0 0.2 0.0 0.3 1 23 c3t55CD2E404B43EFB2d0
73.0 826.9 0.4 57.4 0.0 0.3 0.0 0.3 1 27 c3t55CD2E404B479EFCd0
48.0 724.9 0.5 54.1 0.0 0.2 0.0 0.3 1 22 c3t55CD2E404B4833ABd0
45.0 1311.9 0.2 72.9 0.0 0.3 0.0 0.2 2 27 c3t55CD2E404B4833CFd0
36.0 724.9 0.2 57.3 0.0 0.3 0.0 0.3 1 24 c3t55CD2E404B48342Ad0
369.0 7162.3 2.2 484.3 1902.2 2.6 252.6 0.3 45 49 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
546.9 136.0 2.2 1.2 0.0 0.1 0.0 0.1 0 8 c3
69.0 18.0 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
44.0 16.0 0.2 0.1 0.0 0.0 0.0 0.2 0 1 c3t50015178F3659379d0
56.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43F086d0
39.0 18.0 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
103.0 18.0 0.5 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
142.0 16.0 0.3 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
62.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
32.0 18.0 0.1 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B48342Ad0
546.9 68.0 2.2 1.2 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
648.2 148.0 2.6 1.5 0.0 0.1 0.0 0.1 0 9 c3
58.0 18.0 0.2 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
53.0 20.0 0.3 0.2 0.0 0.0 0.0 0.2 0 1 c3t50015178F3659379d0
103.0 20.0 0.3 0.2 0.0 0.0 0.0 0.2 0 1 c3t55CD2E404B43F086d0
53.0 16.0 0.2 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
89.0 18.0 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
137.0 20.0 0.6 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B4833ABd0
113.0 20.0 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
42.0 16.0 0.3 0.1 0.0 0.0 0.0 0.2 0 1 c3t55CD2E404B48342Ad0
648.2 74.0 2.6 1.5 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
488.0 2186.1 1.9 128.3 0.0 0.4 0.0 0.2 0 41 c3
54.0 272.0 0.3 16.2 0.0 0.1 0.0 0.2 0 5 c3t50015178F36592C1d0
28.0 271.0 0.1 16.5 0.0 0.0 0.0 0.2 0 5 c3t50015178F3659379d0
52.0 272.0 0.3 15.8 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B43F086d0
45.0 278.0 0.2 15.6 0.0 0.0 0.0 0.1 0 5 c3t55CD2E404B43EFB2d0
54.0 272.0 0.3 16.2 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B479EFCd0
155.0 271.0 0.3 16.5 0.0 0.1 0.0 0.1 0 6 c3t55CD2E404B4833ABd0
46.0 272.0 0.2 15.8 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B4833CFd0
54.0 278.0 0.2 15.6 0.0 0.1 0.0 0.2 0 5 c3t55CD2E404B48342Ad0
488.0 1378.1 1.9 128.3 0.0 0.5 0.0 0.3 0 22 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
841.6 127.3 4.5 1.4 0.0 0.1 0.0 0.1 0 12 c3
89.5 17.9 0.6 0.2 0.0 0.0 0.0 0.2 0 2 c3t50015178F36592C1d0
91.5 17.9 0.7 0.2 0.0 0.0 0.0 0.2 0 2 c3t50015178F3659379d0
132.3 15.9 0.5 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B43F086d0
107.4 11.9 0.6 0.1 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B43EFB2d0
104.5 17.9 0.5 0.2 0.0 0.0 0.0 0.1 0 2 c3t55CD2E404B479EFCd0
114.4 17.9 0.7 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
123.4 15.9 0.6 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
78.6 11.9 0.5 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B48342Ad0
841.6 63.7 4.5 1.4 0.0 0.2 0.0 0.2 0 11 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
559.9 128.7 2.8 1.5 0.0 0.1 0.0 0.1 0 9 c3
67.4 16.1 0.3 0.2 0.0 0.0 0.0 0.1 0 1 c3t50015178F36592C1d0
41.2 12.1 0.1 0.1 0.0 0.0 0.0 0.1 0 1 c3t50015178F3659379d0
110.6 16.1 0.3 0.2 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B43F086d0
67.4 20.1 0.5 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B43EFB2d0
74.4 16.1 0.4 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B479EFCd0
37.2 12.1 0.3 0.1 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833ABd0
46.2 16.1 0.2 0.2 0.0 0.0 0.0 0.1 0 1 c3t55CD2E404B4833CFd0
115.6 20.1 0.7 0.2 0.0 0.0 0.0 0.2 0 2 c3t55CD2E404B48342Ad0
559.9 64.3 2.8 1.5 0.0 0.1 0.0 0.2 0 7 data
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
1834.4 16207.9 14.7 782.2 0.2 4.3 0.0 0.2 0 346 c3
251.8 1591.6 2.2 94.5 0.0 0.5 0.0 0.3 2 39 c3t50015178F36592C1d0
208.8 1372.8 1.7 94.2 0.0 0.5 0.0 0.3 2 44 c3t50015178F3659379d0
258.8 2668.7 2.7 108.1 0.0 0.6 0.0 0.2 3 48 c3t55CD2E404B43F086d0
232.8 2453.9 1.7 94.4 0.0 0.6 0.0 0.2 3 46 c3t55CD2E404B43EFB2d0
220.8 1608.6 1.6 94.5 0.0 0.5 0.0 0.3 2 39 c3t55CD2E404B479EFCd0
221.8 1378.8 1.5 94.2 0.0 0.5 0.0 0.3 2 39 c3t55CD2E404B4833ABd0
198.8 2685.7 1.6 108.1 0.0 0.7 0.0 0.2 3 50 c3t55CD2E404B4833CFd0
240.8 2447.9 1.6 94.4 0.0 0.5 0.0 0.2 3 41 c3t55CD2E404B48342Ad0
1834.4 15620.4 14.7 782.2 7044.4 5.2 403.6 0.3 67 77 data
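For what it's worth, the flush cadence itself can be measured directly; here is a minimal sketch, using the standard fbt probes on spa_sync, that quantizes how long each txg flush takes in milliseconds:

dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
  fbt::spa_sync:return /self->ts/ {
    @["spa_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
    self->ts = 0;
  }'

That makes it easy to line the iostat bursts above up against the txg syncs.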
jason matthews via illumos-zfs
2014-05-09 01:18:27 UTC
Permalink
That's definitely bad. What is the workload on the machine? What recordsize are you using (default is 128K), and how many MB/s of reads and writes is the application doing on each machine? And what is the request size or # of requests per second from the application?
The workload is write-heavy OLTP with some other ancillary workloads. The postgres file systems use an 8k recordsize; all others are 128k.
The primary application, postgres, does 8k writes; this size won't be changing. Log files were also being compressed by bzip2; off hand I don't know the size it writes at.

I use lzjb for the time being, and the compression ratio is about 2x.
@dba001:~# for i in `zoneadm list |egrep -v "global|zc"`; do zfs get compressratio data/zones/$i/postgres; done | grep ratio

data/zones/shard001a.apsalar.com/postgres compressratio 1.98x -
data/zones/shard033a.apsalar.com/postgres compressratio 1.98x -
data/zones/shard035a.apsalar.com/postgres compressratio 1.97x -
data/zones/shard002a.apsalar.com/postgres compressratio 2.03x -
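For context, the postgres datasets were created with something along these lines (options as described above, dataset name taken from the list):

zfs create -o recordsize=8k -o compression=lzjb data/zones/shard001a.apsalar.com/postgres
zfs get recordsize,compression,compressratio data/zones/shard001a.apsalar.com/postgres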

I have no idea why we would be seeing 1kb writes. I suspect that is part of the pathology.

Here is the output from dtrace under normal conditions:
***@dba001:~# dtrace -n 'dmu_tx_hold_write:entry{@=quantize(args[3])}'
dtrace: description 'dmu_tx_hold_write:entry' matched 1 probe

^C


value ------------- Distribution ------------- count
0 | 0
1 | 1578
2 | 1
4 | 1054
8 | 4
16 | 1670
32 | 2595
64 |@ 3490
128 | 1693
256 |@@@ 16268
512 | 2676
1024 | 1662
2048 | 663
4096 | 84
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 217837
16384 | 68
32768 | 58
65536 | 58
131072 | 43
262144 | 0
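If it helps to see which process is issuing the sub-8k holds (postgres vs. bzip2 vs. anything else), the same probe can be keyed by execname -- a small variation on the one-liner above:

dtrace -n 'dmu_tx_hold_write:entry { @[execname] = quantize(args[3]); }'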


There were hundreds of free gigabytes of memory on two of the systems that melted down. I didn't check the other four.
In my high-IOPS testing, even when most of the CPU was spinning in mutex_delay_default(), the system was not totally unresponsive -- e.g. iostat -xnpcz 1 still printed an update every second.
lockstat -xaggsize=10m -l 0xffffff8e91253518 -s 20 -H sleep 10
I wonder if there isn't something going on in addition to the lock contention. E.g. memory utilization? Can you get the output of ::stacks, ::memstat, and ::kmastat while the system is hung (using kmdb if necessary)?
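For reference, a minimal way to grab those on a live kernel (output paths are just examples) is:

echo ::stacks | mdb -k > /var/tmp/stacks.out
echo ::memstat | mdb -k > /var/tmp/memstat.out
echo ::kmastat | mdb -k > /var/tmp/kmastat.out

If the box is wedged hard enough that a shell won't run, the same dcmds work from kmdb on the console.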
I will look into reproducing the problem tomorrow and see if I can get that data.

Thanks for getting back to me.

j.
jason matthews via illumos-zfs
2014-05-19 21:26:20 UTC
Permalink
jason matthews via illumos-zfs
2014-05-19 21:38:21 UTC
Permalink
Post by Matthew Ahrens via illumos-zfs
Reducing the workload could definitely help things.
I can't reduce the workload, but I could throw another mirror of DC S3700s at the problem, or I could split the zpool into two pools. Do you see either of those helping?
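To be concrete about the first option, adding another mirrored pair would look roughly like this (device names are placeholders):

zpool add data mirror c3tNEWDISK1d0 c3tNEWDISK2d0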

j.



