* soft lockup - CPU#0 stuck - Kernel 3.17.2
@ 2014-11-13 13:32 Patrick Schmid
2014-11-13 14:49 ` Chris Mason
0 siblings, 1 reply; 10+ messages in thread
From: Patrick Schmid @ 2014-11-13 13:32 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1100 bytes --]
Hi all,
we run a > 500 TiB backup system on iSCSI targets using 19 BTRFS
filesystems (the biggest of which is 110 TiB) on Ubuntu 14.04 LTS and
various kernel versions. Btrfs-Progs v3.17.1. The hardware is a 24 core
Xeon E5-2620 on an Intel S2600GZ board with 128 GiB RAM.
Since btrfs has changed to kworkers (I think in 3.15) the frontend
server somewhat randomly crashes with soft lockups (see attachment). The
system is rock solid with the 3.14.22 kernel.
The lockups happen during the nightly cron-controlled rsync backups and
occur at random times during this process.
We are totally aware of the fact that this tends to be one of
those “it doesn’t work” bug reports, but it’s really hard to pin
down the source of the problem other than it seems to be related to the
kworkers. We’d love to provide any feedback we can, please let us know
what you need.
Regards
Patrick
--
Patrick Schmid <schmid@phys.ethz.ch> support: +41 44 633 2668
IT Services Group, HPT H 8 voice: +41 44 633 3997
Departement Physik, ETH Zurich
CH-8093 Zurich, Switzerland
[-- Attachment #2: NMI_soft_lockup_crash.txt --]
[-- Type: text/plain, Size: 5999 bytes --]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207104] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u481:26:108963]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207147] Modules linked in: btrfs(E) xor(E) raid6_pq(E) tcp_diag(E) inet_diag(E) autofs4(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) mousedev(E) cryptd(E) ioatdma(E) sb_edac(E) microcode(E) ipmi_si(E) edac_core(E) lpc_ich(E) mei_me(E) ipmi_msghandler(E) tpm_tis(E) mei(E) wmi(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) sunrpc(E) fscache(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) igb(E) ixgbe(E) i2c_algo_bit(E) dca(E) isci(E) ptp(E) ahci(E) libsas(E) scsi_transport_sas(E) libahci(E) mdio(E) arcmsr(E) pps_core(E)
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207152] CPU: 0 PID: 108963 Comm: kworker/u481:26 Tainted: G EL 3.17.2-stable.slub #6
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207154] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207185] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207186] task: ffff8802e34a8000 ti: ffff88070a5a8000 task.ti: ffff88070a5a8000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207194] RIP: 0010:[<ffffffff810b0b35>] [<ffffffff810b0b35>] queue_read_lock_slowpath+0xb5/0xd0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207195] RSP: 0018:ffff88070a5aba00 EFLAGS: 00000206
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207196] RAX: 00000000000041b8 RBX: ffff8806bdac3a18 RCX: 0000000000003bcc
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207197] RDX: ffff8800a2c4f350 RSI: 0000000000003bcc RDI: ffff8800a2c4f354
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207198] RBP: ffff88070a5aba08 R08: 0000000000003bc6 R09: 0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207199] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88081ee14300
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207200] R13: ffff88100e6e0000 R14: ffffffff810946ac R15: ffff88070a5ab9a8
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207202] FS: 0000000000000000(0000) GS:ffff88081ee00000(0000) knlGS:0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207204] CR2: 0000000002b97fc8 CR3: 0000000001c16000 CR4: 00000000000407f0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207205] Stack:
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207207] ffffffff8173b07c ffff88070a5aba68 ffffffffa04d8a3b 0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207209] ffff88070a5aba78 ffffffffa04757af 00003f66a0497f6e ffff88061c29af68
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207211] ffff8800a2c4f2e0 ffff88100f36d800 ffff880000000000 0000160000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207212] Call Trace:
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207218] [<ffffffff8173b07c>] ? _raw_read_lock+0x1c/0x30
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207233] [<ffffffffa04d8a3b>] btrfs_tree_read_lock+0x5b/0x120 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207241] [<ffffffffa04757af>] ? leaf_space_used+0xcf/0x110 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207249] [<ffffffffa0477d6b>] btrfs_read_lock_root_node+0x3b/0x50 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207258] [<ffffffffa047cbee>] btrfs_search_slot+0x50e/0xa10 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207269] [<ffffffffa0494257>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207282] [<ffffffffa04b35da>] __btrfs_drop_extents+0x16a/0xd90 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207285] [<ffffffff810946ac>] ? try_to_wake_up+0x1fc/0x340
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207299] [<ffffffffa04bc65b>] ? __set_extent_bit+0x15b/0x540 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207302] [<ffffffff811b0a12>] ? kmem_cache_alloc+0x122/0x130
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207311] [<ffffffffa0477aea>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207323] [<ffffffffa04a36ce>] insert_reserved_file_extent.constprop.59+0x9e/0x2f0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207335] [<ffffffffa04a94c5>] btrfs_finish_ordered_io+0x2e5/0x5f0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207345] [<ffffffffa04a9ad5>] finish_ordered_fn+0x15/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207358] [<ffffffffa04cf3e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207362] [<ffffffff8107fe09>] ? pwq_activate_delayed_work+0x39/0x80
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207374] [<ffffffffa04cf742>] btrfs_endio_write_helper+0x12/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207377] [<ffffffff81082000>] process_one_work+0x150/0x3f0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207379] [<ffffffff810826f1>] worker_thread+0x121/0x520
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207381] [<ffffffff810825d0>] ? rescuer_thread+0x330/0x330
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207385] [<ffffffff81087992>] kthread+0xd2/0xf0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207388] [<ffffffff810878c0>] ? kthread_create_on_node+0x180/0x180
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207390] [<ffffffff8173b6bc>] ret_from_fork+0x7c/0xb0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207393] [<ffffffff810878c0>] ? kthread_create_on_node+0x180/0x180
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207413] Code: 8b 02 3c ff 74 f8 f3 c3 55 48 89 e5 e8 a8 df 67 00 5d c3 83 e1 fe 0f b7 f1 b8 00 80 00 00 44 0f b7 42 04 66 44 39 c1 74 83 f3 90 <83> e8 01 75 ee 66 66 66 90 66 66 90 eb e0 66 2e 0f 1f 84 00 00
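Editor's note: whether there is more than one soft lockup, and on which CPUs, can be checked by scanning the kernel log for the watchdog line shown at the top of the attachment. The following is a minimal sketch, not part of the original thread. The function names and the log path in the usage note are assumptions; the pattern is derived only from the single watchdog line in this attachment and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches the watchdog line as it appears in the attachment above, e.g.
# "... BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u481:26:108963]".
# The task name may itself contain colons, so the PID is taken as the
# digits after the last colon inside the brackets.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    events = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "cpu": int(match["cpu"]),
                "secs": int(match["secs"]),
                "comm": match["comm"],
                "pid": int(match["pid"]),
            })
    return events


def lockups_per_cpu(events):
    """Count soft lockup reports per CPU number."""
    return Counter(event["cpu"] for event in events)
```

To use it, pass any iterable of log lines, for example an open file object for the kernel log (on Ubuntu this is often /var/log/kern.log, which is an assumption about the local syslog setup), and print the result of lockups_per_cpu(find_soft_lockups(f)).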
* Re: soft lockup - CPU#0 stuck - Kernel 3.17.2
2014-11-13 13:32 soft lockup - CPU#0 stuck - Kernel 3.17.2 Patrick Schmid
@ 2014-11-13 14:49 ` Chris Mason
2014-11-13 19:07 ` Patrick Schmid
0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2014-11-13 14:49 UTC (permalink / raw)
To: Patrick Schmid; +Cc: linux-btrfs
On Thu, Nov 13, 2014 at 8:32 AM, Patrick Schmid <schmid@phys.ethz.ch>
wrote:
> Hi all,
>
> we run a > 500 TiB backup system on iSCSI targets using 19 BTRFS
> filesystems (the biggest of which is 110 TiB) on Ubuntu 14.04 LTS and
> various kernel versions. Btrfs-Progs v3.17.1. The hardware is a 24 core
> Xeon E5-2620 on an Intel S2600GZ board with 128 GiB RAM.
>
> Since btrfs has changed to kworkers (I think in 3.15) the frontend
> server somewhat randomly crashes with soft lockups (see attachment). The
> system is rock solid with the 3.14.22 kernel.
>
> The lockups happen during the nightly cron-controlled rsync backups and
> occur at random times during this process.
> We are totally aware of the fact that this tends to be one of
> those “it doesn’t work” bug reports, but it’s really hard to pin
> down the source of the problem other than it seems to be related to the
> kworkers. We’d love to provide any feedback we can, please let us know
> what you need.
Hi,
This may actually be related to a different btrfs change in the 3.15
kernel. Do you see more than one soft lockup? After the softlockup,
does the box recover or is it stuck forever?
-chris
* Re: soft lockup - CPU#0 stuck - Kernel 3.17.2
2014-11-13 14:49 ` Chris Mason
@ 2014-11-13 19:07 ` Patrick Schmid
2014-11-13 19:12 ` Chris Mason
0 siblings, 1 reply; 10+ messages in thread
From: Patrick Schmid @ 2014-11-13 19:07 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs
On 11/13/2014 03:49 PM, Chris Mason wrote:
>
>
> On Thu, Nov 13, 2014 at 8:32 AM, Patrick Schmid <schmid@phys.ethz.ch>
> wrote:
>> Hi all,
>>
>> we run a > 500 TiB backup system on iSCSI targets using 19 BTRFS
>> filesystems (the biggest of which is 110 TiB) on Ubuntu 14.04 LTS and
>> various kernel versions. Btrfs-Progs v3.17.1. The hardware is a 24 core
>> Xeon E5-2620 on an Intel S2600GZ board with 128 GiB RAM.
>>
>> Since btrfs has changed to kworkers (I think in 3.15) the frontend
>> server somewhat randomly crashes with soft lockups (see attachment). The
>> system is rock solid with the 3.14.22 kernel.
>>
>> The lockups happen during the nightly cron-controlled rsync backups and
>> occur at random times during this process.
>> We are totally aware of the fact that this tends to be one of
>> those “it doesn’t work” bug reports, but it’s really hard to pin
>> down the source of the problem other than it seems to be related to the
>> kworkers. We’d love to provide any feedback we can, please let us know
>> what you need.
>
> Hi,
>
> This may actually be related to a different btrfs change in the 3.15
> kernel. Do you see more than one soft lockup? After the softlockup,
> does the box recover or is it stuck forever?
>
> -chris
>
Hi Chris
Normally there is more than one soft lockup; the load goes sky-high and
the server stays stuck until a hard reset.
If you want, I can send you the whole kernel log tomorrow morning.
regards
Patrick
* Re: soft lockup - CPU#0 stuck - Kernel 3.17.2
2014-11-13 19:07 ` Patrick Schmid
@ 2014-11-13 19:12 ` Chris Mason
[not found] ` <54659FDB.6070300@phys.ethz.ch>
0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2014-11-13 19:12 UTC (permalink / raw)
To: Patrick Schmid; +Cc: linux-btrfs
On Thu, Nov 13, 2014 at 2:07 PM, Patrick Schmid <schmid@phys.ethz.ch>
wrote:
> On 11/13/2014 03:49 PM, Chris Mason wrote:
>>
>>
>> On Thu, Nov 13, 2014 at 8:32 AM, Patrick Schmid <schmid@phys.ethz.ch>
>> wrote:
>>> Hi all,
>>>
>>> we run a > 500 TiB backup system on iSCSI targets using 19 BTRFS
>>> filesystems (the biggest of which is 110 TiB) on Ubuntu 14.04 LTS and
>>> various kernel versions. Btrfs-Progs v3.17.1. The hardware is a 24 core
>>> Xeon E5-2620 on an Intel S2600GZ board with 128 GiB RAM.
>>>
>>> Since btrfs has changed to kworkers (I think in 3.15) the frontend
>>> server somewhat randomly crashes with soft lockups (see attachment). The
>>> system is rock solid with the 3.14.22 kernel.
>>>
>>> The lockups happen during the nightly cron-controlled rsync backups and
>>> occur at random times during this process.
>>> We are totally aware of the fact that this tends to be one of
>>> those “it doesn’t work” bug reports, but it’s really hard to pin
>>> down the source of the problem other than it seems to be related to the
>>> kworkers. We’d love to provide any feedback we can, please let us know
>>> what you need.
>>
>> Hi,
>>
>> This may actually be related to a different btrfs change in the 3.15
>> kernel. Do you see more than one soft lockup? After the softlockup,
>> does the box recover or is it stuck forever?
>>
>> -chris
>>
>
> Hi Chris
>
> Normally there is more than one soft lockup; the load goes sky-high and
> the server stays stuck until a hard reset.
>
> If you want, I can send you the whole kernel log tomorrow morning.
Yes, the whole log would be great, thanks!
-chris
end of thread, other threads:[~2014-11-21 13:16 UTC | newest]
Thread overview: 10+ messages
2014-11-13 13:32 soft lockup - CPU#0 stuck - Kernel 3.17.2 Patrick Schmid
2014-11-13 14:49 ` Chris Mason
2014-11-13 19:07 ` Patrick Schmid
2014-11-13 19:12 ` Chris Mason
[not found] ` <54659FDB.6070300@phys.ethz.ch>
2014-11-14 17:39 ` Chris Mason
2014-11-14 18:23 ` Patrick Schmid
2014-11-14 18:31 ` Chris Mason
2014-11-14 23:47 ` Chris Mason
2014-11-21 13:01 ` Patrick Schmid
2014-11-21 13:16 ` Chris Mason