* [BUG] kernel panics with drbd
@ 2015-08-04 13:52 Matthew Vernon
2015-08-04 14:22 ` Ian Campbell
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Vernon @ 2015-08-04 13:52 UTC (permalink / raw)
To: xen-devel
[-- Attachment #1: message body text --]
[-- Type: text/plain, Size: 2529 bytes --]
Hi,
I'm getting dom0 kernel panics, relating to moderately heavy use of
drbd. I think this is a Xen bug.
My Xen hosts are Debian jessie amd64 boxes, on slightly elderly Intel
kit.
Linux ophon 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux
Linux opus 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux
Both have the standard jessie versions of Xen - 4.4.1-9+deb8u1 and
xen-tools - 4.5-1
I have disable_sendpage enabled for drbd:
root@opus:~# cat /etc/modprobe.d/drbd.conf
options drbd disable_sendpage=1
root@opus:~# cat /sys/module/drbd/parameters/disable_sendpage
Y
root@ophon:~# cat /etc/modprobe.d/drbd.conf
options drbd disable_sendpage=1
root@ophon:~# cat /sys/module/drbd/parameters/disable_sendpage
Y
I have a script running on "ophon" that sets up a drbd device (itself
as primary, "opus" as secondary), makes an LVM pv+vg on top of that
drbd device, and then calls xen-create-image[0]. "opus" typically kernel
panics shortly after xen-create-image starts.
I attach the relevant bit of kern.log from one such crash to this mail
- you can see the drbd operations happening a second or so before the
crash. I also attach the relevant drbd .res file
The bug is not 100% repeatable, but still fairly reliable (for obvious
reasons, extensive testing and hard-rebooting my kit is not a very
joyous prospect). I did once achieve a similar result by running
drbd-overview on opus, which said
kernel:[ 1127.630208] BUG: soft lockup - CPU#2 stuck for 23s!
[xenstored:864]
on console and then panicked much as before.
The "amusing" quirk is that similar code worked a couple of weeks ago
when I last tried it; that code does now also produce kernel panics
AFAICT (with a not-100%-repeatable bug and long reproduction
timescales 'cos of having to power-cycle etc. it's hard to be
completely certain).
The two hosts are part of a pacemaker cluster, and "opus" is otherwise
able to run guests fine.
I hope that's sufficient information; I'm happy to supply other config
files etc. if necessary.
Regards,
Matthew
[0] The code in question is in fact a python script; running on ophon,
it does the following (using ssh to run commands on opus):
--both hosts--
lvcreate -L 20G -nmwsig-mws-02474 guests
drbdadm -- --force create-md
drbdadm up mws-02474
--ophon only--
drbdadm wait-connect
drbdadm new-current-uuid --clear-bitmap minor-4
drbdadm primary mws-02474
pvcreate /dev/drbd4
vgcreate mws-02474-vg /dev/drbd4
xen-create-image ... --lvm mws-02474-vg
[-- Attachment #2: DRBD resource configuration --]
[-- Type: text/plain, Size: 324 bytes --]
resource mws-02474 {
device /dev/drbd4;
disk /dev/guests/mwsig-mws-02474;
meta-disk internal;
on ophon {
address 131.111.8.218:7792;
}
on opus {
address 131.111.8.217:7792;
}
net {
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
}
[-- Attachment #3: kernel log output --]
[-- Type: text/plain, Size: 13436 bytes --]
Aug 3 16:03:13 opus kernel: [ 1250.026811] drbd mws-priv-1: Starting worker thread (from drbdsetup-84 [12987])
Aug 3 16:03:13 opus kernel: [ 1250.027313] block drbd4: disk( Diskless -> Attaching )
Aug 3 16:03:13 opus kernel: [ 1250.027409] drbd mws-priv-1: Method to ensure write ordering: flush
Aug 3 16:03:13 opus kernel: [ 1250.027413] block drbd4: max BIO size = 4096
Aug 3 16:03:13 opus kernel: [ 1250.027418] block drbd4: drbd_bm_resize called with capacity == 41941688
Aug 3 16:03:13 opus kernel: [ 1250.027558] block drbd4: resync bitmap: bits=5242711 words=81918 pages=160
Aug 3 16:03:13 opus kernel: [ 1250.027561] block drbd4: size = 20 GB (20970844 KB)
Aug 3 16:03:13 opus kernel: [ 1250.032268] block drbd4: Writing the whole bitmap, size changed
Aug 3 16:03:13 opus kernel: [ 1250.047827] block drbd4: bitmap WRITE of 160 pages took 4 jiffies
Aug 3 16:03:13 opus kernel: [ 1250.061634] block drbd4: 20 GB (5242711 bits) marked out-of-sync by on disk bit-map.
Aug 3 16:03:13 opus kernel: [ 1250.180186] block drbd4: bitmap READ of 160 pages took 2 jiffies
Aug 3 16:03:13 opus kernel: [ 1250.180291] block drbd4: recounting of set bits took additional 0 jiffies
Aug 3 16:03:13 opus kernel: [ 1250.180293] block drbd4: 20 GB (5242711 bits) marked out-of-sync by on disk bit-map.
Aug 3 16:03:13 opus kernel: [ 1250.180304] block drbd4: Suspended AL updates
Aug 3 16:03:13 opus kernel: [ 1250.180307] block drbd4: disk( Attaching -> Inconsistent )
Aug 3 16:03:13 opus kernel: [ 1250.180310] block drbd4: attached to UUIDs 0000000000000004:0000000000000000:0000000000000000:0000000000000000
Aug 3 16:03:13 opus kernel: [ 1250.191161] drbd mws-priv-1: conn( StandAlone -> Unconnected )
Aug 3 16:03:13 opus kernel: [ 1250.191183] drbd mws-priv-1: Starting receiver thread (from drbd_w_mws-priv [12989])
Aug 3 16:03:13 opus kernel: [ 1250.191345] drbd mws-priv-1: receiver (re)started
Aug 3 16:03:13 opus kernel: [ 1250.191360] drbd mws-priv-1: conn( Unconnected -> WFConnection )
Aug 3 16:03:13 opus kernel: [ 1250.689576] drbd mws-priv-1: Handshake successful: Agreed network protocol version 101
Aug 3 16:03:13 opus kernel: [ 1250.689580] drbd mws-priv-1: Agreed to support TRIM on protocol level
Aug 3 16:03:13 opus kernel: [ 1250.689616] drbd mws-priv-1: conn( WFConnection -> WFReportParams )
Aug 3 16:03:13 opus kernel: [ 1250.689631] drbd mws-priv-1: Starting asender thread (from drbd_r_mws-priv [12992])
Aug 3 16:03:13 opus kernel: [ 1250.737084] block drbd4: max BIO size = 1048576
Aug 3 16:03:13 opus kernel: [ 1250.737091] block drbd4: drbd_sync_handshake:
Aug 3 16:03:13 opus kernel: [ 1250.737094] block drbd4: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:5242711 flags:0
Aug 3 16:03:13 opus kernel: [ 1250.737096] block drbd4: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:5242711 flags:0
Aug 3 16:03:13 opus kernel: [ 1250.737098] block drbd4: uuid_compare()=0 by rule 10
Aug 3 16:03:13 opus kernel: [ 1250.737100] block drbd4: No resync, but 5242711 bits in bitmap!
Aug 3 16:03:13 opus kernel: [ 1250.737105] block drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
Aug 3 16:03:13 opus kernel: [ 1250.737109] block drbd4: Resumed AL updates
Aug 3 16:03:14 opus kernel: [ 1250.773903] block drbd4: Accepted new current UUID, preparing to skip initial sync
Aug 3 16:03:14 opus kernel: [ 1250.777061] block drbd4: bitmap WRITE of 160 pages took 1 jiffies
Aug 3 16:03:14 opus kernel: [ 1250.788564] block drbd4: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 3 16:03:14 opus kernel: [ 1250.788573] block drbd4: disk( Inconsistent -> UpToDate ) pdsk( Inconsistent -> UpToDate )
Aug 3 16:03:14 opus kernel: [ 1250.797104] block drbd4: receiver updated UUIDs to 14460554106EF79A:0000000000000000:0000000000000000:0000000000000000
Aug 3 16:03:14 opus kernel: [ 1250.797117] block drbd4: peer( Secondary -> Primary )
Aug 3 16:03:15 opus kernel: [ 1251.748952] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
Aug 3 16:03:15 opus kernel: [ 1251.748983] BUG: unable to handle kernel paging request at ffff8800022f3d88
Aug 3 16:03:15 opus kernel: [ 1251.749016] IP: [<ffff8800022f3d88>] 0xffff8800022f3d88
Aug 3 16:03:15 opus kernel: [ 1251.749041] PGD 1814067 PUD 1815067 PMD 2f81067 PTE 80100000022f3067
Aug 3 16:03:15 opus kernel: [ 1251.749082] Oops: 0011 [#1] SMP
Aug 3 16:03:15 opus kernel: [ 1251.749106] Modules linked in: xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd
ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Aug 3 16:03:15 opus kernel: [ 1251.749691] CPU: 0 PID: 12993 Comm: drbd_a_mws-priv Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2
Aug 3 16:03:15 opus kernel: [ 1251.749726] Hardware name: Intel Corporation S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Aug 3 16:03:15 opus kernel: [ 1251.749765] task: ffff8800171742d0 ti: ffff8800022f0000 task.ti: ffff8800022f0000
Aug 3 16:03:15 opus kernel: [ 1251.749847] RIP: e030:[<ffff8800022f3d88>] [<ffff8800022f3d88>] 0xffff8800022f3d88
Aug 3 16:03:15 opus kernel: [ 1251.749934] RSP: e02b:ffff8800022f3d90 EFLAGS: 00010212
Aug 3 16:03:15 opus kernel: [ 1251.749984] RAX: 00000000fffffffc RBX: ffffffffffffffff RCX: 0000000000000113
Aug 3 16:03:15 opus kernel: [ 1251.750039] RDX: 0000000000000113 RSI: 00000000fffffe01 RDI: ffffffff81463f75
Aug 3 16:03:15 opus kernel: [ 1251.750094] RBP: ffff8800171742d0 R08: ffff8800022f0000 R09: 0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.750150] R10: ffff88001751b890 R11: 0000000000000000 R12: 0000000000000001
Aug 3 16:03:15 opus kernel: [ 1251.750205] R13: 0000000000000000 R14: 0000000000000010 R15: ffff880016c92000
Aug 3 16:03:15 opus kernel: [ 1251.750263] FS: 00007f90c43c0740(0000) GS:ffff88001fa00000(0000) knlGS:0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.750348] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 3 16:03:15 opus kernel: [ 1251.750399] CR2: ffff8800022f3d88 CR3: 000000001073f000 CR4: 0000000000002660
Aug 3 16:03:15 opus kernel: [ 1251.750454] Stack:
Aug 3 16:03:15 opus kernel: [ 1251.750493] ffff8800022f3d88 0000000000000010 0000000000000000 0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.750589] ffff8800022f3d90 0000000000000001 0000000000000000 0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.750686] 0000000000004100 ffffffffa02577be ffff880016c92080 0000001000000000
Aug 3 16:03:15 opus kernel: [ 1251.750783] Call Trace:
Aug 3 16:03:15 opus kernel: [ 1251.750829] [<ffffffffa02577be>] ? drbd_asender+0x27e/0x750 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.750887] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.750947] [<ffffffffa0260d46>] ? drbd_thread_setup+0x46/0x130 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.751006] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.751065] [<ffffffff81087fad>] ? kthread+0xbd/0xe0
Aug 3 16:03:15 opus kernel: [ 1251.751114] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180
Aug 3 16:03:15 opus kernel: [ 1251.751170] [<ffffffff815114d8>] ? ret_from_fork+0x58/0x90
Aug 3 16:03:15 opus kernel: [ 1251.751221] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180
Aug 3 16:03:15 opus kernel: [ 1251.751274] Code: ff ff ff ff ff ff ff ff ff ff ff 88 3d 2f 02 00 88 ff ff 30 e0 00 00 00 00 00 00 12 02 01 00 00 00 00 00 90 3d 2f 02 00 88 ff ff <2b> e0 00 00 00 00 00 00 88 3d 2f 02 00 88 ff ff 10 00 00 00 00
Aug 3 16:03:15 opus kernel: [ 1251.751694] RIP [<ffff8800022f3d88>] 0xffff8800022f3d88
Aug 3 16:03:15 opus kernel: [ 1251.751747] RSP <ffff8800022f3d90>
Aug 3 16:03:15 opus kernel: [ 1251.751790] CR2: ffff8800022f3d88
Aug 3 16:03:15 opus kernel: [ 1251.752128] ---[ end trace 975e04f66c2d9004 ]---
Aug 3 16:03:15 opus kernel: [ 1251.835012] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Aug 3 16:03:15 opus kernel: [ 1251.835235] IP: [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.835408] PGD 0
Aug 3 16:03:15 opus kernel: [ 1251.835531] Oops: 0002 [#2] SMP
Aug 3 16:03:15 opus kernel: [ 1251.835704] Modules linked in: xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd
ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Aug 3 16:03:15 opus kernel: [ 1251.840379] CPU: 0 PID: 12992 Comm: drbd_r_mws-priv Tainted: G D 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2
Aug 3 16:03:15 opus kernel: [ 1251.840515] Hardware name: Intel Corporation S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Aug 3 16:03:15 opus kernel: [ 1251.840648] task: ffff880016c1a050 ti: ffff8800173e4000 task.ti: ffff8800173e4000
Aug 3 16:03:15 opus kernel: [ 1251.840771] RIP: e030:[<ffffffffa02453bd>] [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.840951] RSP: e02b:ffff8800173e7ce0 EFLAGS: 00010097
Aug 3 16:03:15 opus kernel: [ 1251.841041] RAX: 0000000000000000 RBX: ffff880017439700 RCX: 000000000000009c
Aug 3 16:03:15 opus kernel: [ 1251.841137] RDX: 0000000000000000 RSI: ffff88000c850200 RDI: ffff88000c85bed0
Aug 3 16:03:15 opus kernel: [ 1251.841233] RBP: ffff88000cb22800 R08: 0000000000000cce R09: ffff88000c850200
Aug 3 16:03:15 opus kernel: [ 1251.841329] R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff880002b676a0
Aug 3 16:03:15 opus kernel: [ 1251.841424] R13: ffff88001f8463b0 R14: ffff88000cb22bb0 R15: ffff88000cb22800
Aug 3 16:03:15 opus kernel: [ 1251.841523] FS: 00007f90c43c0740(0000) GS:ffff88001fa00000(0000) knlGS:0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.841648] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 3 16:03:15 opus kernel: [ 1251.841739] CR2: 0000000000000008 CR3: 000000001073f000 CR4: 0000000000002660
Aug 3 16:03:15 opus kernel: [ 1251.841835] Stack:
Aug 3 16:03:15 opus kernel: [ 1251.841913] 0000000000000000 0000000000030003 ffff8800174397b8 0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.842216] 0000000000000000 0000000000102800 0000000000400000 0000000000104800
Aug 3 16:03:15 opus kernel: [ 1251.842518] 0000000000000000 0000000000102800 0000000000000000 0000000000000000
Aug 3 16:03:15 opus kernel: [ 1251.842820] Call Trace:
Aug 3 16:03:15 opus kernel: [ 1251.842906] [<ffffffffa0254bb5>] ? drbd_submit_peer_request+0x85/0x330 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843033] [<ffffffffa02556ea>] ? receive_Data+0x36a/0xe40 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843130] [<ffffffffa0257407>] ? drbd_receiver+0x117/0x250 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843228] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843328] [<ffffffffa0260d46>] ? drbd_thread_setup+0x46/0x130 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843427] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.843524] [<ffffffff81087fad>] ? kthread+0xbd/0xe0
Aug 3 16:03:15 opus kernel: [ 1251.843614] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180
Aug 3 16:03:15 opus kernel: [ 1251.843709] [<ffffffff815114d8>] ? ret_from_fork+0x58/0x90
Aug 3 16:03:15 opus kernel: [ 1251.843801] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180
Aug 3 16:03:15 opus kernel: [ 1251.843894] Code: 04 48 8b 45 00 48 8d b8 d0 00 00 00 e8 dd bc 2c e1 8b 53 58 49 89 c1 c1 ea 09 01 95 54 02 00 00 49 83 fd ff 48 8b 13 48 8b 43 08 <48> 89 42 08 48 89 10 48 8d 85 c0 03 00 00 48 8b 95 c8 03 00 00
Aug 3 16:03:15 opus kernel: [ 1251.846996] RIP [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug 3 16:03:15 opus kernel: [ 1251.847168] RSP <ffff8800173e7ce0>
Aug 3 16:03:15 opus kernel: [ 1251.847252] CR2: 0000000000000008
Aug 3 16:03:15 opus kernel: [ 1251.847335] ---[ end trace 975e04f66c2d9005 ]---
[-- Attachment #4: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-04 13:52 [BUG] kernel panics with drbd Matthew Vernon
@ 2015-08-04 14:22 ` Ian Campbell
2015-08-04 14:39 ` Matthew Vernon
0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2015-08-04 14:22 UTC (permalink / raw)
To: Matthew Vernon, xen-devel
On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
> Hi,
Hello,
> I'm getting dom0 kernel panics, relating to moderately heavy use of
> drbd. I think this is a Xen bug.
It is remarkably similar looking to
http://blog.chinewalking.com/drbd-kernel-oops-w-trim/ . Do you have trim?
Ian.
>
> My Xen hosts are Debian jessie amd64 boxes, on slightly elderly Intel
> kit.
>
> Linux ophon 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04
> -24) x86_64 GNU/Linux
> Linux opus 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07
> -17) x86_64 GNU/Linux
>
> Both have the standard jessie versions of Xen - 4.4.1-9+deb8u1 and
> xen-tools - 4.5-1
>
> I have disable_sendpage enabled for drbd:
> root@opus:~# cat /etc/modprobe.d/drbd.conf
> options drbd disable_sendpage=1
> root@opus:~# cat /sys/module/drbd/parameters/disable_sendpage
> Y
> root@ophon:~# cat /etc/modprobe.d/drbd.conf
> options drbd disable_sendpage=1
> root@ophon:~# cat /sys/module/drbd/parameters/disable_sendpage
> Y
>
> I have a script running on "ophon" that sets up a drbd device (itself
> as primary, "opus" as secondary), makes an LVM pv+vg on top of that
> drbd device, and then calls xen-create-image[0]. "opus" typically kernel
> panics shortly after xen-create-image starts.
>
> I attach the relevant bit of kern.log from one such crash to this mail
> - you can see the drbd operations happening a second or so before the
> crash. I also attach the relevant drbd .res file
>
> The bug is not 100% repeatable, but still fairly reliable (for obvious
> reasons, extensive testing and hard-rebooting my kit is not a very
> joyous prospect). I did once achieve a similar result by running
> drbd-overview on opus, which said
> kernel:[ 1127.630208] BUG: soft lockup - CPU#2 stuck for 23s!
> [xenstored:864]
> on console and then panicked much as before.
>
> The "amusing" quirk is that similar code worked a couple of weeks ago
> when I last tried it; that code does now also produce kernel panics
> AFAICT (with a not-100%-repeatable bug and long reproduction
> timescales 'cos of having to power-cycle etc. it's hard to be
> completely certain).
>
> The two hosts are part of a pacemaker cluster, and "opus" is otherwise
> able to run guests fine.
>
> I hope that's sufficient information; I'm happy to supply other config
> files etc. if necessary.
>
> Regards,
>
> Matthew
>
> [0] The code in question is in fact a python script; running on ophon,
> it does the following (using ssh to run commands on opus):
> --both hosts--
> lvcreate -L 20G -nmwsig-mws-02474 guests
> drbdadm -- --force create-md
> drbdadm up mws-02474
> --ophon only--
> drbdadm wait-connect
> drbdadm new-current-uuid --clear-bitmap minor-4
> drbdadm primary mws-02474
> pvcreate /dev/drbd4
> vgcreate mws-02474-vg /dev/drbd4
> xen-create-image ... --lvm mws-02474-vg
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-04 14:22 ` Ian Campbell
@ 2015-08-04 14:39 ` Matthew Vernon
2015-08-04 21:24 ` Matthew Vernon
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Vernon @ 2015-08-04 14:39 UTC (permalink / raw)
To: Ian Campbell, xen-devel
Hi,
On 04/08/15 15:22, Ian Campbell wrote:
> On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
>> Hi,
>
> Hello,
>
>> I'm getting dom0 kernel panics, relating to moderately heavy use of
>> drbd. I think this is a Xen bug.
>
> It is remarkably similar looking to
> http://blog.chinewalking.com/drbd-kernel-oops-w-trim/ . Do you have trim?
I seem to, yes:
Aug 4 14:28:24 ophon kernel: [2856757.049680] drbd mws-02474: Agreed to
support TRIM on protocol level
I have DRBD 8.9.2~rc1-2.
Regards,
Matthew
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-04 14:39 ` Matthew Vernon
@ 2015-08-04 21:24 ` Matthew Vernon
2015-08-05 8:48 ` Ian Campbell
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Vernon @ 2015-08-04 21:24 UTC (permalink / raw)
To: Ian Campbell, xen-devel
On 04/08/15 15:39, Matthew Vernon wrote:
> Hi,
>
> On 04/08/15 15:22, Ian Campbell wrote:
>> On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
>>> Hi,
>>
>> Hello,
>>
>>> I'm getting dom0 kernel panics, relating to moderately heavy use of
>>> drbd. I think this is a Xen bug.
>>
>> It is remarkably similar looking to
>> http://blog.chinewalking.com/drbd-kernel-oops-w-trim/ . Do you have trim?
>
> I seem to, yes:
>
> Aug 4 14:28:24 ophon kernel: [2856757.049680] drbd mws-02474: Agreed to
> support TRIM on protocol level
>
> I have DRBD 8.9.2~rc1-2.
[FTAOD, I think this means that this cannot be a repeat of the issue
mentioned in that blog post?]
Matthew
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-04 21:24 ` Matthew Vernon
@ 2015-08-05 8:48 ` Ian Campbell
2015-08-05 16:22 ` Matthew Vernon
0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2015-08-05 8:48 UTC (permalink / raw)
To: Matthew Vernon, xen-devel
On Tue, 2015-08-04 at 22:24 +0100, Matthew Vernon wrote:
> On 04/08/15 15:39, Matthew Vernon wrote:
> > Hi,
> >
> > On 04/08/15 15:22, Ian Campbell wrote:
> > > On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
> > > > Hi,
> > >
> > > Hello,
> > >
> > > > I'm getting dom0 kernel panics, relating to moderately heavy use of
> > > > drbd. I think this is a Xen bug.
> > >
> > > It is remarkably similar looking to
> > > http://blog.chinewalking.com/drbd-kernel-oops-w-trim/ . Do you have
> > > trim?
> >
> > I seem to, yes:
> >
> > Aug 4 14:28:24 ophon kernel: [2856757.049680] drbd mws-02474: Agreed
> > to
> > support TRIM on protocol level
> >
> > I have DRBD 8.9.2~rc1-2.
>
> [FTAOD, I think this means that this cannot be a repeat of the issue
> mentioned in that blog post?]
I must confess I'm a bit confused about the versioning of drbd, while
Jessie ships the 8.9.2~rc1-2 userspace it appears (from inspection of the
kernel source) to contain version 8.4.3 of the kernel side.
I think it's the kernel side which the blog post is referring too, which
would also be consistent with the kernel side crash.
What does /proc/drbd say for you?
My usual first suggestion would be to try a newer kernel from jessie
-backports, but there doesn't appear to be one right now. Depending on the
nature of the systems (e.g. production vs. testlab) you could consider
installing the 4.0.8 kernel from Stretch.
Alternatively I don't know if drbd upstream provides a convenient way to
build and install the modules in an out of tree way?
Or maybe there is some option somewhere to tell drbd to not attempt to do
trim, even if it thinks it can?
Ian.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-05 8:48 ` Ian Campbell
@ 2015-08-05 16:22 ` Matthew Vernon
2015-08-06 8:17 ` Ian Campbell
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Vernon @ 2015-08-05 16:22 UTC (permalink / raw)
To: Ian Campbell, xen-devel
Hi,
On 05/08/15 09:48, Ian Campbell wrote:
> On Tue, 2015-08-04 at 22:24 +0100, Matthew Vernon wrote:
>> On 04/08/15 15:39, Matthew Vernon wrote:
>>> Hi,
>>>
>>> On 04/08/15 15:22, Ian Campbell wrote:
>>>> On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
>>>>> Hi,
>>>>
>>>> Hello,
>>>>
>>>>> I'm getting dom0 kernel panics, relating to moderately heavy use of
>>>>> drbd. I think this is a Xen bug.
>>>>
>>>> It is remarkably similar looking to
>>>> http://blog.chinewalking.com/drbd-kernel-oops-w-trim/ . Do you have
>>>> trim?
>>>
>>> I seem to, yes:
>>>
>>> Aug 4 14:28:24 ophon kernel: [2856757.049680] drbd mws-02474: Agreed
>>> to
>>> support TRIM on protocol level
>>>
>>> I have DRBD 8.9.2~rc1-2.
>>
>> [FTAOD, I think this means that this cannot be a repeat of the issue
>> mentioned in that blog post?]
>
> I must confess I'm a bit confused about the versioning of drbd, while
> Jessie ships the 8.9.2~rc1-2 userspace it appears (from inspection of the
> kernel source) to contain version 8.4.3 of the kernel side.
Those should, AIUI, work together (they both claim '(api:1/proto:86-101)').
> I think it's the kernel side which the blog post is referring too, which
> would also be consistent with the kernel side crash.
Yes, stable does have module version 8.4.3.
> Alternatively I don't know if drbd upstream provides a convenient way to
> build and install the modules in an out of tree way?
I built the 8.4.6 kernel module, and that has resolved the kernel
panics. So this does seem to have been the "trim" issue you suggested,
and a problem with version-compatibility between drbd kernel & userland.
So, sorry for bothering you, and thanks for the help. I'll try and get
Debian to ship a newer drbd kernel module (and DRBD to be more honest
about which versions of kernel and userland play nicely together).
Regards,
Matthew
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] kernel panics with drbd
2015-08-05 16:22 ` Matthew Vernon
@ 2015-08-06 8:17 ` Ian Campbell
0 siblings, 0 replies; 7+ messages in thread
From: Ian Campbell @ 2015-08-06 8:17 UTC (permalink / raw)
To: Matthew Vernon, xen-devel
On Wed, 2015-08-05 at 17:22 +0100, Matthew Vernon wrote:
>
> > Alternatively I don't know if drbd upstream provides a convenient way
> > to
> > build and install the modules in an out of tree way?
>
> I built the 8.4.6 kernel module, and that has resolved the kernel
> panics.
Great!
> So this does seem to have been the "trim" issue you suggested,
> and a problem with version-compatibility between drbd kernel & userland.
FWIW I don't think the fact that some version of the userspace happen to
enable a feature which is buggy on the kernel side is a "version
-compatibility" issue as such, just a plain kernel bug in the 8.4.3 kernel
module which is fixed in the later one.
> So, sorry for bothering you, and thanks for the help.
No problem.
Ian.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-08-06 8:17 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-04 13:52 [BUG] kernel panics with drbd Matthew Vernon
2015-08-04 14:22 ` Ian Campbell
2015-08-04 14:39 ` Matthew Vernon
2015-08-04 21:24 ` Matthew Vernon
2015-08-05 8:48 ` Ian Campbell
2015-08-05 16:22 ` Matthew Vernon
2015-08-06 8:17 ` Ian Campbell
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.