From: Alex Elder <elder@ieee.org>
To: Olivier Bonvalet <ceph.list@daevel.fr>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Issue #5876 : assertion failure in rbd_img_obj_callback()
Date: Fri, 25 Apr 2014 07:17:41 -0500
Message-ID: <535A5265.6080407@ieee.org>
In-Reply-To: <1398425826.2927.21.camel@localhost>

On 04/25/2014 06:37 AM, Olivier Bonvalet wrote:
> On Friday, 4 April 2014 at 20:57 -0500, Alex Elder wrote:
>> On 04/04/2014 08:16 PM, Olivier Bonvalet wrote:
>>> On Tuesday, 25 March 2014 at 09:39 +0100, Olivier Bonvalet wrote:
>>>> Hi,
>>>>
>>>> What can/should I do to help fix this problem?
>>>>
>>>> For now, the RBD kernel client hangs on:
>>>> Assertion failure in rbd_img_obj_callback() at line 2131:
>>>> rbd_assert(which >= img_request->next_completion);
>>>>
>>>> or on:
>>>> Assertion failure in rbd_img_obj_callback() at line 2127:
>>>> rbd_assert(img_request != NULL);
>>>>
>>>>
>>>> I hit both cases at least once per week on the latest 3.13.5 kernels.
>>>>
>>>> It seems that the problem occurs only on the more heavily loaded
>>>> servers (I have 4 nearly identical servers, and the crash occurs on
>>>> two of them. If I move the VM, the crash follows...).
>>>>
>>>> Olivier
>>>>
>>>> --
>>>
>>> Hi,
>>>
>>> So, after some days without any problems, RBD crashed tonight:
>>
>> Unfortunately this could be a symptom of the same sort of race.
>> When an object request is removed from its image request's list
>> the request count gets decremented. To be honest, all of these
>> assertions in rbd_img_obj_callback() are probably unsafe, at
>> least until I get the patch that does proper reference counting
>> implemented:
>>
>> rbd_assert(img_request != NULL);
>> rbd_assert(img_request->obj_request_count > 0);
>> rbd_assert(which != BAD_WHICH);
>> rbd_assert(which < img_request->obj_request_count);
>>
>> Until then I think you can avoid this by commenting out those
>> assertions. I'm afraid there will remain a (smaller) window
>> of opportunity for a problem to occur, but I believe commenting
>> those out will help for now.
>>
>> I'm very sorry you're hitting these. I'll see if I can get
>> a comprehensive fix this weekend.
>>
>> -Alex
>>
>
> Hi,
>
> I suppose that I should add:
> if (img_request == NULL) goto out;
>
> Right?

Sure, why not?

To be serious, we need to get you a proper fix. I have one
written (I think I've had it for two weeks) but have been
unable to test it at all. And this is one I don't want to
just give to a customer to test; I want to make sure it works
before sending it out.

I was hoping we had made the window of vulnerability small
enough that the problem wouldn't occur. Your new report
shows we're not that lucky. I'll see what I can do.

-Alex

> With the asserts commented out, I get a NULL pointer dereference:
>
> Apr 25 13:03:15 murmillia kernel: [124049.097927] BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
> Apr 25 13:03:15 murmillia kernel: [124049.098008] IP: [<ffffffff8105d922>] do_raw_spin_lock+0x5/0x22
> Apr 25 13:03:15 murmillia kernel: [124049.098056] PGD 0
> Apr 25 13:03:15 murmillia kernel: [124049.098091] Oops: 0002 [#1] SMP
> Apr 25 13:03:15 murmillia kernel: [124049.098133] Modules linked in: cbc rbd libceph xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables xfs libcrc32c bridge loop iTCO_wdt gpio_ich iTCO_vendor_support serio_raw sb_edac edac_core i2c_i801 evdev lpc_ich mfd_core ioatdma shpchp ipmi_si ipmi_msghandler wmi ac button dm_mod hid_generic usbhid hid sg sd_mod crc_t10dif crct10dif_common megaraid_sas ahci libahci isci ehci_pci ehci_hcd libsas usbcore libata igb ixgbe scsi_transport_sas i2c_algo_bit i2c_core usb_common scsi_mod dca ptp pps_core mdio
> Apr 25 13:03:15 murmillia kernel: [124049.098695] CPU: 0 PID: 31669 Comm: kworker/0:0 Not tainted 3.13-dae-dom0 #1
> Apr 25 13:03:15 murmillia kernel: [124049.098739] Hardware name: Supermicro X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> Apr 25 13:03:15 murmillia kernel: [124049.098809] Workqueue: ceph-msgr con_work [libceph]
> Apr 25 13:03:15 murmillia kernel: [124049.098851] task: ffff8802458b38a0 ti: ffff88023cfcc000 task.ti: ffff88023cfcc000
> Apr 25 13:03:15 murmillia kernel: [124049.098916] RIP: e030:[<ffffffff8105d922>] [<ffffffff8105d922>] do_raw_spin_lock+0x5/0x22
> Apr 25 13:03:15 murmillia kernel: [124049.098987] RSP: e02b:ffff88023cfcdce0 EFLAGS: 00010002
> Apr 25 13:03:15 murmillia kernel: [124049.099026] RAX: 0000000000010000 RBX: ffff88025749a3c8 RCX: 0000000000002201
> Apr 25 13:03:15 murmillia kernel: [124049.099091] RDX: 000000000000003c RSI: ffff88025749a3e0 RDI: 000000000000003c
> Apr 25 13:03:15 murmillia kernel: [124049.099154] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
> Apr 25 13:03:15 murmillia kernel: [124049.099218] R10: ffff88024749d07d R11: ffff8802476929f8 R12: ffff880269f6b701
> Apr 25 13:03:15 murmillia kernel: [124049.099281] R13: 00000000ffffffff R14: ffff8802476927c0 R15: 0000000000000000
> Apr 25 13:03:15 murmillia kernel: [124049.099349] FS: 00007f01384088e0(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
> Apr 25 13:03:15 murmillia kernel: [124049.099415] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 25 13:03:15 murmillia kernel: [124049.099455] CR2: 000000000000003c CR3: 0000000243dec000 CR4: 0000000000042660
> Apr 25 13:03:15 murmillia kernel: [124049.099519] Stack:
> Apr 25 13:03:15 murmillia kernel: [124049.099549] ffffffffa032caad 000000000000003c ffff8802476929f8 0000000000002201
> Apr 25 13:03:15 murmillia kernel: [124049.099629] ffff8802411ea218 ffff8802476927b8 ffff880269f6b718 0000000000000000
> Apr 25 13:03:15 murmillia kernel: [124049.099708] ffff8802476927c0 0000000000000000 ffffffffa030b69b 0000000000000025
> Apr 25 13:03:15 murmillia kernel: [124049.099786] Call Trace:
> Apr 25 13:03:15 murmillia kernel: [124049.099823] [<ffffffffa032caad>] ? rbd_img_obj_callback+0x56/0x308 [rbd]
> Apr 25 13:03:15 murmillia kernel: [124049.099871] [<ffffffffa030b69b>] ? dispatch+0x3e4/0x55e [libceph]
> Apr 25 13:03:15 murmillia kernel: [124049.099915] [<ffffffffa03060fc>] ? con_work+0xf6e/0x1a65 [libceph]
> Apr 25 13:03:15 murmillia kernel: [124049.099959] [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
> Apr 25 13:03:15 murmillia kernel: [124049.100004] [<ffffffff81005959>] ? xen_force_evtchn_callback+0x9/0xa
> Apr 25 13:03:15 murmillia kernel: [124049.100048] [<ffffffff810484e8>] ? process_one_work+0x15a/0x214
> Apr 25 13:03:15 murmillia kernel: [124049.100100] [<ffffffff8104896c>] ? worker_thread+0x139/0x1de
> Apr 25 13:03:15 murmillia kernel: [124049.100141] [<ffffffff81048833>] ? rescuer_thread+0x26e/0x26e
> Apr 25 13:03:15 murmillia kernel: [124049.100183] [<ffffffff8104d007>] ? kthread+0x9e/0xa6
> Apr 25 13:03:15 murmillia kernel: [124049.100223] [<ffffffff8104cf69>] ? __kthread_parkme+0x55/0x55
> Apr 25 13:03:15 murmillia kernel: [124049.100268] [<ffffffff81372a0c>] ? ret_from_fork+0x7c/0xb0
> Apr 25 13:03:15 murmillia kernel: [124049.100309] [<ffffffff8104cf69>] ? __kthread_parkme+0x55/0x55
> Apr 25 13:03:15 murmillia kernel: [124049.100349] Code: d0 f0 0f b1 0f 39 d0 0f 94 c0 0f b6 c0 c3 31 c0 48 81 ff e8 db 36 81 72 0c 31 c0 48 81 ff af df 36 81 0f 92 c0 c3 b8 00 00 01 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 89 d1 74 0c 66 8b 07 66 39
> Apr 25 13:03:15 murmillia kernel: [124049.100727] RIP [<ffffffff8105d922>] do_raw_spin_lock+0x5/0x22
> Apr 25 13:03:15 murmillia kernel: [124049.100773] RSP <ffff88023cfcdce0>
> Apr 25 13:03:15 murmillia kernel: [124049.100807] CR2: 000000000000003c
> Apr 25 13:03:15 murmillia kernel: [124049.101120] ---[ end trace 7f81ace5e0aed716 ]---
>
>
>
>