* rbd: assertion failure in rbd_img_obj_callback()
@ 2013-06-20 8:40 Andreas Bluemle
2013-06-20 15:45 ` Sage Weil
0 siblings, 1 reply; 2+ messages in thread
From: Andreas Bluemle @ 2013-06-20 8:40 UTC (permalink / raw)
To: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 1022 bytes --]
Hi,
when working on a large rbd image, I am hitting an
assertion failure in rbd kernel module; the
assertion is in rbd_img_obj_callback()
rbd_assert(img_request != NULL);
The test is a sequential write on an rbd
image, format 1. The size of the image is 1 TByte,
order is 22 (i.e. default object size).
The code of the rbd (and libceph) kernel modules is
from kernel.org, 3.10-rc6, running on a 3.0.80-0.5 kernel.
The problem looks like the rbd_obj_request structure in
question has been released already when processing of
the reponse to the original request takes place.
Please find attached some more details on this problem.
Is this a known issue already?
Best Regards
Andreas Bluemle
--
Andreas Bluemle mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
[-- Attachment #2: README.rbd_img_obj_callback-panic --]
[-- Type: application/octet-stream, Size: 12483 bytes --]
Panic in rbd_img_obj_callback(): assertion failure
static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
{
struct rbd_img_request *img_request;
u32 which = obj_request->which;
bool more = true;
rbd_assert(obj_request_img_data_test(obj_request));
img_request = obj_request->img_request;
dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
>>>>> rbd_assert(img_request != NULL);
rbd_assert(img_request->obj_request_count > 0);
rbd_assert(which != BAD_WHICH);
rbd_assert(which < img_request->obj_request_count);
rbd_assert(which >= img_request->next_completion);
....
Test: single thread sequential write to rbd image of size 1 TB.
From crash:
KERNEL: ./vmlinux-3.0.80-0.5-default
DEBUGINFO: .//vmlinux-3.0.80-0.5-default.debug
DUMPFILE: vmcore
CPUS: 32
DATE: Tue Jun 18 18:09:40 2013
UPTIME: 02:55:38
LOAD AVERAGE: 3.05, 3.04, 4.24
TASKS: 445
NODENAME: cibdb2
RELEASE: 3.0.80-0.5-default
VERSION: #1 SMP Fri May 31 20:04:37 UTC 2013 (f5a0e1f)
MACHINE: x86_64 (2599 Mhz)
MEMORY: 256 GB
PANIC: "[10516.847816] kernel BUG at /home/afrie/rpm/BUILD/ceph-kmp-3.10-rc6/default/rbd.c:2146!"
PID: 7597
COMMAND: "kworker/u:16"
TASK: ffff8813ce302440 [THREAD_INFO: ffff8813ce304000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 7597 TASK: ffff8813ce302440 CPU: 0 COMMAND: "kworker/u:16"
#0 [ffff8813ce305990] machine_kexec at ffffffff810267be
#1 [ffff8813ce3059e0] crash_kexec at ffffffff810a3e4a
#2 [ffff8813ce305ab0] oops_end at ffffffff814484a8
#3 [ffff8813ce305ad0] do_invalid_op at ffffffff810035f4
#4 [ffff8813ce305b70] invalid_op at ffffffff8144fcbb
[exception RIP: rbd_img_obj_callback+711]
RIP: ffffffffa06ec0e7 RSP: ffff8813ce305c20 RFLAGS: 00010286
RAX: 00000000000000bc RBX: ffff8815d9120a28 RCX: 0000000000001a6d
RDX: 0000000000001a6d RSI: 0000000000000007 RDI: 0000000000000246
RBP: ffff88081300c118 R8: ffffffff81d8fac0 R9: 0000000000000000
R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000000
R13: 0000000000000001 R14: 000000000243c1ca R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff8813ce305c58] handle_reply at ffffffffa06a86c5 [libceph]
#6 [ffff8813ce305c78] kernel_recvmsg at ffffffff813772fc
#7 [ffff8813ce305cf8] dispatch at ffffffffa06aa613 [libceph]
#8 [ffff8813ce305d18] process_message at ffffffffa069f975 [libceph]
#9 [ffff8813ce305da8] try_read at ffffffffa06a268b [libceph]
#10 [ffff8813ce305df8] con_work at ffffffffa06a34b5 [libceph]
#11 [ffff8813ce305e28] process_one_work at ffffffff810747cc
#12 [ffff8813ce305e78] worker_thread at ffffffff8107730a
#13 [ffff8813ce305ee8] kthread at ffffffff8107b646
#14 [ffff8813ce305f48] kernel_thread_helper at ffffffff8144fe44
Analysis:
rbd code had set up an rbd_obj_request, wrapped with an
ceph_osd_request and passed that to osd_client code.
The answer for that request is handled at the osd_client layer,
but the reference to the rbd_obj_request seems to be invalid
at the time of processing the answer.
tail of crash log output BEGIN {
[10516.847816] kernel BUG at /home/afrie/rpm/BUILD/ceph-kmp-3.10-rc6/default/rbd.c:2146!
[10516.865227] invalid opcode: 0000 [#1] SMP
[10516.874346] CPU 0
[10516.878380] Modules linked in: rbd libceph edd af_packet rdma_ucm ib_srp scsi_transport_srp scsi_tgt ib_sdp(N) rdma_
cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_mthca ib_mad ib_core cpu
freq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse xfs loop dm_mod btrfs zlib_defl
ate crc32c libcrc32c igb iTCO_wdt ipv6 sr_mod ipv6_lib ac i2c_i801 joydev iTCO_vendor_support dca rtc_cmos cdrom mlx4_c
ore sg i2c_core container button pcspkr acpi_power_meter wmi ext3 jbd mbcache usbhid hid usb_storage ehci_hcd isci(X) u
sbcore libsas scsi_transport_sas usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_emc scsi_dh_hp_sw scs
i_dh_alua scsi_dh_rdac scsi_dh ahci libahci libata megaraid_sas scsi_mod
[10517.032149] Supported: No, Unsupported modules are loaded
[10517.044120]
[10517.047467] Pid: 7597, comm: kworker/u:16 Tainted: G NX 3.0.80-0.5-default #1 FUJITSU PRIMERGY RX300 S7/D2
939-A1
[10517.072005] RIP: 0010:[<ffffffffa06ec0e7>] [<ffffffffa06ec0e7>] rbd_img_obj_callback+0x2c7/0x390 [rbd]
[10517.092887] RSP: 0018:ffff8813ce305c20 EFLAGS: 00010286
[10517.104672] RAX: 00000000000000bc RBX: ffff8815d9120a28 RCX: 0000000000001a6d
[10517.120500] RDX: 0000000000001a6d RSI: 0000000000000007 RDI: 0000000000000246
[10517.136326] RBP: ffff88081300c118 R08: ffffffff81d8fac0 R09: 0000000000000000
[10517.152143] R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000000
[10517.167961] R13: 0000000000000001 R14: 000000000243c1ca R15: 0000000000000000
[10517.183770] FS: 0000000000000000(0000) GS:ffff88207fc00000(0000) knlGS:0000000000000000
[10517.201759] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10517.214496] CR2: 00007f586eaef000 CR3: 0000001fb2faf000 CR4: 00000000000406f0
[10517.230300] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10517.246103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10517.261906] Process kworker/u:16 (pid: 7597, threadinfo ffff8813ce304000, task ffff8813ce302440)
[10517.281422] Stack:
[10517.285919] ffff8815d9120a40 0000000000000000 ffff88081300c118 ffff881ebcdb2a70
[10517.302417] 0000000000000025 000000000243c1ca 0000000000000000 ffffffffa06a86c5
[10517.318919] ffffffffffffffff 000000000000000d 0000000000000000 ffffffff813772fc
[10517.335439] Call Trace:
[10517.340914] [<ffffffffa06a86c5>] handle_reply+0x405/0x660 [libceph]
[10517.355000] [<ffffffffa06aa613>] dispatch+0x93/0xb0 [libceph]
[10517.367941] [<ffffffffa069f975>] process_message+0x95/0x160 [libceph]
[10517.382425] [<ffffffffa06a268b>] try_read+0x38b/0x720 [libceph]
[10517.395744] [<ffffffffa06a34b5>] con_work+0xb5/0x390 [libceph]
[10517.408869] [<ffffffff810747cc>] process_one_work+0x16c/0x350
[10517.421801] [<ffffffff8107730a>] worker_thread+0x17a/0x410
[10517.434156] [<ffffffff8107b646>] kthread+0x96/0xa0
[10517.444979] [<ffffffff8144fe44>] kernel_thread_helper+0x4/0x10
[10517.458102] Code: c0 e8 88 81 d5 e0 0f 0b eb fe 48 c7 c1 af 10 6f a0 ba 62 08 00 00 48 c7 c6 c0 33 6f a0 48 c7 c7 90
1a 6f a0 31 c0 e8 63 81 d5 e0 <0f> 0b eb fe 49 89 f8 4c 89 e1 48 c7 c2 c0 33 6f a0 48 c7 c6 1a
[10517.500821] RIP [<ffffffffa06ec0e7>] rbd_img_obj_callback+0x2c7/0x390 [rbd]
[10517.516443] RSP <ffff8813ce305c20>
} END tail of crash log output
call flow and data analysis BEGIN {
disassembly of handle_reply shows
0xffffffffa06a86ad <handle_reply+1005>: mov 0x280(%rbp),%rax
0xffffffffa06a86b4 <handle_reply+1012>: test %rax,%rax
0xffffffffa06a86b7 <handle_reply+1015>: je 0xffffffffa06a87cc
0xffffffffa06a86bd <handle_reply+1021>: mov %r12,%rsi
0xffffffffa06a86c0 <handle_reply+1024>: mov %rbp,%rdi
0xffffffffa06a86c3 <handle_reply+1027>: callq *%rax
0xffffffffa06a86c5 <handle_reply+1029>: and $0x4,%r13d
handle_reply+1029 == handle_reply+0x405 as shown in the stack backtrace.
This is the return address after the call to r_callback in
the current ceph_osd_request structure.
RBP register holds the reference to the ceph_osd_request structure,
with offset 0x280 pointing at the r_callback element.
0xffff88081300c118: 0x000000000243c1ca 0xffff8814f0f8d180
0xffff88081300c128: 0x0000000000000000 0x0000000000000000
0xffff88081300c138: 0xffff88081300c138 0xffff88081300c138
0xffff88081300c148: 0xffff88081300c148 0xffff88081300c148
0xffff88081300c158: 0xffff88081300c158 0xffff88081300c158
0xffff88081300c168: 0xffff88081300c168 0xffff88081300c168
0xffff88081300c178: 0x0000000000000000 0x0000000000000002
0xffff88081300c188: 0xffff881ffa78acf8 0x000000d90000007b
0xffff88081300c198: 0x0000000000000000 0x0000000000000000
0xffff88081300c1a8: 0x0000000000000000 0x0000000000000000
0xffff88081300c1b8: 0x0000000000000000 0x0000000000000000
0xffff88081300c1c8: 0x0000000000000000 0x0000000000000002
0xffff88081300c1d8: 0xffff88135f47f988 0xffff881ebcdb2a70
0xffff88081300c1e8: 0x0000000100000024 0x0000000000000001
0xffff88081300c1f8: 0x0000100000002201 0x0000000000000000
0xffff88081300c208: 0x0000000000001000 0x0000000000000000
0xffff88081300c218: 0x0000000000000000 0x0000000000000003
0xffff88081300c228: 0xffff8814f5590c80 0x0000000000001000
0xffff88081300c238: 0x0000000000000000 0x0000000000000000
0xffff88081300c248: 0x0000000000000000 0x0000000000000000
0xffff88081300c258: 0x0000000000000000 0x0000000000000000
0xffff88081300c268: 0x0000000000000000 0x0000000000000000
0xffff88081300c278: 0x0000000000000000 0x0000000000000000
0xffff88081300c288: 0x0000000000000000 0x0000000000000000
0xffff88081300c298: 0x0000000000000000 0x0000000000000000
0xffff88081300c2a8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2b8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2c8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2d8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2e8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2f8: 0xffff880e5142f884 0xffff880e5142f888
0xffff88081300c308: 0xffff880e5142f8a6 0xffff880e5142f8b7
0xffff88081300c318: 0xffff880e5142f926 0xffff880e5142f894
0xffff88081300c328: 0x0000000000000000 0x0000000000000000
0xffff88081300c338: 0x0000000100000000 0x0000000100000000
0xffff88081300c348: 0xffff881fb2fac718 0x0000000000000001
0xffff88081300c358: 0x0000000000000000 0x0000000000000000
0xffff88081300c368: 0xffff88081300c368 0xffff88081300c368
0xffff88081300c378: 0x0000000000000000 0x0000000000000000
0xffff88081300c388: 0xffff88081300c388 0xffff88081300c388
0xffff88081300c398: 0xffffffffa06ed6a0 0x0000000000000000
0xffff88081300c3a8: 0x0000000000006e7c 0x00000000000001bf
0xffff88081300c3b8: 0xffff88081300c3b8 0xffff88081300c3b8
0xffff88081300c3c8: 0x0000000000000000 0xffff8815d9120a28
0xffff88081300c3d8: 0x6630312e302e6272 0x3534623862362e30
0xffff88081300c3e8: 0x30303030302e3736 0x0062363938666530
0xffff88081300c3f8: 0x0000000000000000 0x0000000000000000
0xffff88081300c408: 0x0000000000000000 0x0000000000000000
0xffff88081300c118 + 0x280 = 0xffff88081300c398: 0xffffffffa06ed6a0
crash> dis 0xffffffffa06ed6a0
0xffffffffa06ed6a0 <rbd_osd_req_callback>: push %r13
Hence: the r_callback function is rbd_osd_req_callback().
This will in turn invoke the callback function for the rbd_obj_request:
the rbd_obj_request is contained in element r_priv of ceph_osd_request
at offset 0x2b8
0xffff88081300c118 + 0x2b8 = 0xffff88081300c3d0: 0xffff8815d9120a28
crash> gdb x/20xg 0xffff8815d9120a28
0xffff8815d9120a28: 0x0000000000000000 0x0000000000000000
0xffff8815d9120a38: 0x0000000000001000 0x0000000000000003
0xffff8815d9120a48: 0x0000000000000000 0x000000ef896b0000
0xffff8815d9120a58: 0xdead000000100100 0xdead000000200200
0xffff8815d9120a68: 0x00000001ffffffff 0xffff8814f5590c80
0xffff8815d9120a78: 0x0000000000000000 0x0000000000000000
0xffff8815d9120a88: 0x0000000000000000 0xffff88081300c118
0xffff8815d9120a98: 0x0000000000001000 0x0000000000000000
0xffff8815d9120aa8: 0x0000000000000000 0x0000000000000000
0xffff8815d9120ab8: 0x0000000000000000 0xffff8815d9120ac0
The img_request field at 0xffff8815d9120a48 is NULL: this is the cause
for the assertion.
The img_offset is a valid offset into the 1 TByte rbd image under test:
0xef896b0000 < 0x10000000000 (1 TByte).
The invalid pointers in the links of the rbd_obj_request (which is
a list_head structure) are marked as invalid:
0xffff8815d9120a58: 0xdead000000100100 0xdead000000200200
This suggests that the rbd_obj_request had been invalidated already
before the processing of the answer to the corressponding request
processing takes place.
} END call flow and data analysis
^ permalink raw reply [flat|nested] 2+ messages in thread* Re: rbd: assertion failure in rbd_img_obj_callback()
2013-06-20 8:40 rbd: assertion failure in rbd_img_obj_callback() Andreas Bluemle
@ 2013-06-20 15:45 ` Sage Weil
0 siblings, 0 replies; 2+ messages in thread
From: Sage Weil @ 2013-06-20 15:45 UTC (permalink / raw)
To: Andreas Bluemle; +Cc: Ceph Development
On Thu, 20 Jun 2013, Andreas Bluemle wrote:
> Hi,
>
> when working on a large rbd image, I am hitting an
> assertion failure in rbd kernel module; the
> assertion is in rbd_img_obj_callback()
>
> rbd_assert(img_request != NULL);
>
> The test is a sequential write on an rbd
> image, format 1. The size of the image is 1 TByte,
> order is 22 (i.e. default object size).
>
> The code of the rbd (and libceph) kernel modules is
> from kernel.org, 3.10-rc6, running on a 3.0.80-0.5 kernel.
>
> The problem looks like the rbd_obj_request structure in
> question has been released already when processing of
> the reponse to the original request takes place.
>
> Please find attached some more details on this problem.
>
> Is this a known issue already?
Not yet. Do you mind opening an issue at tracker.ceph.com to track this?
Thanks!
sage
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2013-06-20 15:45 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-20 8:40 rbd: assertion failure in rbd_img_obj_callback() Andreas Bluemle
2013-06-20 15:45 ` Sage Weil
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.