From: Andreas Bluemle <andreas.bluemle@itxperts.de>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: rbd: assertion failure in rbd_img_obj_callback()
Date: Thu, 20 Jun 2013 10:40:04 +0200 [thread overview]
Message-ID: <20130620104004.10574583@doppio> (raw)
[-- Attachment #1: Type: text/plain, Size: 1022 bytes --]
Hi,
when working on a large rbd image, I am hitting an
assertion failure in rbd kernel module; the
assertion is in rbd_img_obj_callback()
rbd_assert(img_request != NULL);
The test is a sequential write on an rbd
image, format 1. The size of the image is 1 TByte,
order is 22 (i.e. default object size).
The code of the rbd (and libceph) kernel modules is
from kernel.org, 3.10-rc6, running on a 3.0.80-0.5 kernel.
The problem looks like the rbd_obj_request structure in
question has been released already when processing of
the reponse to the original request takes place.
Please find attached some more details on this problem.
Is this a known issue already?
Best Regards
Andreas Bluemle
--
Andreas Bluemle mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
[-- Attachment #2: README.rbd_img_obj_callback-panic --]
[-- Type: application/octet-stream, Size: 12483 bytes --]
Panic in rbd_img_obj_callback(): assertion failure
static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
{
struct rbd_img_request *img_request;
u32 which = obj_request->which;
bool more = true;
rbd_assert(obj_request_img_data_test(obj_request));
img_request = obj_request->img_request;
dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
>>>>> rbd_assert(img_request != NULL);
rbd_assert(img_request->obj_request_count > 0);
rbd_assert(which != BAD_WHICH);
rbd_assert(which < img_request->obj_request_count);
rbd_assert(which >= img_request->next_completion);
....
Test: single thread sequential write to rbd image of size 1 TB.
From crash:
KERNEL: ./vmlinux-3.0.80-0.5-default
DEBUGINFO: .//vmlinux-3.0.80-0.5-default.debug
DUMPFILE: vmcore
CPUS: 32
DATE: Tue Jun 18 18:09:40 2013
UPTIME: 02:55:38
LOAD AVERAGE: 3.05, 3.04, 4.24
TASKS: 445
NODENAME: cibdb2
RELEASE: 3.0.80-0.5-default
VERSION: #1 SMP Fri May 31 20:04:37 UTC 2013 (f5a0e1f)
MACHINE: x86_64 (2599 Mhz)
MEMORY: 256 GB
PANIC: "[10516.847816] kernel BUG at /home/afrie/rpm/BUILD/ceph-kmp-3.10-rc6/default/rbd.c:2146!"
PID: 7597
COMMAND: "kworker/u:16"
TASK: ffff8813ce302440 [THREAD_INFO: ffff8813ce304000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 7597 TASK: ffff8813ce302440 CPU: 0 COMMAND: "kworker/u:16"
#0 [ffff8813ce305990] machine_kexec at ffffffff810267be
#1 [ffff8813ce3059e0] crash_kexec at ffffffff810a3e4a
#2 [ffff8813ce305ab0] oops_end at ffffffff814484a8
#3 [ffff8813ce305ad0] do_invalid_op at ffffffff810035f4
#4 [ffff8813ce305b70] invalid_op at ffffffff8144fcbb
[exception RIP: rbd_img_obj_callback+711]
RIP: ffffffffa06ec0e7 RSP: ffff8813ce305c20 RFLAGS: 00010286
RAX: 00000000000000bc RBX: ffff8815d9120a28 RCX: 0000000000001a6d
RDX: 0000000000001a6d RSI: 0000000000000007 RDI: 0000000000000246
RBP: ffff88081300c118 R8: ffffffff81d8fac0 R9: 0000000000000000
R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000000
R13: 0000000000000001 R14: 000000000243c1ca R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff8813ce305c58] handle_reply at ffffffffa06a86c5 [libceph]
#6 [ffff8813ce305c78] kernel_recvmsg at ffffffff813772fc
#7 [ffff8813ce305cf8] dispatch at ffffffffa06aa613 [libceph]
#8 [ffff8813ce305d18] process_message at ffffffffa069f975 [libceph]
#9 [ffff8813ce305da8] try_read at ffffffffa06a268b [libceph]
#10 [ffff8813ce305df8] con_work at ffffffffa06a34b5 [libceph]
#11 [ffff8813ce305e28] process_one_work at ffffffff810747cc
#12 [ffff8813ce305e78] worker_thread at ffffffff8107730a
#13 [ffff8813ce305ee8] kthread at ffffffff8107b646
#14 [ffff8813ce305f48] kernel_thread_helper at ffffffff8144fe44
Analysis:
rbd code had set up an rbd_obj_request, wrapped with an
ceph_osd_request and passed that to osd_client code.
The answer for that request is handled at the osd_client layer,
but the reference to the rbd_obj_request seems to be invalid
at the time of processing the answer.
tail of crash log output BEGIN {
[10516.847816] kernel BUG at /home/afrie/rpm/BUILD/ceph-kmp-3.10-rc6/default/rbd.c:2146!
[10516.865227] invalid opcode: 0000 [#1] SMP
[10516.874346] CPU 0
[10516.878380] Modules linked in: rbd libceph edd af_packet rdma_ucm ib_srp scsi_transport_srp scsi_tgt ib_sdp(N) rdma_
cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_mthca ib_mad ib_core cpu
freq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse xfs loop dm_mod btrfs zlib_defl
ate crc32c libcrc32c igb iTCO_wdt ipv6 sr_mod ipv6_lib ac i2c_i801 joydev iTCO_vendor_support dca rtc_cmos cdrom mlx4_c
ore sg i2c_core container button pcspkr acpi_power_meter wmi ext3 jbd mbcache usbhid hid usb_storage ehci_hcd isci(X) u
sbcore libsas scsi_transport_sas usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_emc scsi_dh_hp_sw scs
i_dh_alua scsi_dh_rdac scsi_dh ahci libahci libata megaraid_sas scsi_mod
[10517.032149] Supported: No, Unsupported modules are loaded
[10517.044120]
[10517.047467] Pid: 7597, comm: kworker/u:16 Tainted: G NX 3.0.80-0.5-default #1 FUJITSU PRIMERGY RX300 S7/D2
939-A1
[10517.072005] RIP: 0010:[<ffffffffa06ec0e7>] [<ffffffffa06ec0e7>] rbd_img_obj_callback+0x2c7/0x390 [rbd]
[10517.092887] RSP: 0018:ffff8813ce305c20 EFLAGS: 00010286
[10517.104672] RAX: 00000000000000bc RBX: ffff8815d9120a28 RCX: 0000000000001a6d
[10517.120500] RDX: 0000000000001a6d RSI: 0000000000000007 RDI: 0000000000000246
[10517.136326] RBP: ffff88081300c118 R08: ffffffff81d8fac0 R09: 0000000000000000
[10517.152143] R10: 0000000000000003 R11: 00000000ffffffff R12: 0000000000000000
[10517.167961] R13: 0000000000000001 R14: 000000000243c1ca R15: 0000000000000000
[10517.183770] FS: 0000000000000000(0000) GS:ffff88207fc00000(0000) knlGS:0000000000000000
[10517.201759] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10517.214496] CR2: 00007f586eaef000 CR3: 0000001fb2faf000 CR4: 00000000000406f0
[10517.230300] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10517.246103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10517.261906] Process kworker/u:16 (pid: 7597, threadinfo ffff8813ce304000, task ffff8813ce302440)
[10517.281422] Stack:
[10517.285919] ffff8815d9120a40 0000000000000000 ffff88081300c118 ffff881ebcdb2a70
[10517.302417] 0000000000000025 000000000243c1ca 0000000000000000 ffffffffa06a86c5
[10517.318919] ffffffffffffffff 000000000000000d 0000000000000000 ffffffff813772fc
[10517.335439] Call Trace:
[10517.340914] [<ffffffffa06a86c5>] handle_reply+0x405/0x660 [libceph]
[10517.355000] [<ffffffffa06aa613>] dispatch+0x93/0xb0 [libceph]
[10517.367941] [<ffffffffa069f975>] process_message+0x95/0x160 [libceph]
[10517.382425] [<ffffffffa06a268b>] try_read+0x38b/0x720 [libceph]
[10517.395744] [<ffffffffa06a34b5>] con_work+0xb5/0x390 [libceph]
[10517.408869] [<ffffffff810747cc>] process_one_work+0x16c/0x350
[10517.421801] [<ffffffff8107730a>] worker_thread+0x17a/0x410
[10517.434156] [<ffffffff8107b646>] kthread+0x96/0xa0
[10517.444979] [<ffffffff8144fe44>] kernel_thread_helper+0x4/0x10
[10517.458102] Code: c0 e8 88 81 d5 e0 0f 0b eb fe 48 c7 c1 af 10 6f a0 ba 62 08 00 00 48 c7 c6 c0 33 6f a0 48 c7 c7 90
1a 6f a0 31 c0 e8 63 81 d5 e0 <0f> 0b eb fe 49 89 f8 4c 89 e1 48 c7 c2 c0 33 6f a0 48 c7 c6 1a
[10517.500821] RIP [<ffffffffa06ec0e7>] rbd_img_obj_callback+0x2c7/0x390 [rbd]
[10517.516443] RSP <ffff8813ce305c20>
} END tail of crash log output
call flow and data analysis BEGIN {
disassembly of handle_reply shows
0xffffffffa06a86ad <handle_reply+1005>: mov 0x280(%rbp),%rax
0xffffffffa06a86b4 <handle_reply+1012>: test %rax,%rax
0xffffffffa06a86b7 <handle_reply+1015>: je 0xffffffffa06a87cc
0xffffffffa06a86bd <handle_reply+1021>: mov %r12,%rsi
0xffffffffa06a86c0 <handle_reply+1024>: mov %rbp,%rdi
0xffffffffa06a86c3 <handle_reply+1027>: callq *%rax
0xffffffffa06a86c5 <handle_reply+1029>: and $0x4,%r13d
handle_reply+1029 == handle_reply+0x405 as shown in the stack backtrace.
This is the return address after the call to r_callback in
the current ceph_osd_request structure.
RBP register holds the reference to the ceph_osd_request structure,
with offset 0x280 pointing at the r_callback element.
0xffff88081300c118: 0x000000000243c1ca 0xffff8814f0f8d180
0xffff88081300c128: 0x0000000000000000 0x0000000000000000
0xffff88081300c138: 0xffff88081300c138 0xffff88081300c138
0xffff88081300c148: 0xffff88081300c148 0xffff88081300c148
0xffff88081300c158: 0xffff88081300c158 0xffff88081300c158
0xffff88081300c168: 0xffff88081300c168 0xffff88081300c168
0xffff88081300c178: 0x0000000000000000 0x0000000000000002
0xffff88081300c188: 0xffff881ffa78acf8 0x000000d90000007b
0xffff88081300c198: 0x0000000000000000 0x0000000000000000
0xffff88081300c1a8: 0x0000000000000000 0x0000000000000000
0xffff88081300c1b8: 0x0000000000000000 0x0000000000000000
0xffff88081300c1c8: 0x0000000000000000 0x0000000000000002
0xffff88081300c1d8: 0xffff88135f47f988 0xffff881ebcdb2a70
0xffff88081300c1e8: 0x0000000100000024 0x0000000000000001
0xffff88081300c1f8: 0x0000100000002201 0x0000000000000000
0xffff88081300c208: 0x0000000000001000 0x0000000000000000
0xffff88081300c218: 0x0000000000000000 0x0000000000000003
0xffff88081300c228: 0xffff8814f5590c80 0x0000000000001000
0xffff88081300c238: 0x0000000000000000 0x0000000000000000
0xffff88081300c248: 0x0000000000000000 0x0000000000000000
0xffff88081300c258: 0x0000000000000000 0x0000000000000000
0xffff88081300c268: 0x0000000000000000 0x0000000000000000
0xffff88081300c278: 0x0000000000000000 0x0000000000000000
0xffff88081300c288: 0x0000000000000000 0x0000000000000000
0xffff88081300c298: 0x0000000000000000 0x0000000000000000
0xffff88081300c2a8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2b8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2c8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2d8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2e8: 0x0000000000000000 0x0000000000000000
0xffff88081300c2f8: 0xffff880e5142f884 0xffff880e5142f888
0xffff88081300c308: 0xffff880e5142f8a6 0xffff880e5142f8b7
0xffff88081300c318: 0xffff880e5142f926 0xffff880e5142f894
0xffff88081300c328: 0x0000000000000000 0x0000000000000000
0xffff88081300c338: 0x0000000100000000 0x0000000100000000
0xffff88081300c348: 0xffff881fb2fac718 0x0000000000000001
0xffff88081300c358: 0x0000000000000000 0x0000000000000000
0xffff88081300c368: 0xffff88081300c368 0xffff88081300c368
0xffff88081300c378: 0x0000000000000000 0x0000000000000000
0xffff88081300c388: 0xffff88081300c388 0xffff88081300c388
0xffff88081300c398: 0xffffffffa06ed6a0 0x0000000000000000
0xffff88081300c3a8: 0x0000000000006e7c 0x00000000000001bf
0xffff88081300c3b8: 0xffff88081300c3b8 0xffff88081300c3b8
0xffff88081300c3c8: 0x0000000000000000 0xffff8815d9120a28
0xffff88081300c3d8: 0x6630312e302e6272 0x3534623862362e30
0xffff88081300c3e8: 0x30303030302e3736 0x0062363938666530
0xffff88081300c3f8: 0x0000000000000000 0x0000000000000000
0xffff88081300c408: 0x0000000000000000 0x0000000000000000
0xffff88081300c118 + 0x280 = 0xffff88081300c398: 0xffffffffa06ed6a0
crash> dis 0xffffffffa06ed6a0
0xffffffffa06ed6a0 <rbd_osd_req_callback>: push %r13
Hence: the r_callback function is rbd_osd_req_callback().
This will in turn invoke the callback function for the rbd_obj_request:
the rbd_obj_request is contained in element r_priv of ceph_osd_request
at offset 0x2b8
0xffff88081300c118 + 0x2b8 = 0xffff88081300c3d0: 0xffff8815d9120a28
crash> gdb x/20xg 0xffff8815d9120a28
0xffff8815d9120a28: 0x0000000000000000 0x0000000000000000
0xffff8815d9120a38: 0x0000000000001000 0x0000000000000003
0xffff8815d9120a48: 0x0000000000000000 0x000000ef896b0000
0xffff8815d9120a58: 0xdead000000100100 0xdead000000200200
0xffff8815d9120a68: 0x00000001ffffffff 0xffff8814f5590c80
0xffff8815d9120a78: 0x0000000000000000 0x0000000000000000
0xffff8815d9120a88: 0x0000000000000000 0xffff88081300c118
0xffff8815d9120a98: 0x0000000000001000 0x0000000000000000
0xffff8815d9120aa8: 0x0000000000000000 0x0000000000000000
0xffff8815d9120ab8: 0x0000000000000000 0xffff8815d9120ac0
The img_request field at 0xffff8815d9120a48 is NULL: this is the cause
for the assertion.
The img_offset is a valid offset into the 1 TByte rbd image under test:
0xef896b0000 < 0x10000000000 (1 TByte).
The invalid pointers in the links of the rbd_obj_request (which is
a list_head structure) are marked as invalid:
0xffff8815d9120a58: 0xdead000000100100 0xdead000000200200
This suggests that the rbd_obj_request had been invalidated already
before the processing of the answer to the corressponding request
processing takes place.
} END call flow and data analysis
next reply other threads:[~2013-06-20 8:40 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-06-20 8:40 Andreas Bluemle [this message]
2013-06-20 15:45 ` rbd: assertion failure in rbd_img_obj_callback() Sage Weil
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130620104004.10574583@doppio \
--to=andreas.bluemle@itxperts.de \
--cc=ceph-devel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.