* md raid10 Oops on recent kernels @ 2012-08-13 12:49 Ivan Vasilyev 2012-08-14 0:50 ` NeilBrown 0 siblings, 1 reply; 4+ messages in thread From: Ivan Vasilyev @ 2012-08-13 12:49 UTC (permalink / raw) To: linux-raid Hi all, I'm using md raid over LVM on some servers (since EVMS project has proven to be dead), but on kernel versions 3.4 and 3.5 there is a problem with raid10. It can be reproduced on current Debian Wheezy (set up from scratch with 7.0beta1 installer) with kernel package v3.5 taken from experimental repository. Array create, initial sync (after "dd ... of=/dev/md/rtest_a") and --assemble give no errors, but then any directIO on md device causes oops (dd without iflag=direct does not). Seems strange, but V4L capture by uvcvideo driver also freezes after first oops (and resumes only after mdadm --stop on problematic array) Recent LVM2 has built-in RAID (implemented with md driver), but unfortunately raid10 is not supported, so it can't replace current setup. Is this a bug in MD driver or in some other part of the kernel? Will it affect other raid setups in future? (like old one with raid0 layered over raid1) ------------------------------------------------------------ Tested on a KVM guest, so hardware seems to be irrelevant. Config: 1.5Gb memory, 2 vCPUs, 5 virtio disks *** Short summary of commands: vgcreate gurion_vg_jnt /dev/vdb6 /dev/vdc6 /dev/vdd6 /dev/vde6 lvcreate -n rtest_a_c1r -l 129 gurion_vg_jnt /dev/vdb6 ... lvcreate -n rtest_a_c4r -l 129 guiron_vg_jnt /dev/vde6 mdadm --create /dev/md/rtest_a --verbose --metadata=1.2 \ --level=raid10 --raid-devices=4 --name=rtest_a \ --chunk=1024 --bitmap=internal \ /dev/gurion_vg_jnt/rtest_a_c1r /dev/gurion_vg_jnt/rtest_a_c2r \ /dev/gurion_vg_jnt/rtest_a_c3r /dev/gurion_vg_jnt/rtest_a_c4r Linux version 3.5-trunk-amd64 (Debian 3.5-1~experimental.1) (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-1) ) #1 SMP Thu Aug 2 17:16:27 UTC 2012 ii linux-image-3.5-trunk-amd64 3.5-1~experimental.1 ii mdadm 3.2.5-1 (oops is captured after "mdadm --assemble /dev/md/rtest_a" and then "lvs") ---------- BUG: unable to handle kernel paging request at ffffffff00000001 IP: [<ffffffff00000001>] 0xffffffff00000000 PGD 160d067 PUD 0 Oops: 0010 [#1] SMP CPU 0 Modules linked in: appletalk ipx p8023 p8022 psnap llc rose netrom ax25 iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop crc32c_intel ghash_clmulni_intel processor aesni_intel aes_x86_64 i2c_piix4 aes_generic cryptd thermal_sys button snd_pcm i2c_core snd_page_alloc snd_timer snd soundcore psmouse pcspkr serio_raw evdev microcode virtio_balloon ext4 crc16 jbd2 mbcache dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear md_mod sr_mod cdrom ata_generic virtio_net floppy virtio_blk ata_piix uhci_hcd ehci_hcd libata scsi_mod virtio_pci virtio_ring virtio usbcore usb_common [last unloaded: scsi_wait_scan] Pid: 11591, comm: lvs Not tainted 3.5-trunk-amd64 #1 Bochs Bochs RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 RSP: 0018:ffff88005a601a58 EFLAGS: 00010292 RAX: 0000000000100000 RBX: ffff88005cc34c80 RCX: ffff88005d334440 RDX: 0000000000000000 RSI: ffff88005a601a68 RDI: ffff88005b3d1c00 RBP: 0000000000000000 R08: ffffffffa017e99c R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: ffff88005cc34d00 R14: ffffea00010d7d60 R15: 
0000000000000000 FS: 00007fd8fcef77a0(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000001 CR3: 000000005f836000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process lvs (pid: 11591, threadinfo ffff88005a600000, task ffff88005f8ae040) Stack: ffff880054ad0c80 ffffffff81126dec ffff880057065900 0000000000000400 ffffea0000000000 0000000000000000 ffff88005a601b80 ffff8800575ded40 ffff88005a601c20 0000000000000000 0000000000000000 ffffffff811299b5 Call Trace: [<ffffffff81126dec>] ? bio_alloc+0xe/0x1e [<ffffffff811299b5>] ? dio_bio_add_page+0x16/0x4c [<ffffffff81129a51>] ? dio_send_cur_page+0x66/0xa4 [<ffffffff8112a4dc>] ? do_blockdev_direct_IO+0x8cb/0xa81 [<ffffffff8125ed7e>] ? kobj_lookup+0xf6/0x12e [<ffffffff811a13c7>] ? disk_map_sector_rcu+0x5d/0x5d [<ffffffff811a2d9f>] ? disk_clear_events+0x3f/0xe4 [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b [<ffffffff81128000>] ? blkdev_direct_IO+0x4e/0x53 [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b [<ffffffff810bbf07>] ? generic_file_aio_read+0xeb/0x5b5 [<ffffffff811103fd>] ? dput+0x26/0xf4 [<ffffffff81115b87>] ? mntput_no_expire+0x2a/0x134 [<ffffffff8110b3fc>] ? do_last+0x67d/0x717 [<ffffffff810ffe44>] ? do_sync_read+0xb4/0xec [<ffffffff8110051e>] ? vfs_read+0x9f/0xe6 [<ffffffff811005aa>] ? sys_read+0x45/0x6b [<ffffffff81364779>] ? system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [<ffffffff00000001>] 0xffffffff00000000 RSP <ffff88005a601a58> CR2: ffffffff00000001 ---[ end trace b86c49ca25a6cdb2 ]--- ---------- ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels 2012-08-13 12:49 md raid10 Oops on recent kernels Ivan Vasilyev @ 2012-08-14 0:50 ` NeilBrown 2012-08-14 18:56 ` Ivan Vasilyev 0 siblings, 1 reply; 4+ messages in thread From: NeilBrown @ 2012-08-14 0:50 UTC (permalink / raw) To: Ivan Vasilyev; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 6345 bytes --] On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com> wrote: > Hi all, > > I'm using md raid over LVM on some servers (since EVMS project has > proven to be dead), but on kernel versions 3.4 and 3.5 there is a > problem with raid10. > It can be reproduced on current Debian Wheezy (set up from scratch with > 7.0beta1 installer) with kernel package v3.5 taken > from experimental repository. > > Array create, initial sync (after "dd ... of=/dev/md/rtest_a") and > --assemble give no errors, > but then any directIO on md device causes oops (dd without > iflag=direct does not). > Seems strange, but V4L capture by uvcvideo driver also freezes after first oops > (and resumes only after mdadm --stop on problematic array) > > Recent LVM2 has built-in RAID (implemented with md driver), but > unfortunately raid10 is not supported, so it can't replace current > setup. > > Is this a bug in MD driver or in some other part of the kernel? Will it affect > other raid setups in future? (like old one with raid0 layered over raid1) > > > ------------------------------------------------------------ > > Tested on a KVM guest, so hardware seems to be irrelevant. > Config: 1.5Gb memory, 2 vCPUs, 5 virtio disks > > > *** Short summary of commands: > vgcreate gurion_vg_jnt /dev/vdb6 /dev/vdc6 /dev/vdd6 /dev/vde6 > lvcreate -n rtest_a_c1r -l 129 gurion_vg_jnt /dev/vdb6 > ... > lvcreate -n rtest_a_c4r -l 129 guiron_vg_jnt /dev/vde6 > mdadm --create /dev/md/rtest_a --verbose --metadata=1.2 \ > --level=raid10 --raid-devices=4 --name=rtest_a \ > --chunk=1024 --bitmap=internal \ > /dev/gurion_vg_jnt/rtest_a_c1r /dev/gurion_vg_jnt/rtest_a_c2r \ > /dev/gurion_vg_jnt/rtest_a_c3r /dev/gurion_vg_jnt/rtest_a_c4r > > > Linux version 3.5-trunk-amd64 (Debian 3.5-1~experimental.1) > (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-1) ) > #1 SMP Thu Aug 2 17:16:27 UTC 2012 > > ii linux-image-3.5-trunk-amd64 3.5-1~experimental.1 > ii mdadm 3.2.5-1 > > (oops is captured after "mdadm --assemble /dev/md/rtest_a" and then "lvs") > ---------- > BUG: unable to handle kernel paging request at ffffffff00000001 > IP: [<ffffffff00000001>] 0xffffffff00000000 > PGD 160d067 PUD 0 > Oops: 0010 [#1] SMP > CPU 0 > Modules linked in: appletalk ipx p8023 p8022 psnap llc rose netrom > ax25 iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 > nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables nfsd nfs > nfs_acl auth_rpcgss fscache lockd sunrpc loop crc32c_intel > ghash_clmulni_intel processor aesni_intel aes_x86_64 i2c_piix4 > aes_generic cryptd thermal_sys button snd_pcm i2c_core snd_page_alloc > snd_timer snd soundcore psmouse pcspkr serio_raw evdev microcode > virtio_balloon ext4 crc16 jbd2 mbcache dm_mod raid10 raid456 > async_raid6_recov async_memcpy async_pq async_xor xor async_tx > raid6_pq raid1 raid0 multipath linear md_mod sr_mod cdrom ata_generic > virtio_net floppy virtio_blk ata_piix uhci_hcd ehci_hcd libata > scsi_mod virtio_pci virtio_ring virtio usbcore usb_common [last > unloaded: scsi_wait_scan] > > Pid: 11591, comm: lvs Not tainted 3.5-trunk-amd64 #1 Bochs Bochs > RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 
> RSP: 0018:ffff88005a601a58 EFLAGS: 00010292 > RAX: 0000000000100000 RBX: ffff88005cc34c80 RCX: ffff88005d334440 > RDX: 0000000000000000 RSI: ffff88005a601a68 RDI: ffff88005b3d1c00 > RBP: 0000000000000000 R08: ffffffffa017e99c R09: 0000000000000001 > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > R13: ffff88005cc34d00 R14: ffffea00010d7d60 R15: 0000000000000000 > FS: 00007fd8fcef77a0(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffff00000001 CR3: 000000005f836000 CR4: 00000000000407f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process lvs (pid: 11591, threadinfo ffff88005a600000, task ffff88005f8ae040) > Stack: > ffff880054ad0c80 ffffffff81126dec ffff880057065900 0000000000000400 > ffffea0000000000 0000000000000000 ffff88005a601b80 ffff8800575ded40 > ffff88005a601c20 0000000000000000 0000000000000000 ffffffff811299b5 > Call Trace: > [<ffffffff81126dec>] ? bio_alloc+0xe/0x1e > [<ffffffff811299b5>] ? dio_bio_add_page+0x16/0x4c > [<ffffffff81129a51>] ? dio_send_cur_page+0x66/0xa4 > [<ffffffff8112a4dc>] ? do_blockdev_direct_IO+0x8cb/0xa81 > [<ffffffff8125ed7e>] ? kobj_lookup+0xf6/0x12e > [<ffffffff811a13c7>] ? disk_map_sector_rcu+0x5d/0x5d > [<ffffffff811a2d9f>] ? disk_clear_events+0x3f/0xe4 > [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b > [<ffffffff81128000>] ? blkdev_direct_IO+0x4e/0x53 > [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b > [<ffffffff810bbf07>] ? generic_file_aio_read+0xeb/0x5b5 > [<ffffffff811103fd>] ? dput+0x26/0xf4 > [<ffffffff81115b87>] ? mntput_no_expire+0x2a/0x134 > [<ffffffff8110b3fc>] ? do_last+0x67d/0x717 > [<ffffffff810ffe44>] ? do_sync_read+0xb4/0xec > [<ffffffff8110051e>] ? vfs_read+0x9f/0xe6 > [<ffffffff811005aa>] ? sys_read+0x45/0x6b > [<ffffffff81364779>] ? system_call_fastpath+0x16/0x1b > Code: Bad RIP value. > RIP [<ffffffff00000001>] 0xffffffff00000000 > RSP <ffff88005a601a58> > CR2: ffffffff00000001 > ---[ end trace b86c49ca25a6cdb2 ]--- > ---------- It looks like the ->merge_bvec_fn is bad - the code is jumping to 0xffffffff00000001, which strongly suggests some function pointer is bad, and merge_bvec_fn is the only one in that area of code. However I cannot see how it could possibly get a bad value like that. There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when you say the problem appeared. However I cannot see how direct IO would be affected any differently to normal IO. If I were to try to debug this I'd build a kernel and put a printk in __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...). I don't know if you are up for that sort of thing... NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels
  2012-08-14  0:50 ` NeilBrown
@ 2012-08-14 18:56   ` Ivan Vasilyev
  2012-08-15  4:44     ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Ivan Vasilyev @ 2012-08-14 18:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid

2012/8/14 NeilBrown <neilb@suse.de>:
> On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
> wrote:
>
>> ---[ end trace b86c49ca25a6cdb2 ]---
>> ----------
>
> It looks like the ->merge_bvec_fn is bad - the code is jumping to
> 0xffffffff00000001, which strongly suggests some function pointer is bad, and
> merge_bvec_fn is the only one in that area of code.
> However I cannot see how it could possibly get a bad value like that.
>
> There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when
> you say the problem appeared.  However I cannot see how direct IO would be
> affected any differently to normal IO.
>
> If I were to try to debug this I'd build a kernel and put a printk in
> __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a
> message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...).

Such a printk is triggered right before the oops:

DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
queuedata=0xffff880058bf1520
backing_dev_info.congested_fn=0xffffffffa011d39a
BUG: unable to handle kernel paging request at ffffffff00000001

although the address is different (so does this mean the bug does not occur
exactly on the merge_bvec_fn() call?)

Checked again - this problem affects only direct I/O:

dd if=/dev/md/rtest_a count=10000 of=/dev/null
=> ok
dd if=/dev/md/rtest_a iflag=direct count=10000 of=/dev/null
=> oops (first since boot)

Linux version 3.6.0-rc1.git6.1.fc18 (via@liber)
(gcc version 4.7.1 (Debian 4.7.1-2) ) #1 SMP Tue Aug 14 21:15:58 SAMT 2012
(in fact no patches from fedora included, just git snapshot)

code:
------------------------------
--- kernel.orig/fs/bio.c	2012-08-14 18:01:51.000000000 +0400
+++ kernel/fs/bio.c	2012-08-14 19:24:37.716746106 +0400
@@ -519,6 +519,10 @@
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
+#define DBG_MBF(q) if (((unsigned long int)(q->merge_bvec_fn)) & 1L) { \
+	printk("DEBUG q-> merge_bvec_fn=0x%pK queue_flags=0x%lx queuedata=0x%pK backing_dev_info.congested_fn=0x%pK \n", \
+		q->merge_bvec_fn, q->queue_flags, q->queuedata, q->backing_dev_info.congested_fn); }
+
 static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 			  *page, unsigned int len, unsigned int offset,
 			  unsigned short max_sectors)
@@ -560,6 +564,7 @@
 				.bi_rw = bio->bi_rw,
 			};
 
+			DBG_MBF(q)
 			if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
 				prev->bv_len -= len;
 				return 0;
@@ -613,6 +618,8 @@
 		 * merge_bvec_fn() returns number of bytes it can accept
 		 * at this offset
 		 */
+
+		DBG_MBF(q)
 		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
 			bvec->bv_page = NULL;
 			bvec->bv_len = 0;
------------------------------

oops:
------------------------------
DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
queuedata=0xffff880058bf1520
backing_dev_info.congested_fn=0xffffffffa011d39a
BUG: unable to handle kernel paging request at ffffffff00000001
IP: [<ffffffff00000001>] 0xffffffff00000000
PGD 160e067 PUD 0
Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc ipv6
crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic
ablk_helper cryptd microcode psmouse pcspkr serio_raw evdev cirrus processor
ttm thermal_sys hwmon virtio_balloon drm_kms_helper drm button syscopyarea
sysfillrect intel_agp sysimgblt
intel_gtt agpgart i2c_piix4 i2c_core ext4 crc16 jbd2 mbcache dm_mod raid10 sr_mod cdrom ata_generic pata_acpi virtio_blk virtio_net floppy ata_piix uhci_hcd libata ehci_hcd virtio_pci scsi_mod virtio_ring virtio CPU 0 Pid: 2242, comm: dd Not tainted 3.6.0-rc1.git6.1.fc18 #1 Bochs Bochs RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 RSP: 0018:ffff88005c2fd9b8 EFLAGS: 00010292 RAX: 0000000000100000 RBX: ffff880058ec8240 RCX: ffff88005b530578 RDX: ffffc90001857040 RSI: ffff88005c2fd9c8 RDI: ffff880058aaa418 RBP: 0000000000000000 R08: ffffc90001857040 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000100000 R12: 0000000000000000 R13: ffff880000000000 R14: 0000000000000200 R15: ffffea000155ba80 FS: 00007fc5b67b7700(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000001 CR3: 00000000583f8000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 2242, threadinfo ffff88005c2fc000, task ffff880058f5c000) Stack: ffff88005c2fda48 ffff88005c2fdb98 ffff8800579e8480 0000000000000400 ffff880000000000 0000000000000000 ffff88005c2fd9f8 ffff88005c2fdb98 ffff880058d00000 ffff88005c2fdb30 0000000000000000 0000000000000000 Call Trace: [<ffffffff8113e627>] ? bio_add_page+0x49/0x50 [<ffffffff81141854>] ? dio_bio_add_page+0x1b/0x53 [<ffffffff811418e9>] ? dio_send_cur_page+0x5d/0xb8 [<ffffffff8114239b>] ? do_blockdev_direct_IO+0x8c7/0xa7a [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff8114259e>] ? __blockdev_direct_IO+0x50/0x52 [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff8113f705>] ? blkdev_direct_IO+0x52/0x54 [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff810cfe17>] ? generic_file_aio_read+0xec/0x5ef [<ffffffff810f5a8f>] ? page_add_new_anon_rmap+0x92/0xa5 [<ffffffff810ead0b>] ? set_pte_at+0x9/0xd [<ffffffff810ede85>] ? handle_pte_fault+0x6f0/0x741 [<ffffffff8111576e>] ? do_sync_read+0x6e/0xab [<ffffffff81115f47>] ? vfs_read+0x98/0xfa [<ffffffff81115fe7>] ? sys_read+0x3e/0x6b [<ffffffff813ab9bd>] ? system_call_fastpath+0x1a/0x1f Code: Bad RIP value. RIP [<ffffffff00000001>] 0xffffffff00000000 RSP <ffff88005c2fd9b8> CR2: ffffffff00000001 ---[ end trace 4261c96a920a2a62 ]--- ------------------------------ ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels
  2012-08-14 18:56 ` Ivan Vasilyev
@ 2012-08-15  4:44   ` NeilBrown
  0 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2012-08-15  4:44 UTC (permalink / raw)
  To: Ivan Vasilyev; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5478 bytes --]

On Tue, 14 Aug 2012 22:56:59 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
wrote:

> 2012/8/14 NeilBrown <neilb@suse.de>:
> > On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
> > wrote:
> >
> >> ---[ end trace b86c49ca25a6cdb2 ]---
> >> ----------
> >
> > It looks like the ->merge_bvec_fn is bad - the code is jumping to
> > 0xffffffff00000001, which strongly suggests some function pointer is bad, and
> > merge_bvec_fn is the only one in that area of code.
> > However I cannot see how it could possibly get a bad value like that.
> >
> > There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when
> > you say the problem appeared.  However I cannot see how direct IO would be
> > affected any differently to normal IO.
> >
> > If I were to try to debug this I'd build a kernel and put a printk in
> > __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a
> > message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...).
>
> Such a printk is triggered right before the oops:
>
> DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
> queuedata=0xffff880058bf1520
> backing_dev_info.congested_fn=0xffffffffa011d39a
> BUG: unable to handle kernel paging request at ffffffff00000001
>
> although the address is different (so does this mean the bug does not occur
> exactly on the merge_bvec_fn() call?)
>
> Checked again - this problem affects only direct I/O:
>
> dd if=/dev/md/rtest_a count=10000 of=/dev/null
> => ok
> dd if=/dev/md/rtest_a iflag=direct count=10000 of=/dev/null
> => oops (first since boot)
>

Hmmm.. not what I expected.

I found it was indeed easy to reproduce and after being sure it was
impossible for half the afternoon I've found the problem.

The following patch fixes it.  I'm not sure yet if that is what I'll
submit upstream.

The problem is that "struct r10bio" isn't by itself big enough.  It is
usually allocated with extra memory at the end.  So when declared on the
stack, the same is needed.
Thanks,
NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 93fe561..12565c3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -659,7 +659,10 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 		max = biovec->bv_len;
 
 	if (mddev->merge_check_needed) {
-		struct r10bio r10_bio;
+		struct {
+			struct r10bio r10_bio;
+			struct r10dev devs[conf->copies];
+		} x;
 		int s;
 		if (conf->reshape_progress != MaxSector) {
 			/* Cannot give any guidance during reshape */
@@ -667,18 +670,18 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 				return biovec->bv_len;
 			return 0;
 		}
-		r10_bio.sector = sector;
-		raid10_find_phys(conf, &r10_bio);
+		x.r10_bio.sector = sector;
+		raid10_find_phys(conf, &x.r10_bio);
 		rcu_read_lock();
 		for (s = 0; s < conf->copies; s++) {
-			int disk = r10_bio.devs[s].devnum;
+			int disk = x.r10_bio.devs[s].devnum;
 			struct md_rdev *rdev = rcu_dereference(
 				conf->mirrors[disk].rdev);
 			if (rdev && !test_bit(Faulty, &rdev->flags)) {
 				struct request_queue *q =
 					bdev_get_queue(rdev->bdev);
 				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio.devs[s].addr
+					bvm->bi_sector = x.r10_bio.devs[s].addr
 						+ rdev->data_offset;
 					bvm->bi_bdev = rdev->bdev;
 					max = min(max, q->merge_bvec_fn(
@@ -690,7 +693,7 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 				struct request_queue *q =
 					bdev_get_queue(rdev->bdev);
 				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio.devs[s].addr
+					bvm->bi_sector = x.r10_bio.devs[s].addr
 						+ rdev->data_offset;
 					bvm->bi_bdev = rdev->bdev;
 					max = min(max, q->merge_bvec_fn(
@@ -4434,14 +4437,17 @@ static int handle_reshape_read_error(struct mddev *mddev,
 {
 	/* Use sync reads to get the blocks from somewhere else */
 	int sectors = r10_bio->sectors;
-	struct r10bio r10b;
 	struct r10conf *conf = mddev->private;
+	struct {
+		struct r10bio r10b;
+		struct r10dev devs[conf->copies];
+	} x;
 	int slot = 0;
 	int idx = 0;
 	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
 
-	r10b.sector = r10_bio->sector;
-	__raid10_find_phys(&conf->prev, &r10b);
+	x.r10b.sector = r10_bio->sector;
+	__raid10_find_phys(&conf->prev, &x.r10b);
 
 	while (sectors) {
 		int s = sectors;
@@ -4452,7 +4458,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			s = PAGE_SIZE >> 9;
 
 		while (!success) {
-			int d = r10b.devs[slot].devnum;
+			int d = x.r10b.devs[slot].devnum;
 			struct md_rdev *rdev = conf->mirrors[d].rdev;
 			sector_t addr;
 			if (rdev == NULL ||
@@ -4460,7 +4466,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			    !test_bit(In_sync, &rdev->flags))
 				goto failed;
 
-			addr = r10b.devs[slot].addr + idx * PAGE_SIZE;
+			addr = x.r10b.devs[slot].addr + idx * PAGE_SIZE;
 			success = sync_page_io(rdev,
 					       addr,
 					       s << 9,
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 007c2c6..1054cf6 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -110,7 +110,7 @@ struct r10bio {
 	 * We choose the number when they are allocated.
 	 * We sometimes need an extra bio to write to the replacement.
 	 */
-	struct {
+	struct r10dev {
 		struct bio *bio;
 		union {
 			struct bio *repl_bio; /* used for resync and

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 4+ messages in thread
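To make the failure mode concrete, below is a minimal user-space sketch of the
pattern described above. The types and names are hypothetical simplifications,
not the real kernel structures, and the sketch leans on the same GCC extensions
the kernel code uses (a zero-length trailing array, and a variable-length array
inside a block-scope struct). A bare on-stack declaration of such a struct gives
the trailing devs[] no storage, so filling it in tramples the caller's stack
frame, which is consistent with the bogus jump target seen in the oops; the
wrapper-struct idiom from the patch reserves that space explicitly.

/*
 * Hypothetical illustration only - simplified stand-ins for r10bio and
 * raid10_find_phys(), not the kernel code.  Build with GCC.
 */
#include <stdio.h>

struct dev_slot { int devnum; long addr; };

struct r10bio_like {
	long sector;
	struct dev_slot devs[0];	/* storage is normally appended by the allocator */
};

/* Stand-in for raid10_find_phys(): fills one slot per copy. */
static void find_phys(struct r10bio_like *r10, int copies)
{
	for (int i = 0; i < copies; i++) {
		r10->devs[i].devnum = i;
		r10->devs[i].addr = r10->sector + i;
	}
}

int main(void)
{
	int copies = 2;			/* plays the role of conf->copies */

	/*
	 * Broken variant (what the old code effectively did):
	 *	struct r10bio_like r10;		// devs[] has zero bytes here
	 *	find_phys(&r10, copies);	// writes past the object, smashing the stack
	 *
	 * Fixed variant, mirroring the patch: reserve the trailing array
	 * explicitly by wrapping both in one local struct.
	 */
	struct {
		struct r10bio_like r10;
		struct dev_slot devs[copies];	/* space for r10.devs[] to spill into */
	} x;

	x.r10.sector = 12345;
	find_phys(&x.r10, copies);
	printf("devnum=%d addr=%ld\n", x.r10.devs[1].devnum, x.r10.devs[1].addr);
	return 0;
}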