* md raid10 Oops on recent kernels @ 2012-08-13 12:49 Ivan Vasilyev 2012-08-14 0:50 ` NeilBrown 0 siblings, 1 reply; 4+ messages in thread From: Ivan Vasilyev @ 2012-08-13 12:49 UTC (permalink / raw) To: linux-raid Hi all, I'm using md raid over LVM on some servers (since EVMS project has proven to be dead), but on kernel versions 3.4 and 3.5 there is a problem with raid10. It can be reproduced on current Debian Wheezy (set up from scratch with 7.0beta1 installer) with kernel package v3.5 taken from experimental repository. Array create, initial sync (after "dd ... of=/dev/md/rtest_a") and --assemble give no errors, but then any directIO on md device causes oops (dd without iflag=direct does not). Seems strange, but V4L capture by uvcvideo driver also freezes after first oops (and resumes only after mdadm --stop on problematic array) Recent LVM2 has built-in RAID (implemented with md driver), but unfortunately raid10 is not supported, so it can't replace current setup. Is this a bug in MD driver or in some other part of the kernel? Will it affect other raid setups in future? (like old one with raid0 layered over raid1) ------------------------------------------------------------ Tested on a KVM guest, so hardware seems to be irrelevant. Config: 1.5Gb memory, 2 vCPUs, 5 virtio disks *** Short summary of commands: vgcreate gurion_vg_jnt /dev/vdb6 /dev/vdc6 /dev/vdd6 /dev/vde6 lvcreate -n rtest_a_c1r -l 129 gurion_vg_jnt /dev/vdb6 ... lvcreate -n rtest_a_c4r -l 129 guiron_vg_jnt /dev/vde6 mdadm --create /dev/md/rtest_a --verbose --metadata=1.2 \ --level=raid10 --raid-devices=4 --name=rtest_a \ --chunk=1024 --bitmap=internal \ /dev/gurion_vg_jnt/rtest_a_c1r /dev/gurion_vg_jnt/rtest_a_c2r \ /dev/gurion_vg_jnt/rtest_a_c3r /dev/gurion_vg_jnt/rtest_a_c4r Linux version 3.5-trunk-amd64 (Debian 3.5-1~experimental.1) (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-1) ) #1 SMP Thu Aug 2 17:16:27 UTC 2012 ii linux-image-3.5-trunk-amd64 3.5-1~experimental.1 ii mdadm 3.2.5-1 (oops is captured after "mdadm --assemble /dev/md/rtest_a" and then "lvs") ---------- BUG: unable to handle kernel paging request at ffffffff00000001 IP: [<ffffffff00000001>] 0xffffffff00000000 PGD 160d067 PUD 0 Oops: 0010 [#1] SMP CPU 0 Modules linked in: appletalk ipx p8023 p8022 psnap llc rose netrom ax25 iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop crc32c_intel ghash_clmulni_intel processor aesni_intel aes_x86_64 i2c_piix4 aes_generic cryptd thermal_sys button snd_pcm i2c_core snd_page_alloc snd_timer snd soundcore psmouse pcspkr serio_raw evdev microcode virtio_balloon ext4 crc16 jbd2 mbcache dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear md_mod sr_mod cdrom ata_generic virtio_net floppy virtio_blk ata_piix uhci_hcd ehci_hcd libata scsi_mod virtio_pci virtio_ring virtio usbcore usb_common [last unloaded: scsi_wait_scan] Pid: 11591, comm: lvs Not tainted 3.5-trunk-amd64 #1 Bochs Bochs RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 RSP: 0018:ffff88005a601a58 EFLAGS: 00010292 RAX: 0000000000100000 RBX: ffff88005cc34c80 RCX: ffff88005d334440 RDX: 0000000000000000 RSI: ffff88005a601a68 RDI: ffff88005b3d1c00 RBP: 0000000000000000 R08: ffffffffa017e99c R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: ffff88005cc34d00 R14: ffffea00010d7d60 R15: 
0000000000000000 FS: 00007fd8fcef77a0(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000001 CR3: 000000005f836000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process lvs (pid: 11591, threadinfo ffff88005a600000, task ffff88005f8ae040) Stack: ffff880054ad0c80 ffffffff81126dec ffff880057065900 0000000000000400 ffffea0000000000 0000000000000000 ffff88005a601b80 ffff8800575ded40 ffff88005a601c20 0000000000000000 0000000000000000 ffffffff811299b5 Call Trace: [<ffffffff81126dec>] ? bio_alloc+0xe/0x1e [<ffffffff811299b5>] ? dio_bio_add_page+0x16/0x4c [<ffffffff81129a51>] ? dio_send_cur_page+0x66/0xa4 [<ffffffff8112a4dc>] ? do_blockdev_direct_IO+0x8cb/0xa81 [<ffffffff8125ed7e>] ? kobj_lookup+0xf6/0x12e [<ffffffff811a13c7>] ? disk_map_sector_rcu+0x5d/0x5d [<ffffffff811a2d9f>] ? disk_clear_events+0x3f/0xe4 [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b [<ffffffff81128000>] ? blkdev_direct_IO+0x4e/0x53 [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b [<ffffffff810bbf07>] ? generic_file_aio_read+0xeb/0x5b5 [<ffffffff811103fd>] ? dput+0x26/0xf4 [<ffffffff81115b87>] ? mntput_no_expire+0x2a/0x134 [<ffffffff8110b3fc>] ? do_last+0x67d/0x717 [<ffffffff810ffe44>] ? do_sync_read+0xb4/0xec [<ffffffff8110051e>] ? vfs_read+0x9f/0xe6 [<ffffffff811005aa>] ? sys_read+0x45/0x6b [<ffffffff81364779>] ? system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [<ffffffff00000001>] 0xffffffff00000000 RSP <ffff88005a601a58> CR2: ffffffff00000001 ---[ end trace b86c49ca25a6cdb2 ]--- ---------- ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels 2012-08-13 12:49 md raid10 Oops on recent kernels Ivan Vasilyev @ 2012-08-14 0:50 ` NeilBrown 2012-08-14 18:56 ` Ivan Vasilyev 0 siblings, 1 reply; 4+ messages in thread From: NeilBrown @ 2012-08-14 0:50 UTC (permalink / raw) To: Ivan Vasilyev; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 6345 bytes --] On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com> wrote: > Hi all, > > I'm using md raid over LVM on some servers (since EVMS project has > proven to be dead), but on kernel versions 3.4 and 3.5 there is a > problem with raid10. > It can be reproduced on current Debian Wheezy (set up from scratch with > 7.0beta1 installer) with kernel package v3.5 taken > from experimental repository. > > Array create, initial sync (after "dd ... of=/dev/md/rtest_a") and > --assemble give no errors, > but then any directIO on md device causes oops (dd without > iflag=direct does not). > Seems strange, but V4L capture by uvcvideo driver also freezes after first oops > (and resumes only after mdadm --stop on problematic array) > > Recent LVM2 has built-in RAID (implemented with md driver), but > unfortunately raid10 is not supported, so it can't replace current > setup. > > Is this a bug in MD driver or in some other part of the kernel? Will it affect > other raid setups in future? (like old one with raid0 layered over raid1) > > > ------------------------------------------------------------ > > Tested on a KVM guest, so hardware seems to be irrelevant. > Config: 1.5Gb memory, 2 vCPUs, 5 virtio disks > > > *** Short summary of commands: > vgcreate gurion_vg_jnt /dev/vdb6 /dev/vdc6 /dev/vdd6 /dev/vde6 > lvcreate -n rtest_a_c1r -l 129 gurion_vg_jnt /dev/vdb6 > ... > lvcreate -n rtest_a_c4r -l 129 guiron_vg_jnt /dev/vde6 > mdadm --create /dev/md/rtest_a --verbose --metadata=1.2 \ > --level=raid10 --raid-devices=4 --name=rtest_a \ > --chunk=1024 --bitmap=internal \ > /dev/gurion_vg_jnt/rtest_a_c1r /dev/gurion_vg_jnt/rtest_a_c2r \ > /dev/gurion_vg_jnt/rtest_a_c3r /dev/gurion_vg_jnt/rtest_a_c4r > > > Linux version 3.5-trunk-amd64 (Debian 3.5-1~experimental.1) > (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-1) ) > #1 SMP Thu Aug 2 17:16:27 UTC 2012 > > ii linux-image-3.5-trunk-amd64 3.5-1~experimental.1 > ii mdadm 3.2.5-1 > > (oops is captured after "mdadm --assemble /dev/md/rtest_a" and then "lvs") > ---------- > BUG: unable to handle kernel paging request at ffffffff00000001 > IP: [<ffffffff00000001>] 0xffffffff00000000 > PGD 160d067 PUD 0 > Oops: 0010 [#1] SMP > CPU 0 > Modules linked in: appletalk ipx p8023 p8022 psnap llc rose netrom > ax25 iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 > nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables nfsd nfs > nfs_acl auth_rpcgss fscache lockd sunrpc loop crc32c_intel > ghash_clmulni_intel processor aesni_intel aes_x86_64 i2c_piix4 > aes_generic cryptd thermal_sys button snd_pcm i2c_core snd_page_alloc > snd_timer snd soundcore psmouse pcspkr serio_raw evdev microcode > virtio_balloon ext4 crc16 jbd2 mbcache dm_mod raid10 raid456 > async_raid6_recov async_memcpy async_pq async_xor xor async_tx > raid6_pq raid1 raid0 multipath linear md_mod sr_mod cdrom ata_generic > virtio_net floppy virtio_blk ata_piix uhci_hcd ehci_hcd libata > scsi_mod virtio_pci virtio_ring virtio usbcore usb_common [last > unloaded: scsi_wait_scan] > > Pid: 11591, comm: lvs Not tainted 3.5-trunk-amd64 #1 Bochs Bochs > RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 
> RSP: 0018:ffff88005a601a58 EFLAGS: 00010292 > RAX: 0000000000100000 RBX: ffff88005cc34c80 RCX: ffff88005d334440 > RDX: 0000000000000000 RSI: ffff88005a601a68 RDI: ffff88005b3d1c00 > RBP: 0000000000000000 R08: ffffffffa017e99c R09: 0000000000000001 > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > R13: ffff88005cc34d00 R14: ffffea00010d7d60 R15: 0000000000000000 > FS: 00007fd8fcef77a0(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffff00000001 CR3: 000000005f836000 CR4: 00000000000407f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process lvs (pid: 11591, threadinfo ffff88005a600000, task ffff88005f8ae040) > Stack: > ffff880054ad0c80 ffffffff81126dec ffff880057065900 0000000000000400 > ffffea0000000000 0000000000000000 ffff88005a601b80 ffff8800575ded40 > ffff88005a601c20 0000000000000000 0000000000000000 ffffffff811299b5 > Call Trace: > [<ffffffff81126dec>] ? bio_alloc+0xe/0x1e > [<ffffffff811299b5>] ? dio_bio_add_page+0x16/0x4c > [<ffffffff81129a51>] ? dio_send_cur_page+0x66/0xa4 > [<ffffffff8112a4dc>] ? do_blockdev_direct_IO+0x8cb/0xa81 > [<ffffffff8125ed7e>] ? kobj_lookup+0xf6/0x12e > [<ffffffff811a13c7>] ? disk_map_sector_rcu+0x5d/0x5d > [<ffffffff811a2d9f>] ? disk_clear_events+0x3f/0xe4 > [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b > [<ffffffff81128000>] ? blkdev_direct_IO+0x4e/0x53 > [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b > [<ffffffff810bbf07>] ? generic_file_aio_read+0xeb/0x5b5 > [<ffffffff811103fd>] ? dput+0x26/0xf4 > [<ffffffff81115b87>] ? mntput_no_expire+0x2a/0x134 > [<ffffffff8110b3fc>] ? do_last+0x67d/0x717 > [<ffffffff810ffe44>] ? do_sync_read+0xb4/0xec > [<ffffffff8110051e>] ? vfs_read+0x9f/0xe6 > [<ffffffff811005aa>] ? sys_read+0x45/0x6b > [<ffffffff81364779>] ? system_call_fastpath+0x16/0x1b > Code: Bad RIP value. > RIP [<ffffffff00000001>] 0xffffffff00000000 > RSP <ffff88005a601a58> > CR2: ffffffff00000001 > ---[ end trace b86c49ca25a6cdb2 ]--- > ---------- It looks like the ->merge_bvec_fn is bad - the code is jumping to 0xffffffff00000001, which strongly suggests some function pointer is bad, and merge_bvec_fn is the only one in that area of code. However I cannot see how it could possibly get a bad value like that. There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when you say the problem appeared. However I cannot see how direct IO would be affected any differently to normal IO. If I were to try to debug this I'd build a kernel and put a printk in __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...). I don't know if you are up for that sort of thing... NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels
  2012-08-14  0:50 ` NeilBrown
@ 2012-08-14 18:56   ` Ivan Vasilyev
  2012-08-15  4:44     ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Ivan Vasilyev @ 2012-08-14 18:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid

2012/8/14 NeilBrown <neilb@suse.de>:
> On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
> wrote:
>
>> ---[ end trace b86c49ca25a6cdb2 ]---
>> ----------
>
> It looks like the ->merge_bvec_fn is bad - the code is jumping to
> 0xffffffff00000001, which strongly suggests some function pointer is bad, and
> merge_bvec_fn is the only one in that area of code.
> However I cannot see how it could possibly get a bad value like that.
>
> There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when
> you say the problem appeared.  However I cannot see how direct IO would be
> affected any differently to normal IO.
>
> If I were to try to debug this I'd build a kernel and put a printk in
> __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a
> message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...).

Such a printk is triggered right before the oops:

DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
queuedata=0xffff880058bf1520
backing_dev_info.congested_fn=0xffffffffa011d39a
BUG: unable to handle kernel paging request at ffffffff00000001

although the address is different (so does this mean the bug does not occur
exactly on the merge_bvec_fn() call?)

Checked again - this problem affects only direct I/O:

dd if=/dev/md/rtest_a count=10000 of=/dev/null
=> ok
dd if=/dev/md/rtest_a iflag=direct count=10000 of=/dev/null
=> oops (first since boot)

Linux version 3.6.0-rc1.git6.1.fc18 (via@liber)
(gcc version 4.7.1 (Debian 4.7.1-2) ) #1 SMP Tue Aug 14 21:15:58 SAMT 2012
(in fact no patches from fedora included, just git snapshot)

code:
------------------------------
--- kernel.orig/fs/bio.c	2012-08-14 18:01:51.000000000 +0400
+++ kernel/fs/bio.c	2012-08-14 19:24:37.716746106 +0400
@@ -519,6 +519,10 @@
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
+#define DBG_MBF(q) if (((unsigned long int)(q->merge_bvec_fn)) & 1L) { \
+	printk("DEBUG q-> merge_bvec_fn=0x%pK queue_flags=0x%lx queuedata=0x%pK backing_dev_info.congested_fn=0x%pK \n", \
+		q->merge_bvec_fn, q->queue_flags, q->queuedata, q->backing_dev_info.congested_fn); }
+
 static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 			  *page, unsigned int len, unsigned int offset,
 			  unsigned short max_sectors)
@@ -560,6 +564,7 @@
 				.bi_rw = bio->bi_rw,
 			};
 
+			DBG_MBF(q)
 			if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
 				prev->bv_len -= len;
 				return 0;
@@ -613,6 +618,8 @@
 		 * merge_bvec_fn() returns number of bytes it can accept
 		 * at this offset
 		 */
+
+		DBG_MBF(q)
 		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
 			bvec->bv_page = NULL;
 			bvec->bv_len = 0;
------------------------------

oops:
------------------------------
DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
queuedata=0xffff880058bf1520
backing_dev_info.congested_fn=0xffffffffa011d39a
BUG: unable to handle kernel paging request at ffffffff00000001
IP: [<ffffffff00000001>] 0xffffffff00000000
PGD 160e067 PUD 0
Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc ipv6
crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic
ablk_helper cryptd microcode psmouse pcspkr serio_raw evdev cirrus processor
ttm thermal_sys hwmon virtio_balloon drm_kms_helper drm button syscopyarea
sysfillrect intel_agp sysimgblt
intel_gtt agpgart i2c_piix4 i2c_core ext4 crc16 jbd2 mbcache dm_mod raid10 sr_mod cdrom ata_generic pata_acpi virtio_blk virtio_net floppy ata_piix uhci_hcd libata ehci_hcd virtio_pci scsi_mod virtio_ring virtio CPU 0 Pid: 2242, comm: dd Not tainted 3.6.0-rc1.git6.1.fc18 #1 Bochs Bochs RIP: 0010:[<ffffffff00000001>] [<ffffffff00000001>] 0xffffffff00000000 RSP: 0018:ffff88005c2fd9b8 EFLAGS: 00010292 RAX: 0000000000100000 RBX: ffff880058ec8240 RCX: ffff88005b530578 RDX: ffffc90001857040 RSI: ffff88005c2fd9c8 RDI: ffff880058aaa418 RBP: 0000000000000000 R08: ffffc90001857040 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000100000 R12: 0000000000000000 R13: ffff880000000000 R14: 0000000000000200 R15: ffffea000155ba80 FS: 00007fc5b67b7700(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000001 CR3: 00000000583f8000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 2242, threadinfo ffff88005c2fc000, task ffff880058f5c000) Stack: ffff88005c2fda48 ffff88005c2fdb98 ffff8800579e8480 0000000000000400 ffff880000000000 0000000000000000 ffff88005c2fd9f8 ffff88005c2fdb98 ffff880058d00000 ffff88005c2fdb30 0000000000000000 0000000000000000 Call Trace: [<ffffffff8113e627>] ? bio_add_page+0x49/0x50 [<ffffffff81141854>] ? dio_bio_add_page+0x1b/0x53 [<ffffffff811418e9>] ? dio_send_cur_page+0x5d/0xb8 [<ffffffff8114239b>] ? do_blockdev_direct_IO+0x8c7/0xa7a [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff8114259e>] ? __blockdev_direct_IO+0x50/0x52 [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff8113f705>] ? blkdev_direct_IO+0x52/0x54 [<ffffffff81140394>] ? blkdev_max_block+0x30/0x30 [<ffffffff810cfe17>] ? generic_file_aio_read+0xec/0x5ef [<ffffffff810f5a8f>] ? page_add_new_anon_rmap+0x92/0xa5 [<ffffffff810ead0b>] ? set_pte_at+0x9/0xd [<ffffffff810ede85>] ? handle_pte_fault+0x6f0/0x741 [<ffffffff8111576e>] ? do_sync_read+0x6e/0xab [<ffffffff81115f47>] ? vfs_read+0x98/0xfa [<ffffffff81115fe7>] ? sys_read+0x3e/0x6b [<ffffffff813ab9bd>] ? system_call_fastpath+0x1a/0x1f Code: Bad RIP value. RIP [<ffffffff00000001>] 0xffffffff00000000 RSP <ffff88005c2fd9b8> CR2: ffffffff00000001 ---[ end trace 4261c96a920a2a62 ]--- ------------------------------ ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: md raid10 Oops on recent kernels
  2012-08-14 18:56 ` Ivan Vasilyev
@ 2012-08-15  4:44   ` NeilBrown
  0 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2012-08-15  4:44 UTC (permalink / raw)
  To: Ivan Vasilyev; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5478 bytes --]

On Tue, 14 Aug 2012 22:56:59 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
wrote:

> 2012/8/14 NeilBrown <neilb@suse.de>:
> > On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@gmail.com>
> > wrote:
> >
> >> ---[ end trace b86c49ca25a6cdb2 ]---
> >> ----------
> >
> > It looks like the ->merge_bvec_fn is bad - the code is jumping to
> > 0xffffffff00000001, which strongly suggests some function pointer is bad, and
> > merge_bvec_fn is the only one in that area of code.
> > However I cannot see how it could possibly get a bad value like that.
> >
> > There were changes to merge_bvec_fn handling in RAID10 in 3.4 which is when
> > you say the problem appeared.  However I cannot see how direct IO would be
> > affected any differently to normal IO.
> >
> > If I were to try to debug this I'd build a kernel and put a printk in
> > __bio_add_page in fs/bio.c just before calling q->merge_bvec_fn to print a
> > message if that value has the low bit set. (i.e. if (q->merge_bvec_fn & 1) ...).
>
> Such a printk is triggered right before the oops:
>
> DEBUG q-> merge_bvec_fn=0xffffffffa011a1c3 queue_flags=0x40
> queuedata=0xffff880058bf1520
> backing_dev_info.congested_fn=0xffffffffa011d39a
> BUG: unable to handle kernel paging request at ffffffff00000001
>
> although the address is different (so does this mean the bug does not occur
> exactly on the merge_bvec_fn() call?)
>
> Checked again - this problem affects only direct I/O:
>
> dd if=/dev/md/rtest_a count=10000 of=/dev/null
> => ok
> dd if=/dev/md/rtest_a iflag=direct count=10000 of=/dev/null
> => oops (first since boot)
>

Hmmm.. not what I expected.

I found it was indeed easy to reproduce and after being sure it was
impossible for half the afternoon I've found the problem.

The following patch fixes it.  I'm not sure yet if that is what I'll
submit upstream.

The problem is that "struct r10bio" isn't by itself big enough.  It is
usually allocated with extra memory at the end.  So when declared on the
stack, the same is needed.
Thanks,
NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 93fe561..12565c3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -659,7 +659,10 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 		max = biovec->bv_len;
 
 	if (mddev->merge_check_needed) {
-		struct r10bio r10_bio;
+		struct {
+			struct r10bio r10_bio;
+			struct r10dev devs[conf->copies];
+		} x;
 		int s;
 		if (conf->reshape_progress != MaxSector) {
 			/* Cannot give any guidance during reshape */
@@ -667,18 +670,18 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 				return biovec->bv_len;
 			return 0;
 		}
-		r10_bio.sector = sector;
-		raid10_find_phys(conf, &r10_bio);
+		x.r10_bio.sector = sector;
+		raid10_find_phys(conf, &x.r10_bio);
 		rcu_read_lock();
 		for (s = 0; s < conf->copies; s++) {
-			int disk = r10_bio.devs[s].devnum;
+			int disk = x.r10_bio.devs[s].devnum;
 			struct md_rdev *rdev = rcu_dereference(
 				conf->mirrors[disk].rdev);
 			if (rdev && !test_bit(Faulty, &rdev->flags)) {
 				struct request_queue *q =
 					bdev_get_queue(rdev->bdev);
 				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio.devs[s].addr
+					bvm->bi_sector = x.r10_bio.devs[s].addr
 						+ rdev->data_offset;
 					bvm->bi_bdev = rdev->bdev;
 					max = min(max, q->merge_bvec_fn(
@@ -690,7 +693,7 @@ static int raid10_mergeable_bvec(struct request_queue *q,
 				struct request_queue *q =
 					bdev_get_queue(rdev->bdev);
 				if (q->merge_bvec_fn) {
-					bvm->bi_sector = r10_bio.devs[s].addr
+					bvm->bi_sector = x.r10_bio.devs[s].addr
 						+ rdev->data_offset;
 					bvm->bi_bdev = rdev->bdev;
 					max = min(max, q->merge_bvec_fn(
@@ -4434,14 +4437,17 @@ static int handle_reshape_read_error(struct mddev *mddev,
 {
 	/* Use sync reads to get the blocks from somewhere else */
 	int sectors = r10_bio->sectors;
-	struct r10bio r10b;
 	struct r10conf *conf = mddev->private;
+	struct {
+		struct r10bio r10b;
+		struct r10dev devs[conf->copies];
+	} x;
 	int slot = 0;
 	int idx = 0;
 	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
 
-	r10b.sector = r10_bio->sector;
-	__raid10_find_phys(&conf->prev, &r10b);
+	x.r10b.sector = r10_bio->sector;
+	__raid10_find_phys(&conf->prev, &x.r10b);
 
 	while (sectors) {
 		int s = sectors;
@@ -4452,7 +4458,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			s = PAGE_SIZE >> 9;
 
 		while (!success) {
-			int d = r10b.devs[slot].devnum;
+			int d = x.r10b.devs[slot].devnum;
 			struct md_rdev *rdev = conf->mirrors[d].rdev;
 			sector_t addr;
 			if (rdev == NULL ||
@@ -4460,7 +4466,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			    !test_bit(In_sync, &rdev->flags))
 				goto failed;
 
-			addr = r10b.devs[slot].addr + idx * PAGE_SIZE;
+			addr = x.r10b.devs[slot].addr + idx * PAGE_SIZE;
 			success = sync_page_io(rdev,
 					       addr,
 					       s << 9,
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 007c2c6..1054cf6 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -110,7 +110,7 @@ struct r10bio {
 	 * We choose the number when they are allocated.
 	 * We sometimes need an extra bio to write to the replacement.
 	 */
-	struct {
+	struct r10dev {
 		struct bio *bio;
 		union {
 			struct bio *repl_bio; /* used for resync and

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 4+ messages in thread
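To make the failure mode concrete, below is a minimal user-space sketch of the
pattern described above. The types and names are hypothetical simplifications,
not the real kernel structures, and the sketch leans on the same GCC extensions
the kernel code uses (a zero-length trailing array, and a variable-length array
inside a block-scope struct). A bare on-stack declaration of such a struct gives
the trailing devs[] no storage, so filling it in tramples the caller's stack
frame, which is consistent with the bogus jump target seen in the oops; the
wrapper-struct idiom from the patch reserves that space explicitly.

/*
 * Hypothetical illustration only - simplified stand-ins for r10bio and
 * raid10_find_phys(), not the kernel code.  Build with GCC.
 */
#include <stdio.h>

struct dev_slot { int devnum; long addr; };

struct r10bio_like {
	long sector;
	struct dev_slot devs[0];	/* storage is normally appended by the allocator */
};

/* Stand-in for raid10_find_phys(): fills one slot per copy. */
static void find_phys(struct r10bio_like *r10, int copies)
{
	for (int i = 0; i < copies; i++) {
		r10->devs[i].devnum = i;
		r10->devs[i].addr = r10->sector + i;
	}
}

int main(void)
{
	int copies = 2;			/* plays the role of conf->copies */

	/*
	 * Broken variant (what the old code effectively did):
	 *	struct r10bio_like r10;		// devs[] has zero bytes here
	 *	find_phys(&r10, copies);	// writes past the object, smashing the stack
	 *
	 * Fixed variant, mirroring the patch: reserve the trailing array
	 * explicitly by wrapping both in one local struct.
	 */
	struct {
		struct r10bio_like r10;
		struct dev_slot devs[copies];	/* space for r10.devs[] to spill into */
	} x;

	x.r10.sector = 12345;
	find_phys(&x.r10, copies);
	printf("devnum=%d addr=%ld\n", x.r10.devs[1].devnum, x.r10.devs[1].addr);
	return 0;
}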