* Raid10 device hangs during resync and heavy I/O.
@ 2010-07-16 18:46 Justin Bronder
  2010-07-16 18:49 ` Justin Bronder
  2010-07-22 18:49 ` Justin Bronder
  0 siblings, 2 replies; 9+ messages in thread
From: Justin Bronder @ 2010-07-16 18:46 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 9135 bytes --]

I've been able to reproduce this across a number of machines with the same
hardware configuration.  During a raid10 resync, it's possible to hang the
device so that any further I/O operations will also block.  This can be
fairly simply done using dd.

Interestingly, this is not reproducible when using a non-partitioned device.
That is, creating the device with --auto=yes and then directly using it
functions as expected.  However, using --auto=yes or --auto=mdp and then
creating a partition across the device will cause the hang.

From all appearances, this is not just slow I/O: days later the same tasks
are still blocked.  The rest of the system continues to function normally,
including other raid devices.

Below I'm going to include the script I'm using to reproduce the hang, the
relevant kernel tracebacks, and /proc/mdstat.  Thanks in advance for any
help resolving this.

=== md10-hang.sh ===
#!/bin/bash

MDP=false
# Pick two unused drives here.
MD_DRIVES="sdc sdd"
if ${MDP}; then
	MD_DEV="md_d99"
else
	MD_DEV="md99"
fi
M="/mnt/mdmount"
SIZE=8192

die () {
	echo
	echo "ERROR: $*"
	echo
	exit 1
}

mkraid() {
	local d
	local drives
	local mdargs="--auto=yes"
	${MDP} && mdargs="--auto=mdp"

	mkdir -p ${M}
	umount -f ${M} &>/dev/null
	mdadm --stop /dev/md_d99 &>/dev/null
	mdadm --stop /dev/md99 &>/dev/null

	for d in ${MD_DRIVES}; do
		sfdisk -uM /dev/${d} <<-EOF
		,${SIZE},83
		,,83
		EOF
		mdadm --zero-superblock /dev/${d}1 &>/dev/null
		drives="${drives} /dev/${d}1"
	done

	mdadm --create /dev/${MD_DEV} \
		--run \
		--force \
		--level=10 \
		--layout=f2 \
		--raid-devices=2 \
		${mdargs} ${drives} || die "mdadm --create failed"

	if ${MDP}; then
		printf ",,83\n" | sfdisk -uM /dev/${MD_DEV}
		mkfs.ext2 -q /dev/${MD_DEV}p1
		mount /dev/${MD_DEV}p1 ${M} || die "Mount failed"
	else
		printf ",,83\n" | sfdisk -uM /dev/${MD_DEV}
		mkfs.ext2 -q /dev/${MD_DEV}p1
		mount /dev/${MD_DEV}p1 ${M} || die "Mount failed"
	fi

	echo "Creating tmp file"
	dd if=/dev/zero of=${M}/tmpfile bs=1M count=4000
}

mkraid

i=1
while [ "$(</sys/block/${MD_DEV}/md/sync_action)" != "idle" ]; do
	echo "Attempt ${i} to cause crash"
	cat /proc/mdstat
	dd if=${M}/tmpfile of=${M}/cpfile bs=1M
	i=$((i+1))
done

=== kernel trace ===

[ 9002.405247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9002.433361]  ffff88025436fc30 0000000000000046 ffff88025436fc10 ffff880254616800
[ 9002.460415]  ffff88025d40dd70 ffff88025d40a3f0 0000000354616800 00000000000de600
[ 9002.487497]  ffff88025436fc10 ffff8801570343c0 ffff880157034420 ffff880157034448
[ 9002.514575] Call Trace:
[ 9002.526609]  [<ffffffff81320efb>] raise_barrier+0x167/0x1a3
[ 9002.548139]  [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9002.571218]  [<ffffffff813238e1>] sync_request+0x57d/0x8a8
[ 9002.592430]  [<ffffffff81320ca5>] ? raid10_unplug+0x24/0x28
[ 9002.613833]  [<ffffffff8132ad63>] ? md_thread+0x0/0xe8
[ 9002.633938]  [<ffffffff8132dab2>] md_do_sync+0x685/0xa9d
[ 9002.654556]  [<ffffffff8132ad63>] ? md_thread+0x0/0xe8
[ 9002.674650]  [<ffffffff8132ae31>] md_thread+0xce/0xe8
[ 9002.694435]  [<ffffffff81034aa6>] ? spin_unlock_irqrestore+0x9/0xb
[ 9002.717583]  [<ffffffff81056cc0>] kthread+0x69/0x71
[ 9002.736753]  [<ffffffff810037e4>] kernel_thread_helper+0x4/0x10
[ 9002.759062]  [<ffffffff81056c57>] ? kthread+0x0/0x71
[ 9002.778478]  [<ffffffff810037e0>] ? kernel_thread_helper+0x0/0x10
[ 9002.801286] INFO: task flush-9:99:5896 blocked for more than 120 seconds.
[ 9002.826287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9002.854475]  ffff88012fa8b870 0000000000000046 ffff88012fa8b850 ffff880254616800
[ 9002.881589]  ffff88025d40ebd0 ffff88025d40a3f0 000000036fb73300 0000000000000001
[ 9002.908691]  ffff88012fa8b850 ffff8801570343c0 ffff880157034420 ffff880157034448
[ 9002.935778] Call Trace:
[ 9002.947781]  [<ffffffff81320d5b>] wait_barrier+0xa7/0xe0
[ 9002.968438]  [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9002.991452]  [<ffffffff8132163e>] make_request+0x121/0x507
[ 9003.012697]  [<ffffffff8132d2aa>] md_make_request+0xc7/0x101
[ 9003.034515]  [<ffffffff811dc817>] generic_make_request+0x1af/0x276
[ 9003.057953]  [<ffffffff811dda3b>] submit_bio+0x9e/0xa7
[ 9003.078197]  [<ffffffff810e950d>] submit_bh+0x11b/0x13f
[ 9003.098648]  [<ffffffff810ebba9>] __block_write_full_page+0x20b/0x310
[ 9003.122755]  [<ffffffff810ec383>] ? end_buffer_async_write+0x0/0x13a
[ 9003.146576]  [<ffffffff810ef5b2>] ? blkdev_get_block+0x0/0x50
[ 9003.168612]  [<ffffffff810ec383>] ? end_buffer_async_write+0x0/0x13a
[ 9003.192521]  [<ffffffff810ef5b2>] ? blkdev_get_block+0x0/0x50
[ 9003.214684]  [<ffffffff810ebd30>] block_write_full_page_endio+0x82/0x8e
[ 9003.239426]  [<ffffffff810ebd4c>] block_write_full_page+0x10/0x12
[ 9003.262539]  [<ffffffff810eea92>] blkdev_writepage+0x13/0x15
[ 9003.284297]  [<ffffffff8109e005>] __writepage+0x12/0x2b
[ 9003.304718]  [<ffffffff8109e46c>] write_cache_pages+0x1fa/0x306
[ 9003.327254]  [<ffffffff8109dff3>] ? __writepage+0x0/0x2b
[ 9003.347976]  [<ffffffff810e9f65>] ? mark_buffer_dirty+0x85/0x89
[ 9003.370488]  [<ffffffff8109e597>] generic_writepages+0x1f/0x25
[ 9003.392661]  [<ffffffff8109e5b9>] do_writepages+0x1c/0x25
[ 9003.413477]  [<ffffffff810e43e0>] writeback_single_inode+0xb0/0x1c7
[ 9003.436962]  [<ffffffff810e4b5a>] writeback_inodes_wb+0x2bf/0x35a
[ 9003.459949]  [<ffffffff810e4d1a>] wb_writeback+0x125/0x1a1
[ 9003.481145]  [<ffffffff810e4f66>] wb_do_writeback+0x138/0x14f
[ 9003.503124]  [<ffffffff810ab3e7>] ? bdi_start_fn+0x0/0xca
[ 9003.524024]  [<ffffffff810e4fa4>] bdi_writeback_task+0x27/0x92
[ 9003.546241]  [<ffffffff810ab44c>] bdi_start_fn+0x65/0xca
[ 9003.566875]  [<ffffffff81056cc0>] kthread+0x69/0x71
[ 9003.586179]  [<ffffffff810037e4>] kernel_thread_helper+0x4/0x10
[ 9003.608532]  [<ffffffff81056c57>] ? kthread+0x0/0x71
[ 9003.627949]  [<ffffffff810037e0>] ? kernel_thread_helper+0x0/0x10
[ 9003.650791] INFO: task dd:5912 blocked for more than 120 seconds.
[ 9003.673621] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9003.701721]  ffff88024c84d7b8 0000000000000082 ffff88024c84d798 ffff880254616800
[ 9003.728793]  ffff88025d6840b0 ffff88025f065640 0000000296364968 0000000000000000
[ 9003.755893]  000000014c84d798 ffff8801570343c0 ffff880157034420 ffff880157034448
[ 9003.773323] Call Trace:
[ 9003.773326]  [<ffffffff81320d5b>] wait_barrier+0xa7/0xe0
[ 9003.773328]  [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9003.773330]  [<ffffffff8132163e>] make_request+0x121/0x507
[ 9003.773332]  [<ffffffff810edbd7>] ? bio_split+0xca/0x183
[ 9003.773334]  [<ffffffff813215d5>] make_request+0xb8/0x507
[ 9003.773337]  [<ffffffff811d780d>] ? __elv_add_request+0xa1/0xaa
[ 9003.773339]  [<ffffffff8132d2aa>] md_make_request+0xc7/0x101
[ 9003.773341]  [<ffffffff811dc817>] generic_make_request+0x1af/0x276
[ 9003.773343]  [<ffffffff810ed885>] ? bio_alloc_bioset+0x70/0xc0
[ 9003.773345]  [<ffffffff811dda3b>] submit_bio+0x9e/0xa7
[ 9003.773347]  [<ffffffff810f0d0b>] mpage_bio_submit+0x22/0x26
[ 9003.773349]  [<ffffffff810f17df>] do_mpage_readpage+0x462/0x54e
[ 9003.773352]  [<ffffffff8109fb21>] ? get_page+0x9/0xf
[ 9003.773354]  [<ffffffff810a004d>] ? __lru_cache_add+0x40/0x58
[ 9003.773357]  [<ffffffff8112c194>] ? ext2_get_block+0x0/0x78a
[ 9003.773359]  [<ffffffff810f1a66>] mpage_readpages+0xc9/0x10f
[ 9003.773361]  [<ffffffff8112c194>] ? ext2_get_block+0x0/0x78a
[ 9003.773363]  [<ffffffff81001d89>] ? __switch_to+0x10e/0x1e1
[ 9003.773366]  [<ffffffff8112b40c>] ext2_readpages+0x1a/0x1c
[ 9003.773368]  [<ffffffff8109f4d0>] __do_page_cache_readahead+0xf6/0x191
[ 9003.773370]  [<ffffffff8109f587>] ra_submit+0x1c/0x20
[ 9003.773372]  [<ffffffff8109f7e3>] ondemand_readahead+0x17b/0x18e
[ 9003.773374]  [<ffffffff8109f870>] page_cache_async_readahead+0x7a/0xa2
[ 9003.773379]  [<ffffffff81098a59>] generic_file_aio_read+0x26e/0x55d
[ 9003.773382]  [<ffffffff810cb32e>] do_sync_read+0xc2/0x106
[ 9003.773384]  [<ffffffff810a009d>] ? lru_cache_add_lru+0x38/0x3d
[ 9003.773387]  [<ffffffff8100338e>] ? apic_timer_interrupt+0xe/0x20
[ 9003.773389]  [<ffffffff810cb980>] vfs_read+0xa4/0xde
[ 9003.773391]  [<ffffffff810cbc02>] sys_read+0x47/0x6d
[ 9003.773393]  [<ffffffff81002a42>] system_call_fastpath+0x16/0x1b

=== /proc/mdstat ===

Personalities : [raid1] [raid10]
md99 : active raid10 sdd1[1] sdc1[0]
      8393856 blocks 64K chunks 2 far-copies [2/2] [UU]
      [=>...................]  resync =  5.4% (455360/8393856) finish=3938.0min speed=33K/sec

md1 : active raid10 sda2[0] sdb2[1]
      976703488 blocks 512K chunks 2 far-copies [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      56128 blocks [2/2] [UU]

unused devices: <none>

-- 
Justin Bronder

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-07-16 18:46 Raid10 device hangs during resync and heavy I/O. Justin Bronder
@ 2010-07-16 18:49 ` Justin Bronder
  2010-07-22 18:49 ` Justin Bronder
  1 sibling, 0 replies; 9+ messages in thread
From: Justin Bronder @ 2010-07-16 18:49 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1107 bytes --]

On 16/07/10 14:46 -0400, Justin Bronder wrote:
> I've been able to reproduce this across a number of machines with the same
> hardware configuration.  During a raid10 resync, it's possible to hang the
> device so that any further I/O operations will also block.  This can be
> fairly simply done using dd.
>
> Interestingly, this is not reproducible when using a non-partitioned device.
> That is, creating the device with --auto=yes and then directly using it
> functions as expected.  However, using --auto=yes or --auto=mdp and then
> creating a partition across the device will cause the hang.
>
> From all appearances, this is not just slow I/O: days later the same tasks
> are still blocked.  The rest of the system continues to function normally,
> including other raid devices.
>
> Below I'm going to include the script I'm using to reproduce the hang, the
> relevant kernel tracebacks, and /proc/mdstat.  Thanks in advance for any
> help resolving this.

My apologies, I should have proofread this first.  I've reproduced this
using 2.6.28, 2.6.32 and 2.6.34.1.

-- 
Justin Bronder

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-07-16 18:46 Raid10 device hangs during resync and heavy I/O. Justin Bronder
  2010-07-16 18:49 ` Justin Bronder
@ 2010-07-22 18:49 ` Justin Bronder
  2010-07-23  3:19   ` Neil Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Justin Bronder @ 2010-07-22 18:49 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 13002 bytes --]

On 16/07/10 14:46 -0400, Justin Bronder wrote:
> I've been able to reproduce this across a number of machines with the same
> hardware configuration.  During a raid10 resync, it's possible to hang the
> device so that any further I/O operations will also block.  This can be
> fairly simply done using dd.
>
> Interestingly, this is not reproducible when using a non-partitioned device.
> That is, creating the device with --auto=yes and then directly using it
> functions as expected.  However, using --auto=yes or --auto=mdp and then
> creating a partition across the device will cause the hang.

I've done some more research that may potentially help.  All of the
following was done with 2.6.34.1.

Still produces the hang:
  - Using cp (may take a bit longer).
  - Using jfs as the filesystem.
  - Dropping RESYNC_DEPTH to 32.
  - Using the offset layout.

Does not produce the hang:
  - Using the near layout.
  - Using dd on the partition directly instead of on a filesystem,
    via something like:
      dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M

As the barrier code is very similar, I repeated a number of these tests
using raid1 instead of raid10.  In every case, I was unable to cause the
system to hang.  I focused on the barriers due to the tracebacks in the
previous email.  For the heck of it, I added some tracing (patch below)
where the reason for the hang is fairly obvious.  Of course, how it
happened isn't.

The last bit of the trace before the hang:
<idle>-0 [003] 188.987489: allow_barrier: swapper - w:2 p:1 b:1
<idle>-0 [003] 188.987495: allow_barrier: swapper - w:2 p:0 b:1
<...>-4910 [003] 188.987518: raise_barrier: out: md99_resync - w:2 p:0 b:1
<...>-4910 [003] 188.987733: raise_barrier: in: md99_resync - w:2 p:0 b:1
<...>-4909 [003] 188.988174: lower_barrier: md99_raid10 - w:2 p:0 b:0
<...>-4915 [002] 188.988188: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
<...>-4996 [003] 188.988199: wait_barrier: out: dd - w:0 p:2 b:0
<...>-4915 [002] 188.988203: wait_barrier: in: flush-9:99 - w:0 p:2 b:0
<...>-4915 [002] 188.988204: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
<...>-4915 [002] 188.988208: wait_barrier: in: flush-9:99 - w:0 p:3 b:0
<...>-4915 [002] 188.988209: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
<...>-4996 [003] 188.988211: wait_barrier: in: dd - w:0 p:4 b:0
<...>-4996 [003] 188.988211: wait_barrier: out: dd - w:0 p:5 b:0
<...>-4996 [003] 188.988214: wait_barrier: in: dd - w:0 p:5 b:0
<...>-4996 [003] 188.988214: wait_barrier: out: dd - w:0 p:6 b:0
<...>-4915 [002] 188.988234: wait_barrier: in: flush-9:99 - w:0 p:6 b:0
<...>-4915 [002] 188.988235: wait_barrier: out: flush-9:99 - w:0 p:7 b:0
<...>-4915 [002] 188.988244: wait_barrier: in: flush-9:99 - w:0 p:7 b:0
<...>-4915 [002] 188.988244: wait_barrier: out: flush-9:99 - w:0 p:8 b:0
<...>-4915 [002] 188.988248: wait_barrier: in: flush-9:99 - w:0 p:8 b:0
<...>-4915 [002] 188.988249: wait_barrier: out: flush-9:99 - w:0 p:9 b:0
<...>-4996 [003] 188.988251: wait_barrier: in: dd - w:0 p:9 b:0
<...>-4996 [003] 188.988252: wait_barrier: out: dd - w:0 p:10 b:0
<...>-4996 [003] 188.988261: wait_barrier: in: dd - w:0 p:10 b:0
<...>-4996 [003] 188.988262: wait_barrier: out: dd - w:0 p:11 b:0
<...>-4996 [003] 188.988264: wait_barrier: in: dd - w:0 p:11 b:0
<...>-4996 [003] 188.988264: wait_barrier: out: dd - w:0 p:12 b:0
<...>-4915 [002] 188.988272: wait_barrier: in: flush-9:99 - w:0 p:12 b:0
<...>-4915 [002] 188.988272: wait_barrier: out: flush-9:99 - w:0 p:13 b:0
<...>-4915 [002] 188.988281: wait_barrier: in: flush-9:99 - w:0 p:13 b:0
<...>-4915 [002] 188.988281: wait_barrier: out: flush-9:99 - w:0 p:14 b:0
<...>-4996 [003] 188.988282: wait_barrier: in: dd - w:0 p:14 b:0
<...>-4996 [003] 188.988283: wait_barrier: out: dd - w:0 p:15 b:0
<...>-4915 [002] 188.988285: wait_barrier: in: flush-9:99 - w:0 p:15 b:0
<...>-4915 [002] 188.988286: wait_barrier: out: flush-9:99 - w:0 p:16 b:0
<...>-4996 [003] 188.988290: wait_barrier: in: dd - w:0 p:16 b:0
<...>-4996 [003] 188.988290: wait_barrier: out: dd - w:0 p:17 b:0
<...>-4996 [003] 188.988292: wait_barrier: in: dd - w:0 p:17 b:0
<...>-4996 [003] 188.988293: wait_barrier: out: dd - w:0 p:18 b:0
<...>-4915 [002] 188.988309: wait_barrier: in: flush-9:99 - w:0 p:18 b:0
<...>-4915 [002] 188.988310: wait_barrier: out: flush-9:99 - w:0 p:19 b:0
<...>-4996 [003] 188.988310: wait_barrier: in: dd - w:0 p:19 b:0
<...>-4996 [003] 188.988311: wait_barrier: out: dd - w:0 p:20 b:0
<...>-4996 [003] 188.988317: wait_barrier: in: dd - w:0 p:20 b:0
<...>-4996 [003] 188.988318: wait_barrier: out: dd - w:0 p:21 b:0
<...>-4996 [003] 188.988321: wait_barrier: in: dd - w:0 p:21 b:0
<...>-4996 [003] 188.988321: wait_barrier: out: dd - w:0 p:22 b:0
<...>-4915 [002] 188.988323: allow_barrier: flush-9:99 - w:0 p:21 b:0
<...>-4996 [003] 188.988327: wait_barrier: in: dd - w:0 p:21 b:0
<...>-4996 [003] 188.988328: wait_barrier: out: dd - w:0 p:22 b:0
<...>-4915 [002] 188.988356: wait_barrier: in: flush-9:99 - w:0 p:22 b:0
<...>-4915 [002] 188.988356: wait_barrier: out: flush-9:99 - w:0 p:23 b:0
<...>-4915 [002] 188.988361: wait_barrier: in: flush-9:99 - w:0 p:23 b:0
<...>-4915 [002] 188.988361: wait_barrier: out: flush-9:99 - w:0 p:24 b:0
<...>-4910 [003] 188.988370: raise_barrier: mid: md99_resync - w:0 p:24 b:1
<...>-4915 [002] 188.988378: wait_barrier: in: flush-9:99 - w:0 p:24 b:1
<...>-4996 [003] 188.988638: wait_barrier: in: dd - w:1 p:24 b:1
<idle>-0 [003] 188.988887: allow_barrier: swapper - w:2 p:23 b:1
<idle>-0 [003] 188.988895: allow_barrier: swapper - w:2 p:22 b:1
<idle>-0 [003] 188.988899: allow_barrier: swapper - w:2 p:21 b:1
<idle>-0 [003] 188.988905: allow_barrier: swapper - w:2 p:20 b:1
<idle>-0 [003] 188.988909: allow_barrier: swapper - w:2 p:19 b:1
<idle>-0 [002] 188.996299: allow_barrier: swapper - w:2 p:18 b:1
<idle>-0 [002] 188.996310: allow_barrier: swapper - w:2 p:17 b:1
<idle>-0 [002] 188.996317: allow_barrier: swapper - w:2 p:16 b:1
<idle>-0 [002] 188.996325: allow_barrier: swapper - w:2 p:15 b:1
<idle>-0 [002] 188.996330: allow_barrier: swapper - w:2 p:14 b:1
<idle>-0 [002] 188.996336: allow_barrier: swapper - w:2 p:13 b:1
<idle>-0 [002] 188.996340: allow_barrier: swapper - w:2 p:12 b:1
<idle>-0 [003] 189.004270: allow_barrier: swapper - w:2 p:11 b:1
<idle>-0 [003] 189.004278: allow_barrier: swapper - w:2 p:10 b:1
<idle>-0 [003] 189.004284: allow_barrier: swapper - w:2 p:9 b:1
<idle>-0 [003] 189.004302: allow_barrier: swapper - w:2 p:8 b:1
<idle>-0 [003] 189.004313: allow_barrier: swapper - w:2 p:7 b:1
<idle>-0 [003] 189.004322: allow_barrier: swapper - w:2 p:6 b:1
<idle>-0 [002] 189.004936: allow_barrier: swapper - w:2 p:5 b:1
<idle>-0 [002] 189.004959: allow_barrier: swapper - w:2 p:4 b:1
<idle>-0 [002] 189.004970: allow_barrier: swapper - w:2 p:3 b:1
<idle>-0 [002] 189.004979: allow_barrier: swapper - w:2 p:2 b:1
<idle>-0 [002] 189.004996: allow_barrier: swapper - w:2 p:1 b:1
<idle>-0 [002] 189.005006: allow_barrier: swapper - w:2 p:0 b:1
<...>-4910 [002] 189.005030: raise_barrier: out: md99_resync - w:2 p:0 b:1
<...>-4910 [002] 189.005251: raise_barrier: in: md99_resync - w:2 p:0 b:1
<...>-4909 [002] 189.005676: lower_barrier: md99_raid10 - w:2 p:0 b:0
<...>-4915 [003] 189.005687: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
<...>-4996 [003] 189.005713: wait_barrier: out: dd - w:0 p:2 b:0
<...>-4910 [002] 189.005715: raise_barrier: mid: md99_resync - w:0 p:2 b:1
<...>-4996 [003] 189.005724: wait_barrier: in: dd - w:0 p:2 b:1
<idle>-0 [002] 189.012925: allow_barrier: swapper - w:1 p:1 b:1
<idle>-0 [002] 189.013271: allow_barrier: swapper - w:1 p:0 b:1
<...>-4910 [002] 189.013295: raise_barrier: out: md99_resync - w:1 p:0 b:1
<...>-4910 [002] 189.013414: raise_barrier: in: md99_resync - w:1 p:0 b:1
<...>-4909 [002] 189.021171: lower_barrier: md99_raid10 - w:1 p:0 b:0
<...>-4996 [003] 189.021179: wait_barrier: out: dd - w:0 p:1 b:0
<...>-4910 [002] 189.021182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
<...>-4996 [003] 189.021184: wait_barrier: in: dd - w:0 p:1 b:1
<...>-4915 [003] 218.929283: wait_barrier: in: flush-9:99 - w:1 p:1 b:1

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ad945cc..35ad593 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -22,6 +22,7 @@
 #include <linux/delay.h>
 #include <linux/blkdev.h>
 #include <linux/seq_file.h>
+#include <linux/sched.h>
 #include "md.h"
 #include "raid10.h"
 #include "bitmap.h"
@@ -704,6 +705,9 @@ static void raise_barrier(conf_t *conf, int force)
 	BUG_ON(force && !conf->barrier);

 	spin_lock_irq(&conf->resync_lock);
+	if ( conf->mddev->md_minor == 99 )
+		trace_printk("in: %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
+
 	/* Wait until no block IO is waiting (unless 'force') */
 	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
 			    conf->resync_lock,
@@ -712,12 +716,18 @@ static void raise_barrier(conf_t *conf, int force)
 	/* block any new IO from starting */
 	conf->barrier++;

+	if ( conf->mddev->md_minor == 99 )
+		trace_printk("mid: %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
+
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
 			    conf->resync_lock,
 			    raid10_unplug(conf->mddev->queue));

+	if ( conf->mddev->md_minor == 99 )
+		trace_printk("out: %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
+
 	spin_unlock_irq(&conf->resync_lock);
 }

@@ -726,6 +736,8 @@ static void lower_barrier(conf_t *conf)
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
 	conf->barrier--;
+	if ( conf->mddev->md_minor == 99 )
+		trace_printk(" %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
@@ -733,7 +745,9 @@ static void lower_barrier(conf_t *conf)
 static void wait_barrier(conf_t *conf)
 {
 	spin_lock_irq(&conf->resync_lock);
-	if (conf->barrier) {
+	if ( conf->mddev->md_minor == 99 )
+		trace_printk("in: %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
+	if (conf->barrier) {
 		conf->nr_waiting++;
 		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
 				    conf->resync_lock,
@@ -741,6 +755,8 @@ static void wait_barrier(conf_t *conf)
 		conf->nr_waiting--;
 	}
 	conf->nr_pending++;
+	if ( conf->mddev->md_minor == 99 )
+		trace_printk("out: %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }

@@ -749,6 +765,8 @@ static void allow_barrier(conf_t *conf)
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
 	conf->nr_pending--;
+	if ( conf->mddev->md_minor == 99 )
+		trace_printk(" %s - w:%u p:%u b:%u\n", current->comm, conf->nr_waiting, conf->nr_pending, conf->barrier);
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }

-- 
Justin Bronder

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-07-22 18:49 ` Justin Bronder
@ 2010-07-23  3:19   ` Neil Brown
  2010-07-23 15:47     ` Justin Bronder
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-07-23 3:19 UTC (permalink / raw)
  To: Justin Bronder; +Cc: linux-raid

On Thu, 22 Jul 2010 14:49:33 -0400
Justin Bronder <jsbronder@gentoo.org> wrote:

> On 16/07/10 14:46 -0400, Justin Bronder wrote:
> > I've been able to reproduce this across a number of machines with the same
> > hardware configuration.  During a raid10 resync, it's possible to hang the
> > device so that any further I/O operations will also block.  This can be
> > fairly simply done using dd.
> >
> > Interestingly, this is not reproducible when using a non-partitioned device.
> > That is, creating the device with --auto=yes and then directly using it
> > functions as expected.  However, using --auto=yes or --auto=mdp and then
> > creating a partition across the device will cause the hang.
>
> I've done some more research that may potentially help.  All of the
> following was done with 2.6.34.1.
>
> Still produces the hang:
>   - Using cp (may take a bit longer).
>   - Using jfs as the filesystem.
>   - Dropping RESYNC_DEPTH to 32.
>   - Using the offset layout.
>
> Does not produce the hang:
>   - Using the near layout.
>   - Using dd on the partition directly instead of on a filesystem,
>     via something like:
>       dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
>
> As the barrier code is very similar, I repeated a number of these tests
> using raid1 instead of raid10.  In every case, I was unable to cause the
> system to hang.  I focused on the barriers due to the tracebacks in the
> previous email.  For the heck of it, I added some tracing (patch below)
> where the reason for the hang is fairly obvious.  Of course, how it
> happened isn't.
>
> The last bit of the trace before the hang:

Thanks for doing this!

See below...
<most of the quoted trace cut; it is reproduced in full in the parent message>

> <...>-4910 [002] 189.013295: raise_barrier: out: md99_resync - w:1 p:0 b:1
> <...>-4910 [002] 189.013414: raise_barrier: in: md99_resync - w:1 p:0 b:1
> <...>-4909 [002] 189.021171: lower_barrier: md99_raid10 - w:1 p:0 b:0
> <...>-4996 [003] 189.021179: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-4910 [002] 189.021182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-4996 [003] 189.021184: wait_barrier: in: dd - w:0 p:1 b:1
> <...>-4915 [003] 218.929283: wait_barrier: in: flush-9:99 - w:1 p:1 b:1

So the 'dd' process successfully waited for the barrier to be gone at
189.021179, and thus set pending to '1'.  It then submitted the IO request.
We should then see swapper (or possibly some other thread) calling
allow_barrier when the request completes.  But we don't.
A request could possibly take many milliseconds to complete, but it
shouldn't take seconds, and certainly not minutes.

It might be helpful if you could run this again and, in make_request(),
after the call to "wait_barrier()", print out:
  bio->bi_sector, bio->bi_size, bio->bi_rw

I'm guessing that the last request that doesn't seem to complete will be
different from the others in some important way.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 9+ messages in thread
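[Editorial aside: the printout Neil asks for can be produced with one more trace_printk() just after the wait_barrier() call in make_request(). A sketch against the same raid10.c as the earlier tracing patch; the hunk position, the `md_minor == 99` guard, and the message format are assumptions mirroring that patch (the format matches the `make_request:` lines that appear in the follow-up traces), not Neil's exact wording:]

```c
/* In make_request(), immediately after wait_barrier(conf): */
	wait_barrier(conf);
	if ( conf->mddev->md_minor == 99 )
		trace_printk("%s - sector:%llu sz:%u rw:%lu\n",
			     current->comm,
			     (unsigned long long)bio->bi_sector,
			     bio->bi_size, bio->bi_rw);
```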
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-07-23  3:19 ` Neil Brown
@ 2010-07-23 15:47   ` Justin Bronder
  2010-08-02  2:29     ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Justin Bronder @ 2010-07-23 15:47 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 7321 bytes --]

On 23/07/10 13:19 +1000, Neil Brown wrote:
> On Thu, 22 Jul 2010 14:49:33 -0400
> Justin Bronder <jsbronder@gentoo.org> wrote:
>
> > On 16/07/10 14:46 -0400, Justin Bronder wrote:
> >
> > I've done some more research that may potentially help.  All of the
> > following was done with 2.6.34.1.
> >
> > Still produces the hang:
> >   - Using cp (may take a bit longer).
> >   - Using jfs as the filesystem.
> >   - Dropping RESYNC_DEPTH to 32.
> >   - Using the offset layout.
> >
> > Does not produce the hang:
> >   - Using the near layout.
> >   - Using dd on the partition directly instead of on a filesystem,
> >     via something like:
> >       dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
> >
> > As the barrier code is very similar, I repeated a number of these tests
> > using raid1 instead of raid10.  In every case, I was unable to cause
> > the system to hang.  I focused on the barriers due to the tracebacks
> > in the previous email.  For the heck of it, I added some tracing
> > (patch below) where the reason for the hang is fairly obvious.  Of
> > course, how it happened isn't.
> >
> > The last bit of the trace before the hang:
>
> Thanks for doing this!
>
> See below...

<previous trace cut>

> So the 'dd' process successfully waited for the barrier to be gone at
> 189.021179, and thus set pending to '1'.  It then submitted the IO request.
> We should then see swapper (or possibly some other thread) calling
> allow_barrier when the request completes.  But we don't.
> A request could possibly take many milliseconds to complete, but it
> shouldn't take seconds, and certainly not minutes.
> > It might be helpful if you could run this again, and in make_request(), after > the call to "wait_barrier()" print out: > bio->bi_sector, bio->bi_size, bio->bi_rw > > I'm guessing that the last request that doesn't seem to complete will be > different from the other in some important way. Nothing stood out to me, but here's the tail end of a couple of different traces. <...>-5047 [002] 207.023784: wait_barrier: in: dd - w:0 p:11 b:0 <...>-5047 [002] 207.023784: wait_barrier: out: dd - w:0 p:12 b:0 <...>-5047 [002] 207.023785: make_request: dd - sector:7472001 sz:40960 rw:0 <...>-4958 [002] 207.023872: raise_barrier: mid: md99_resync - w:0 p:12 b:1 <...>-5047 [002] 207.024689: allow_barrier: dd - w:0 p:11 b:1 <...>-5047 [002] 207.024695: allow_barrier: dd - w:0 p:10 b:1 <...>-5047 [002] 207.024697: allow_barrier: dd - w:0 p:9 b:1 <...>-5047 [002] 207.024710: allow_barrier: dd - w:0 p:8 b:1 <...>-5047 [002] 207.024713: allow_barrier: dd - w:0 p:7 b:1 <...>-5047 [002] 207.026679: wait_barrier: in: dd - w:0 p:7 b:1 <idle>-0 [003] 207.043049: allow_barrier: swapper - w:1 p:6 b:1 <idle>-0 [003] 207.043058: allow_barrier: swapper - w:1 p:5 b:1 <idle>-0 [003] 207.043063: allow_barrier: swapper - w:1 p:4 b:1 <idle>-0 [003] 207.043070: allow_barrier: swapper - w:1 p:3 b:1 <idle>-0 [003] 207.043074: allow_barrier: swapper - w:1 p:2 b:1 <idle>-0 [003] 207.043079: allow_barrier: swapper - w:1 p:1 b:1 <idle>-0 [003] 207.043084: allow_barrier: swapper - w:1 p:0 b:1 <...>-4958 [003] 207.043108: raise_barrier: out: md99_resync - w:1 p:0 b:1 <...>-4958 [003] 207.043150: raise_barrier: in: md99_resync - w:1 p:0 b:1 <...>-4957 [003] 207.051206: lower_barrier: md99_raid10 - w:1 p:0 b:0 <...>-5047 [002] 207.051215: wait_barrier: out: dd - w:0 p:1 b:0 <...>-5047 [002] 207.051216: make_request: dd - sector:7472081 sz:20480 rw:0 <...>-4958 [003] 207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1 <...>-5047 [002] 207.051227: wait_barrier: in: dd - w:0 p:1 b:1 <idle>-0 [002] 
207.058929: allow_barrier: swapper - w:1 p:0 b:1 <...>-4958 [003] 207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1 <...>-4958 [003] 207.059044: raise_barrier: in: md99_resync - w:1 p:0 b:1 <...>-4957 [003] 207.067171: lower_barrier: md99_raid10 - w:1 p:0 b:0 <...>-5047 [002] 207.067179: wait_barrier: out: dd - w:0 p:1 b:0 <...>-5047 [002] 207.067180: make_request: dd - sector:7472121 sz:3584 rw:0 <...>-4958 [003] 207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1 <...>-5047 [002] 207.067184: wait_barrier: in: dd - w:0 p:1 b:1 <idle>-0 [000] 463.231730: allow_barrier: swapper - w:2 p:4 b:1 <idle>-0 [000] 463.231739: allow_barrier: swapper - w:2 p:3 b:1 <idle>-0 [000] 463.231746: allow_barrier: swapper - w:2 p:2 b:1 <idle>-0 [000] 463.231765: allow_barrier: swapper - w:2 p:1 b:1 <idle>-0 [000] 463.231774: allow_barrier: swapper - w:2 p:0 b:1 <...>-5004 [000] 463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1 <...>-5004 [000] 463.232005: raise_barrier: in: md99_resync - w:2 p:0 b:1 <...>-5003 [001] 463.232453: lower_barrier: md99_raid10 - w:2 p:0 b:0 <...>-5009 [000] 463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0 <...>-5009 [000] 463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1 <...>-5105 [001] 463.232466: wait_barrier: out: dd - w:0 p:2 b:0 <...>-5105 [001] 463.232467: make_request: dd - sector:7204393 sz:40960 rw:0 <...>-5009 [000] 463.232476: wait_barrier: in: flush-9:99 - w:0 p:2 b:0 <...>-5009 [000] 463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0 <...>-5009 [000] 463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1 <...>-5009 [000] 463.232481: wait_barrier: in: flush-9:99 - w:0 p:3 b:0 <...>-5009 [000] 463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0 <...>-5009 [000] 463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1 <...>-5105 [001] 463.232492: wait_barrier: in: dd - w:0 p:4 b:0 <...>-5105 [001] 463.232493: wait_barrier: out: dd - w:0 p:5 b:0 <...>-5105 [001] 
463.232494: make_request: dd - sector:7204473 sz:3584 rw:0 <...>-5004 [000] 463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1 <...>-5105 [001] 463.232496: wait_barrier: in: dd - w:0 p:5 b:1 <...>-5009 [000] 463.232522: wait_barrier: in: flush-9:99 - w:1 p:5 b:1 <idle>-0 [000] 463.232726: allow_barrier: swapper - w:2 p:4 b:1 <idle>-0 [001] 463.240520: allow_barrier: swapper - w:2 p:3 b:1 <idle>-0 [000] 463.240946: allow_barrier: swapper - w:2 p:2 b:1 <idle>-0 [000] 463.240955: allow_barrier: swapper - w:2 p:1 b:1 Thanks, -- Justin Bronder [-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-07-23 15:47     ` Justin Bronder
@ 2010-08-02  2:29       ` Neil Brown
  2010-08-02  2:58         ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-08-02 2:29 UTC (permalink / raw)
  To: Justin Bronder; +Cc: linux-raid

On Fri, 23 Jul 2010 11:47:01 -0400 Justin Bronder <jsbronder@gentoo.org> wrote:

> On 23/07/10 13:19 +1000, Neil Brown wrote:
> > On Thu, 22 Jul 2010 14:49:33 -0400
> >
> > So the 'dd' process successfully waited for the barrier to be gone at
> > 189.021179, and thus set pending to '1'.  It then submitted the IO request.
> > We should then see swapper (or possibly some other thread) calling
> > allow_barrier when the request completes.  But we don't.
> > A request could possibly take many milliseconds to complete, but it
> > shouldn't take seconds and certainly not minutes.
> >
> > It might be helpful if you could run this again, and in make_request(),
> > after the call to "wait_barrier()", print out:
> >   bio->bi_sector, bio->bi_size, bio->bi_rw
> >
> > I'm guessing that the last request that doesn't seem to complete will be
> > different from the others in some important way.
>
> Nothing stood out to me, but here's the tail end of a couple of different
> traces.

Thanks a lot!  Something does stand out for me!....

> <...>-5047 [002]   207.051215: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-5047 [002]   207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
> <...>-4958 [003]   207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-5047 [002]   207.051227: wait_barrier: in: dd - w:0 p:1 b:1
> <idle>-0   [002]   207.058929: allow_barrier: swapper - w:1 p:0 b:1
> <...>-4958 [003]   207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1
> <...>-4958 [003]   207.059044: raise_barrier: in: md99_resync - w:1 p:0 b:1
> <...>-4957 [003]   207.067171: lower_barrier: md99_raid10 - w:1 p:0 b:0
> <...>-5047 [002]   207.067179: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-5047 [002]   207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
> <...>-4958 [003]   207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-5047 [002]   207.067184: wait_barrier: in: dd - w:0 p:1 b:1

The last successful IO is only 3584 bytes - 7 sectors.  All the others are
much larger.  It is almost certain that the read needed to cross a chunk
boundary, so some of it goes to one device and some to the next.  It was
probably a 64K read or similar.  The first 5 pages all fit in one device,
and so go through as a 20K read.  The next page doesn't, so it comes down
to md/raid10 as a 1-page read.  raid10 splits it into a 7-sector read and a
1-sector read.  We see the 7-sector read being initiated, but it doesn't
complete for some reason, and the resync barrier gets in the way, so the
1-sector read blocks in wait_barrier.

In the next trace....
> <idle>-0   [000]   463.231730: allow_barrier: swapper - w:2 p:4 b:1
> <idle>-0   [000]   463.231739: allow_barrier: swapper - w:2 p:3 b:1
> <idle>-0   [000]   463.231746: allow_barrier: swapper - w:2 p:2 b:1
> <idle>-0   [000]   463.231765: allow_barrier: swapper - w:2 p:1 b:1
> <idle>-0   [000]   463.231774: allow_barrier: swapper - w:2 p:0 b:1
> <...>-5004 [000]   463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1
> <...>-5004 [000]   463.232005: raise_barrier: in: md99_resync - w:2 p:0 b:1
> <...>-5003 [001]   463.232453: lower_barrier: md99_raid10 - w:2 p:0 b:0
> <...>-5009 [000]   463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
> <...>-5009 [000]   463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1
> <...>-5105 [001]   463.232466: wait_barrier: out: dd - w:0 p:2 b:0
> <...>-5105 [001]   463.232467: make_request: dd - sector:7204393 sz:40960 rw:0
> <...>-5009 [000]   463.232476: wait_barrier: in: flush-9:99 - w:0 p:2 b:0
> <...>-5009 [000]   463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
> <...>-5009 [000]   463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1
> <...>-5009 [000]   463.232481: wait_barrier: in: flush-9:99 - w:0 p:3 b:0
> <...>-5009 [000]   463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
> <...>-5009 [000]   463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1
> <...>-5105 [001]   463.232492: wait_barrier: in: dd - w:0 p:4 b:0
> <...>-5105 [001]   463.232493: wait_barrier: out: dd - w:0 p:5 b:0

We see a similar thing here, with a write being broken into a 15-page
write, then a 7-sector write, then a 1-sector write - and that all works.

> <...>-5105 [001]   463.232494: make_request: dd - sector:7204473 sz:3584 rw:0
> <...>-5004 [000]   463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1
> <...>-5105 [001]   463.232496: wait_barrier: in: dd - w:0 p:5 b:1
> <...>-5009 [000]   463.232522: wait_barrier: in: flush-9:99 - w:1 p:5 b:1
> <idle>-0   [000]   463.232726: allow_barrier: swapper - w:2 p:4 b:1
> <idle>-0   [001]   463.240520: allow_barrier: swapper - w:2 p:3 b:1
> <idle>-0   [000]   463.240946: allow_barrier: swapper - w:2 p:2 b:1
> <idle>-0   [000]   463.240955: allow_barrier: swapper - w:2 p:1 b:1

But again we see a 7-sector read following a larger read, and the 1-sector
read that should follow gets blocked.

So it is somehow related to the need to split one-page requests across
multiple devices, and it could be specific to read requests.

Ahhhh....  I see the problem.  Because a 'generic_make_request' is already
active, the one called by raid10::make_request just queues the request
until the top-level one completes.  This results in a deadlock.

I'll have to ponder a bit to figure out the best way to fix this.

Thanks again for the report and the help tracking down the problem.

NeilBrown
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-08-02  2:29       ` Neil Brown
@ 2010-08-02  2:58         ` Neil Brown
  2010-08-02 20:37           ` Justin Bronder
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-08-02 2:58 UTC (permalink / raw)
  To: Justin Bronder; +Cc: linux-raid

On Mon, 2 Aug 2010 12:29:49 +1000 Neil Brown <neilb@suse.de> wrote:

> Ahhhh....  I see the problem.  Because a 'generic_make_request' is already
> active, the one called by raid10::make_request just queues the request
> until the top-level one completes.  This results in a deadlock.
>
> I'll have to ponder a bit to figure out the best way to fix this.

So, one good strong cup of tea later, I think I have a good solution.

Would you be able to test with this patch and confirm that you cannot
reproduce the hang?

Thanks.

NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 42e64e4..d1d6891 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -825,11 +825,29 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 		 */
 		bp = bio_split(bio,
 			       chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
+
+		/* Each of these 'make_request' calls will call 'wait_barrier'.
+		 * If the first succeeds but the second blocks due to the resync
+		 * thread raising the barrier, we will deadlock because the
+		 * IO to the underlying device will be queued in generic_make_request
+		 * and will never complete, so will never reduce nr_pending.
+		 * So increment nr_waiting here so no new raise_barriers will
+		 * succeed, and so the second wait_barrier cannot block.
+		 */
+		spin_lock_irq(&conf->resync_lock);
+		conf->nr_waiting++;
+		spin_unlock_irq(&conf->resync_lock);
+
 		if (make_request(mddev, &bp->bio1))
 			generic_make_request(&bp->bio1);
 		if (make_request(mddev, &bp->bio2))
 			generic_make_request(&bp->bio2);
 
+		spin_lock_irq(&conf->resync_lock);
+		conf->nr_waiting--;
+		wake_up(&conf->wait_barrier);
+		spin_unlock_irq(&conf->resync_lock);
+
 		bio_pair_release(bp);
 		return 0;
 	bad_map:
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-08-02  2:58         ` Neil Brown
@ 2010-08-02 20:37           ` Justin Bronder
  2010-08-07 11:22             ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Justin Bronder @ 2010-08-02 20:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On 02/08/10 12:58 +1000, Neil Brown wrote:
> On Mon, 2 Aug 2010 12:29:49 +1000 Neil Brown <neilb@suse.de> wrote:
>
> > Ahhhh....  I see the problem.  Because a 'generic_make_request' is already
> > active, the one called by raid10::make_request just queues the request
> > until the top-level one completes.  This results in a deadlock.
> >
> > I'll have to ponder a bit to figure out the best way to fix this.
>
> So, one good strong cup of tea later, I think I have a good solution.
>
> Would you be able to test with this patch and confirm that you cannot
> reproduce the hang?

I've been running with this patch on 2.6.34.1 all day and have yet to
cause the hang.  Given it took under 5 minutes earlier, feel free to add:

Tested-by: Justin Bronder <jsbronder@gentoo.org>

I really appreciate you taking care of this.  Thanks.

> Thanks.
>
> NeilBrown

<patch quoted in full, cut>

--
Justin Bronder
* Re: Raid10 device hangs during resync and heavy I/O.
  2010-08-02 20:37           ` Justin Bronder
@ 2010-08-07 11:22             ` Neil Brown
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2010-08-07 11:22 UTC (permalink / raw)
  To: Justin Bronder; +Cc: linux-raid

On Mon, 2 Aug 2010 16:37:54 -0400 Justin Bronder <jsbronder@gentoo.org> wrote:

> On 02/08/10 12:58 +1000, Neil Brown wrote:
> > So, one good strong cup of tea later, I think I have a good solution.
> >
> > Would you be able to test with this patch and confirm that you cannot
> > reproduce the hang?
>
> I've been running with this patch on 2.6.34.1 all day and have yet to
> cause the hang.  Given it took under 5 minutes earlier, feel free to add:
>
> Tested-by: Justin Bronder <jsbronder@gentoo.org>
>
> I really appreciate you taking care of this.  Thanks.

And thank you for testing.

I've queued this up now and will send it to Linus and -stable shortly.

NeilBrown

<patch quoted in full, cut>
end of thread, other threads:[~2010-08-07 11:22 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-16 18:46 Raid10 device hangs during resync and heavy I/O. Justin Bronder
2010-07-16 18:49 ` Justin Bronder
2010-07-22 18:49 ` Justin Bronder
2010-07-23  3:19   ` Neil Brown
2010-07-23 15:47     ` Justin Bronder
2010-08-02  2:29       ` Neil Brown
2010-08-02  2:58         ` Neil Brown
2010-08-02 20:37           ` Justin Bronder
2010-08-07 11:22             ` Neil Brown