From: Alexander Lyakas <alex.bolshoy@gmail.com>
To: NeilBrown <neilb@suse.de>, linux-raid <linux-raid@vger.kernel.org>
Cc: Shyam Kaushik <shyam@zadarastorage.com>, yair@zadarastorage.com
Subject: Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
Date: Sun, 2 Jun 2013 15:43:41 +0300 [thread overview]
Message-ID: <CAGRgLy4zTwopPS-cVwT1RKBr8pE7f3n-X3YLyFpktooQtTYGuQ@mail.gmail.com> (raw)
In-Reply-To: <CAGRgLy5m8KxKPxJE+AbtYbBDzo9uo81wez1ba=H4VY3Jiq32WQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 9743 bytes --]
Hello Neil,
I believe I have found what is causing the deadlock. It happens in two flavors:
1)
# raid1d() is called, and conf->pending_bio_list is non-empty at this point
# raid1d() calls md_check_recovery(), which eventually calls
raid1_add_disk(), which calls raise_barrier()
# Now raise_barrier() waits for conf->nr_pending to reach 0, but it
never can: bios are sitting in conf->pending_bio_list, and nobody will
flush them, because raid1d is the thread that is supposed to call
flush_pending_writes() (either directly or via handle_read_error()),
and raid1d itself is stuck in raise_barrier().
2)
# raid1_add_disk() calls raise_barrier() and waits for
conf->nr_pending to become 0, as before
# a new WRITE arrives and calls wait_barrier() from a thread whose
current->bio_list is non-empty
# in this case the code lets the WRITE go through wait_barrier(), and
it triggers WRITEs to the mirror legs, but those WRITEs again end up
in conf->pending_bio_list (either via raid1_unplug() or directly);
nobody will flush conf->pending_bio_list, because raid1d is stuck in
raise_barrier().
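
In both flavors the circular wait is the same. Here is a simplified
sketch of the raid1d() control flow in this kernel (illustrative
only, not verbatim kernel code):

    /* raid1d() -- simplified sketch, for illustration */
    static void raid1d(struct md_thread *thread)
    {
            struct mddev *mddev = thread->mddev;
            struct r1conf *conf = mddev->private;

            /* (a) May call raid1_add_disk() -> raise_barrier(),
             *     which sleeps until conf->nr_pending == 0.
             */
            md_check_recovery(mddev);

            /* (b) The only place conf->pending_bio_list gets
             *     drained, letting nr_pending fall to 0 -- but it
             *     is never reached while (a) sleeps.
             */
            flush_pending_writes(conf);

            /* ... handle_read_error() and the rest of the loop ... */
    }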
Previously, for example in kernel 3.2, raid1_add_disk() did not call
raise_barrier(), so this problem did not happen.
Attached is a reproduction trace from prints I added to
raise_barrier() and wait_barrier() (the instrumented code is also
attached). It demonstrates case 2: once raise_barrier() is called,
conf->nr_pending drops until it equals the number of wait_barrier()
calls that slipped through because of a non-empty current->bio_list;
in the attached trace, conf->nr_pending and conf->pending_count both
settle at 27. At that point the array hangs.
Can you please comment on how to fix this problem? It looks like a
real deadlock.

We could perhaps call md_check_recovery() after
flush_pending_writes(), but that only narrows the window; it does not
close it entirely. It really looks like we should not be calling
raise_barrier() from raid1d at all.
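
For illustration, the reordering I have in mind would look roughly
like this (a sketch only, not a tested patch):

    /* In raid1d(): drain the pending list before md_check_recovery()
     * can reach raise_barrier().  Sketch only.
     */
    flush_pending_writes(conf);
    md_check_recovery(mddev);

    /* The window remains: a WRITE that slips through wait_barrier()
     * because of a non-empty current->bio_list can repopulate
     * conf->pending_bio_list between these two calls, and then
     * raise_barrier() blocks exactly as before.
     */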
Thanks,
Alex.
On Thu, May 30, 2013 at 5:30 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hi Neil,
>
> On Mon, May 20, 2013 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
>> On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Hello Neil,
>>> we are hitting an issue that looks very similar; we are on kernel 3.8.2.
>>> The array is a 2-device raid1 that experienced a drive failure; the
>>> drive was then removed and re-added back to the array (although the
>>> rebuild never started). Relevant parts of the kernel log:
>>>
>>> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on
>>> dm-6, disabling device.
>>> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation
>>> continuing on 1 devices.
>>> May 16 11:12:14 kernel: [46850.090511] md: super_written gets
>>> error=-5, uptodate=0
>>> May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6:
>>> rescheduling sector 18040736
>>> May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6:
>>> rescheduling sector 20367040
>>> May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6:
>>> rescheduling sector 17613504
>>> May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6:
>>> rescheduling sector 18042720
>>> May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting
>>> sector 18040736 to other mirror: dm-13
>>> May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting
>>> sector 20367040 to other mirror: dm-13
>>> May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting
>>> sector 17613504 to other mirror: dm-13
>>> May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting
>>> sector 18042720 to other mirror: dm-13
>>> May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6>
>>> May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6)
>>> May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6>
>>>
>>> May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749
>>> blocked for more than 60 seconds.
>>> May 16 11:24:28 kernel: [47583.921829] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> May 16 11:24:28 kernel: [47583.923361] md2_raid1 D
>>> 0000000000000001 0 15749 2 0x00000000
>>> May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28
>>> 0000000000000046 ffff880000000002 00000000982c43b8
>>> May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8
>>> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
>>> May 16 11:24:28 kernel: [47583.923376] ffff880119b11740
>>> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
>>> May 16 11:24:28 kernel: [47583.923381] Call Trace:
>>> May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70
>>> May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>]
>>> raise_barrier+0x106/0x160 [raid1]
>>> May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ?
>>> add_wait_queue+0x60/0x60
>>> May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>]
>>> raid1_add_disk+0x197/0x200 [raid1]
>>> May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>]
>>> remove_and_add_spares+0x104/0x220
>>> May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>]
>>> md_check_recovery.part.49+0x40d/0x530
>>> May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>]
>>> md_check_recovery+0x15/0x20
>>> May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>]
>>> raid1d+0x22/0x180 [raid1]
>>> May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ?
>>> default_spin_lock_flags+0x9/0x10
>>> May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ?
>>> default_spin_lock_flags+0x9/0x10
>>> May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>]
>>> md_thread+0x10d/0x140
>>> May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ?
>>> add_wait_queue+0x60/0x60
>>> May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ?
>>> md_rdev_init+0x140/0x140
>>> May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>>> May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ?
>>> flush_kthread_worker+0xb0/0xb0
>>> May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>]
>>> ret_from_fork+0x7c/0xb0
>>> May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ?
>>> flush_kthread_worker+0xb0/0xb0
>>>
>>> dm-13 is the good drive, dm-6 is the one that failed.
>>>
>>> At this point, we have several threads calling submit_bio, all
>>> stuck like this:
>>>
>>> cat /proc/16218/stack
>>> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
>>> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
>>> [<ffffffff81561ed0>] md_make_request+0xd0/0x200
>>> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100
>>> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160
>>> ...
>>>
>>> And md raid1 thread stuck like this:
>>>
>>> cat /proc/15749/stack
>>> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
>>> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
>>> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
>>> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
>>> [<ffffffff8156a165>] md_check_recovery+0x15/0x20
>>> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
>>> [<ffffffff815624ed>] md_thread+0x10d/0x140
>>> [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>>> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> We also have two user-space threads stuck:
>>>
>>> one is trying to read /sys/block/md2/md/array_state and its kernel stack is:
>>> # cat /proc/2251/stack
>>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> the other wants to read from /proc/mdstat and is:
>>> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540
>>> [<ffffffff811bdd1b>] seq_read+0x16b/0x400
>>> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> mdadm --detail also gets stuck if attempted, in a stack like this:
>>> cat /proc/2864/stack
>>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this
>>
>> Probably.
>>
>>> issue? Is this patch alone applicable to kernel 3.8.2?
>>
>> Probably.
>
> This problem still happens in 3.8.13, which contains the above fix of
> yours. I am currently debugging it deeper, but still don't know what
> the issue is.
> The problem happens when raid1_add_disk() calls raise_barrier().
> conf->nr_waiting is 0 at this point, so raise_barrier() starts
> waiting for conf->nr_pending to drop to 0, but it never does.
> From this point, anybody that calls wait_barrier() is also stuck.
>
> This looks like a real deadlock in a non-ancient kernel. I realize
> that 3.8.x is EOL, but it is still only one stable release behind.
> Can you help further with this problem?
>
> Another problem that I see often is that when a drive fails in RAID1,
> the rdev is marked as "Blocked", but the Blocked bit never gets
> cleared for some reason, so the drive is not ejected from the array.
> Again, I have been unable to debug this deeper at this point.
>
> Thanks,
> Alex.
>
>>
>>> Can you please comment on this?
>>>
>>
>> NeilBrown
[-- Attachment #2: repro.txt --]
[-- Type: text/plain, Size: 14688 bytes --]
[ 514.714425] md: bind<dm-5>
[ 514.718751] md2: raise_barrier(1): conf->barrier=0 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0
[ 514.718755] md2: raise_barrier(2): conf->barrier=1 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0
[ 514.720564] md2: wait_barrier(PID=7079,1) conf->barrier=1 conf->nr_waiting=1 conf->nr_pending=191 conf->pending_count=0
[ 514.722073] md2: wait_barrier(PID=7069,1) conf->barrier=1 conf->nr_waiting=2 conf->nr_pending=191 conf->pending_count=0
[ 514.724602] md2: wait_barrier(PID=9922,1) conf->barrier=1 conf->nr_waiting=3 conf->nr_pending=191 conf->pending_count=0
[ 514.725575] md2: wait_barrier(PID=9909,1) conf->barrier=1 conf->nr_waiting=4 conf->nr_pending=187 conf->pending_count=0
[ 514.725661] md2: wait_barrier(PID=7082,1) conf->barrier=1 conf->nr_waiting=5 conf->nr_pending=187 conf->pending_count=0
[ 514.725836] md2: wait_barrier(PID=9910,1) conf->barrier=1 conf->nr_waiting=6 conf->nr_pending=178 conf->pending_count=0
[ 514.725848] md2: wait_barrier(PID=9918,1) conf->barrier=1 conf->nr_waiting=7 conf->nr_pending=178 conf->pending_count=0
[ 514.725860] md2: wait_barrier(PID=7077,1) conf->barrier=1 conf->nr_waiting=8 conf->nr_pending=178 conf->pending_count=0
[ 514.725872] md2: wait_barrier(PID=9905,1) conf->barrier=1 conf->nr_waiting=9 conf->nr_pending=178 conf->pending_count=0
[ 514.725884] md2: wait_barrier(PID=7147,1) conf->barrier=1 conf->nr_waiting=10 conf->nr_pending=178 conf->pending_count=0
[ 514.725917] md2: wait_barrier(PID=10217,1) conf->barrier=1 conf->nr_waiting=11 conf->nr_pending=178 conf->pending_count=0
[ 514.727122] md2: wait_barrier(PID=7177,1) conf->barrier=1 conf->nr_waiting=12 conf->nr_pending=155 conf->pending_count=0
[ 514.727446] md2: wait_barrier(PID=9904,1) conf->barrier=1 conf->nr_waiting=13 conf->nr_pending=155 conf->pending_count=0
[ 514.727475] md2: wait_barrier(PID=9894,1) conf->barrier=1 conf->nr_waiting=14 conf->nr_pending=155 conf->pending_count=0
[ 514.727489] md2: wait_barrier(PID=9898,1) conf->barrier=1 conf->nr_waiting=15 conf->nr_pending=155 conf->pending_count=0
[ 514.727507] md2: wait_barrier(PID=7072,1) conf->barrier=1 conf->nr_waiting=16 conf->nr_pending=155 conf->pending_count=0
[ 514.727651] md2: wait_barrier(PID=9899,1) conf->barrier=1 conf->nr_waiting=17 conf->nr_pending=150 conf->pending_count=0
[ 514.727853] md2: wait_barrier(PID=9906,1) conf->barrier=1 conf->nr_waiting=18 conf->nr_pending=150 conf->pending_count=0
[ 514.727867] md2: wait_barrier(PID=9912,1) conf->barrier=1 conf->nr_waiting=19 conf->nr_pending=150 conf->pending_count=0
[ 514.727915] md2: wait_barrier(PID=6888,1) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=150 conf->pending_count=0
[ 514.728109] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=150 conf->pending_count=0
[ 514.728112] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=151 conf->pending_count=0
[ 514.728136] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=151 conf->pending_count=1
[ 514.728138] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=152 conf->pending_count=1
[ 514.728143] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=152 conf->pending_count=2
[ 514.728145] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=153 conf->pending_count=2
[ 514.728149] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=153 conf->pending_count=3
[ 514.730110] md2: wait_barrier(PID=9919,1) conf->barrier=1 conf->nr_waiting=22 conf->nr_pending=153 conf->pending_count=3
[ 514.730188] md2: wait_barrier(PID=5773,1) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=153 conf->pending_count=3
[ 514.731139] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=153 conf->pending_count=3
[ 514.731143] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=154 conf->pending_count=3
[ 514.731176] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=154 conf->pending_count=4
[ 514.731179] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=155 conf->pending_count=4
[ 514.731183] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=155 conf->pending_count=5
[ 514.731186] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=156 conf->pending_count=5
[ 514.731193] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=156 conf->pending_count=6
[ 514.731371] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=156 conf->pending_count=6
[ 514.731374] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=157 conf->pending_count=6
[ 514.731382] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=157 conf->pending_count=7
[ 514.731385] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=158 conf->pending_count=7
[ 514.731388] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=158 conf->pending_count=8
[ 514.731390] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=159 conf->pending_count=8
[ 514.731394] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=159 conf->pending_count=9
[ 514.731491] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=159 conf->pending_count=9
[ 514.731494] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=160 conf->pending_count=9
[ 514.731502] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=160 conf->pending_count=10
[ 514.731505] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=161 conf->pending_count=10
[ 514.731508] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=161 conf->pending_count=11
[ 514.731510] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=162 conf->pending_count=11
[ 514.731514] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=162 conf->pending_count=12
[ 514.731755] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=162 conf->pending_count=12
[ 514.731758] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=163 conf->pending_count=12
[ 514.731767] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=163 conf->pending_count=13
[ 514.731769] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=164 conf->pending_count=13
[ 514.731773] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=164 conf->pending_count=14
[ 514.731775] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=165 conf->pending_count=14
[ 514.731826] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=165 conf->pending_count=15
[ 514.731875] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=165 conf->pending_count=15
[ 514.731878] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=166 conf->pending_count=15
[ 514.731884] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=166 conf->pending_count=16
[ 514.731887] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=167 conf->pending_count=16
[ 514.731890] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=167 conf->pending_count=17
[ 514.731893] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=168 conf->pending_count=17
[ 514.731902] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=168 conf->pending_count=18
[ 514.731971] md2: wait_barrier(PID=9896,1) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=168 conf->pending_count=18
[ 514.732304] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=168 conf->pending_count=18
[ 514.732307] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=169 conf->pending_count=18
[ 514.732340] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=169 conf->pending_count=19
[ 514.732343] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=170 conf->pending_count=19
[ 514.732346] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=170 conf->pending_count=20
[ 514.732349] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=171 conf->pending_count=20
[ 514.732352] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=171 conf->pending_count=21
[ 514.732397] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=171 conf->pending_count=21
[ 514.732400] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=172 conf->pending_count=21
[ 514.732412] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=172 conf->pending_count=22
[ 514.732414] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=173 conf->pending_count=22
[ 514.732419] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=173 conf->pending_count=23
[ 514.732421] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=174 conf->pending_count=23
[ 514.732424] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=174 conf->pending_count=24
[ 514.732459] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=174 conf->pending_count=24
[ 514.732461] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=175 conf->pending_count=24
[ 514.732467] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=175 conf->pending_count=25
[ 514.732469] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=176 conf->pending_count=25
[ 514.732473] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=176 conf->pending_count=26
[ 514.732475] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=177 conf->pending_count=26
[ 514.732478] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=177 conf->pending_count=27
[ 514.734807] md2: wait_barrier(PID=9901,1) conf->barrier=1 conf->nr_waiting=33 conf->nr_pending=177 conf->pending_count=27
[ 514.734830] md2: wait_barrier(PID=9923,1) conf->barrier=1 conf->nr_waiting=34 conf->nr_pending=177 conf->pending_count=27
[ 514.735975] md2: wait_barrier(PID=7175,1) conf->barrier=1 conf->nr_waiting=35 conf->nr_pending=158 conf->pending_count=27
[ 514.735992] md2: wait_barrier(PID=7081,1) conf->barrier=1 conf->nr_waiting=36 conf->nr_pending=158 conf->pending_count=27
[ 514.736083] md2: wait_barrier(PID=9921,1) conf->barrier=1 conf->nr_waiting=37 conf->nr_pending=158 conf->pending_count=27
[ 514.736670] md2: wait_barrier(PID=7075,1) conf->barrier=1 conf->nr_waiting=38 conf->nr_pending=158 conf->pending_count=27
[ 514.740425] md2: wait_barrier(PID=9925,1) conf->barrier=1 conf->nr_waiting=39 conf->nr_pending=125 conf->pending_count=27
[ 514.740448] md2: wait_barrier(PID=41,1) conf->barrier=1 conf->nr_waiting=40 conf->nr_pending=125 conf->pending_count=27
[ 514.740465] md2: wait_barrier(PID=7083,1) conf->barrier=1 conf->nr_waiting=41 conf->nr_pending=125 conf->pending_count=27
[ 514.740478] md2: wait_barrier(PID=9911,1) conf->barrier=1 conf->nr_waiting=42 conf->nr_pending=125 conf->pending_count=27
[ 514.740492] md2: wait_barrier(PID=9903,1) conf->barrier=1 conf->nr_waiting=43 conf->nr_pending=125 conf->pending_count=27
[ 514.741067] md2: wait_barrier(PID=7071,1) conf->barrier=1 conf->nr_waiting=44 conf->nr_pending=125 conf->pending_count=27
[ 514.743463] md2: wait_barrier(PID=9920,1) conf->barrier=1 conf->nr_waiting=45 conf->nr_pending=105 conf->pending_count=27
[ 514.743478] md2: wait_barrier(PID=9924,1) conf->barrier=1 conf->nr_waiting=46 conf->nr_pending=105 conf->pending_count=27
[ 514.743489] md2: wait_barrier(PID=7146,1) conf->barrier=1 conf->nr_waiting=47 conf->nr_pending=105 conf->pending_count=27
[ 514.745765] md2: wait_barrier(PID=9890,1) conf->barrier=1 conf->nr_waiting=48 conf->nr_pending=89 conf->pending_count=27
[ 514.745784] md2: wait_barrier(PID=9879,1) conf->barrier=1 conf->nr_waiting=49 conf->nr_pending=89 conf->pending_count=27
[ 514.746556] md2: wait_barrier(PID=6,1) conf->barrier=1 conf->nr_waiting=50 conf->nr_pending=81 conf->pending_count=27
[ 514.748897] md2: wait_barrier(PID=9966,1) conf->barrier=1 conf->nr_waiting=51 conf->nr_pending=75 conf->pending_count=27
[ 514.749195] md2: wait_barrier(PID=9900,1) conf->barrier=1 conf->nr_waiting=52 conf->nr_pending=67 conf->pending_count=27
[ 514.750047] md2: wait_barrier(PID=9887,1) conf->barrier=1 conf->nr_waiting=53 conf->nr_pending=57 conf->pending_count=27
[ 514.750065] md2: wait_barrier(PID=9889,1) conf->barrier=1 conf->nr_waiting=54 conf->nr_pending=57 conf->pending_count=27
[ 514.750076] md2: wait_barrier(PID=9893,1) conf->barrier=1 conf->nr_waiting=55 conf->nr_pending=57 conf->pending_count=27
[ 514.752565] md2: wait_barrier(PID=9915,1) conf->barrier=1 conf->nr_waiting=56 conf->nr_pending=48 conf->pending_count=27
[ 514.752580] md2: wait_barrier(PID=7074,1) conf->barrier=1 conf->nr_waiting=57 conf->nr_pending=48 conf->pending_count=27
[ 514.752592] md2: wait_barrier(PID=9902,1) conf->barrier=1 conf->nr_waiting=58 conf->nr_pending=48 conf->pending_count=27
[ 514.752603] md2: wait_barrier(PID=9916,1) conf->barrier=1 conf->nr_waiting=59 conf->nr_pending=48 conf->pending_count=27
[ 514.752926] md2: wait_barrier(PID=42,1) conf->barrier=1 conf->nr_waiting=60 conf->nr_pending=27 conf->pending_count=27
[ 514.752946] md2: wait_barrier(PID=7073,1) conf->barrier=1 conf->nr_waiting=61 conf->nr_pending=27 conf->pending_count=27
[ 514.752960] md2: wait_barrier(PID=9914,1) conf->barrier=1 conf->nr_waiting=62 conf->nr_pending=27 conf->pending_count=27
[-- Attachment #3: raise_wait_barrier.c --]
[-- Type: text/x-csrc, Size: 2457 bytes --]
static void raise_barrier(struct r1conf *conf)
{
        spin_lock_irq(&conf->resync_lock);
        pr_info("%s: raise_barrier(1): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n",
                mdname(conf->mddev), conf->barrier, conf->nr_waiting,
                conf->nr_pending, conf->pending_count);

        /* Wait until no block IO is waiting */
        wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
                            conf->resync_lock);

        /* block any new IO from starting */
        conf->barrier++;
        pr_info("%s: raise_barrier(2): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n",
                mdname(conf->mddev), conf->barrier, conf->nr_waiting,
                conf->nr_pending, conf->pending_count);

        /* Now wait for all pending IO to complete */
        wait_event_lock_irq(conf->wait_barrier,
                            !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
                            conf->resync_lock);
        pr_info("%s: raise_barrier(3): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n",
                mdname(conf->mddev), conf->barrier, conf->nr_waiting,
                conf->nr_pending, conf->pending_count);

        spin_unlock_irq(&conf->resync_lock);
}
static void wait_barrier(struct r1conf *conf)
{
        int waited = 0;

        spin_lock_irq(&conf->resync_lock);
        if (conf->barrier) {
                conf->nr_waiting++;
                waited = 1;
                pr_info("%s: wait_barrier(PID=%d,1) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n",
                        mdname(conf->mddev), current->pid, conf->barrier,
                        conf->nr_waiting, conf->nr_pending,
                        conf->pending_count);
                if (conf->nr_pending && current->bio_list &&
                    !bio_list_empty(current->bio_list))
                        dump_stack();
                /* Wait for the barrier to drop.
                 * However if there are already pending
                 * requests (preventing the barrier from
                 * rising completely), and the
                 * pre-process bio queue isn't empty,
                 * then don't wait, as we need to empty
                 * that queue to get the nr_pending
                 * count down.
                 */
                wait_event_lock_irq(conf->wait_barrier,
                                    !conf->barrier ||
                                    (conf->nr_pending &&
                                     current->bio_list &&
                                     !bio_list_empty(current->bio_list)),
                                    conf->resync_lock);
                conf->nr_waiting--;
        }
        conf->nr_pending++;
        if (waited)
                pr_info("%s: wait_barrier(PID=%d,2) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n",
                        mdname(conf->mddev), current->pid, conf->barrier,
                        conf->nr_waiting, conf->nr_pending,
                        conf->pending_count);
        spin_unlock_irq(&conf->resync_lock);
}