From: Neil Brown <neilb@suse.de>
To: Hubert Tonneau <hubert.tonneau@fullpliant.org>
Cc: linux-scsi@vger.kernel.org
Subject: Re: MD RAID1 deadlock on failed disk
Date: Wed, 27 Oct 2010 20:52:38 +1100
Message-ID: <20101027205238.4e1a4b68@notabene>
In-Reply-To: <0AFEJ5E11@briare1.fullpliant.org>
On Wed, 27 Oct 2010 10:44:02 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:
> Hi,
>
> The configuration is:
> Perc H200 controller configured with no RAID (mpt2sas driver),
> 2 SATA disks (sda and sdb),
> Linux MD Software RAID1 (md0),
> stock Linux 2.6.35.7 kernel.
>
> I hotunplug the second (sdb) disk, and the result is:
> . as expected, I can read sda device,
> . as expected, any read to sdb device fails,
> . unexpectedly, any read to md0 never returns.
>
> No oops or anything like that in the kernel log.
> I did not try the same with other kernel releases.
>
> 2.6.32.24 kernel worked fine.
>
> Neil Brown asked for /proc/sysrq-trigger output,
> and concluded that the problem is related to 'fw_event0'.
> See his answer below.
>
> Regards,
> Hubert Tonneau
>
>
> Neil Brown wrote:
> >
> > The fw_event0 process is interesting.
> > It seems to be hung trying to 'sync' the drive that has just been pulled.
> > If that is somehow causing some IO request from the md/raid1 to be delayed
> > then that would certainly hang the array.
> >
> > There is a section in the middle of the trace which is missing - presumably
> > the sysrq-trigger output overflowed a buffer - that isn't uncommon.
> >
> > So I cannot see all the timing clearly.
> > How long after pulling the drive was this trace taken?
> >
> > I suspect that you need to post this to linux-scsi@vger.kernel.org
> > and ask about that fw_event0 thread - whether that should happen, whether it
> > has been fixed, and whether it could delay pending IO requests.
> >
> > NeilBrown
It probably would have helped to include the sysrq-T output, so that the scsi
people can see why I pointed the finger at fw_event0.
Here is that part of the trace:
<6>[ 318.881486] fw_event0 D 0000000000000000 0 244 2 0x00000000
<4>[ 318.881493] ffff88081d191570 0000000000000046 ffff880800000000 00000000000158c0
<4>[ 318.881500] ffff88081d191fd8 00000000000158c0 ffff88081d191fd8 ffff88081d188000
<4>[ 318.881507] 00000000000158c0 00000000000158c0 ffff88081d191fd8 00000000000158c0
<4>[ 318.881514] Call Trace:
<4>[ 318.881520] [<ffffffff815a296d>] schedule_timeout+0x22d/0x310
<4>[ 318.881526] [<ffffffff813a21f0>] ? __scsi_queue_insert+0xb0/0x130
<4>[ 318.881533] [<ffffffff815a252b>] wait_for_common+0xdb/0x1a0
<4>[ 318.881540] [<ffffffff81051910>] ? default_wake_function+0x0/0x20
<4>[ 318.881546] [<ffffffff81294093>] ? __generic_unplug_device+0x33/0x40
<4>[ 318.881553] [<ffffffff815a26cd>] wait_for_completion+0x1d/0x20
<4>[ 318.881560] [<ffffffff8129a9fe>] blk_execute_rq+0x8e/0xf0
<4>[ 318.881567] [<ffffffff8129666c>] ? blk_get_request+0x6c/0xa0
<4>[ 318.881573] [<ffffffff813a129c>] scsi_execute+0xfc/0x160
<4>[ 318.881580] [<ffffffff813a2cec>] scsi_execute_req+0xac/0x180
<4>[ 318.881589] [<ffffffff813c5fd0>] sd_sync_cache+0xd0/0x120
<4>[ 318.881598] [<ffffffff815a187a>] ? printk+0x68/0x6e
<4>[ 318.881604] [<ffffffff813c6283>] sd_shutdown+0x83/0x1b0
<4>[ 318.881610] [<ffffffff813c6562>] sd_remove+0x62/0xa0
<4>[ 318.881618] [<ffffffff81377555>] __device_release_driver+0x75/0xe0
<4>[ 318.881624] [<ffffffff81377acd>] device_release_driver+0x2d/0x40
<4>[ 318.881631] [<ffffffff81376532>] bus_remove_device+0xb2/0xf0
<4>[ 318.881637] [<ffffffff81374237>] device_del+0x127/0x1b0
<4>[ 318.881644] [<ffffffff813a74d5>] __scsi_remove_device+0xb5/0xc0
<4>[ 318.881650] [<ffffffff813a7510>] scsi_remove_device+0x30/0x50
<4>[ 318.881656] [<ffffffff813a7601>] __scsi_remove_target+0xb1/0xe0
<4>[ 318.881662] [<ffffffff813a76a0>] ? __remove_child+0x0/0x30
<4>[ 318.881667] [<ffffffff813a76c3>] __remove_child+0x23/0x30
<4>[ 318.881673] [<ffffffff8137399c>] device_for_each_child+0x4c/0x80
<4>[ 318.881679] [<ffffffff813a766e>] scsi_remove_target+0x3e/0x70
<4>[ 318.881686] [<ffffffff813abcc5>] sas_rphy_remove+0x75/0x80
<4>[ 318.881692] [<ffffffff813ac266>] sas_rphy_delete+0x16/0x30
<4>[ 318.881698] [<ffffffff813ac2aa>] sas_port_delete+0x2a/0x130
<4>[ 318.881704] [<ffffffff813bf3ca>] mpt2sas_transport_port_remove+0x15a/0x240
<4>[ 318.881711] [<ffffffff813ba9ed>] _scsih_remove_device+0xcd/0x120
<4>[ 318.881720] [<ffffffff81035d09>] ? default_spin_lock_flags+0x9/0x10
<4>[ 318.881726] [<ffffffff813bea00>] ? mpt2sas_transport_update_links+0x80/0x1a0
<4>[ 318.881733] [<ffffffff813be0ee>] _firmware_event_work+0x155e/0x1af0
<4>[ 318.881742] [<ffffffff8100860b>] ? __switch_to+0xcb/0x350
<4>[ 318.881749] [<ffffffff8104de5a>] ? finish_task_switch+0x4a/0xd0
<4>[ 318.881756] [<ffffffff813bcb90>] ? _firmware_event_work+0x0/0x1af0
<4>[ 318.881762] [<ffffffff810792cf>] worker_thread+0x17f/0x2b0
<4>[ 318.881769] [<ffffffff8107d9c0>] ? autoremove_wake_function+0x0/0x40
<4>[ 318.881775] [<ffffffff81079150>] ? worker_thread+0x0/0x2b0
<4>[ 318.881781] [<ffffffff8107d466>] kthread+0x96/0xa0
<4>[ 318.881787] [<ffffffff8100ae64>] kernel_thread_helper+0x4/0x10
<4>[ 318.881794] [<ffffffff8107d3d0>] ? kthread+0x0/0xa0
<4>[ 318.881799] [<ffffffff8100ae60>] ? kernel_thread_helper+0x0/0x10
fw_event0 seems to hang here, in sd_sync_cache(), waiting for a command sent to
the drive that has just been pulled. While it hangs, old IO requests don't
complete, so md/raid1 cannot proceed.
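
For anyone reading a trace like the one above, a small script can make the call
chain easier to follow. The sketch below is only an illustration, not part of
the original report: the helper name reliable_frames and the regular expression
are assumptions made for this example. It drops the frames marked with '?'
(addresses found on the stack that are not confirmed to be part of the call
chain) and prints the remaining functions, innermost first. The sample lines
are copied from the trace above.

```python
import re

# Matches one stack-frame line from a sysrq-T dump, for example:
#   <4>[ 318.881589] [<ffffffff813c5fd0>] sd_sync_cache+0xd0/0x120
# A '?' before the symbol marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return the function names of frames not marked '?', innermost first.

    Hypothetical helper for this example only.
    """
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            frames.append(match.group("func"))
    return frames


# A few lines copied from the fw_event0 trace in this message.
SAMPLE = """\
<4>[ 318.881520] [<ffffffff815a296d>] schedule_timeout+0x22d/0x310
<4>[ 318.881526] [<ffffffff813a21f0>] ? __scsi_queue_insert+0xb0/0x130
<4>[ 318.881533] [<ffffffff815a252b>] wait_for_common+0xdb/0x1a0
<4>[ 318.881553] [<ffffffff815a26cd>] wait_for_completion+0x1d/0x20
<4>[ 318.881560] [<ffffffff8129a9fe>] blk_execute_rq+0x8e/0xf0
<4>[ 318.881573] [<ffffffff813a129c>] scsi_execute+0xfc/0x160
<4>[ 318.881580] [<ffffffff813a2cec>] scsi_execute_req+0xac/0x180
<4>[ 318.881589] [<ffffffff813c5fd0>] sd_sync_cache+0xd0/0x120
<4>[ 318.881598] [<ffffffff815a187a>] ? printk+0x68/0x6e
<4>[ 318.881604] [<ffffffff813c6283>] sd_shutdown+0x83/0x1b0
"""

if __name__ == "__main__":
    print(" <- ".join(reliable_frames(SAMPLE)))
```

Read this way, the chain shows sd_shutdown calling sd_sync_cache, which ends up
in wait_for_completion via blk_execute_rq.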
NeilBrown