From: Greg KH <gregkh@suse.de>
To: Hugh Daschbach <hdasch@broadcom.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>,
Alan Stern <stern@rowland.harvard.edu>,
Jan Blunck <jblunck@suse.de>, David Vrabel <david.vrabel@csr.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: System reboot hangs due to race against devices_kset->list triggered by SCSI FC workqueue
Date: Tue, 2 Mar 2010 20:54:33 -0800
Message-ID: <20100303045433.GA27847@suse.de>
In-Reply-To: <233671224A0FED4688218FFDBED26E1A517AC38638@IRVEXCHCCR01.corp.ad.broadcom.com>

On Tue, Mar 02, 2010 at 04:47:01PM -0800, Hugh Daschbach wrote:
> The system may fail to reboot when the kernel's devices_kset->list gets
> written by another thread while device_shutdown() is traversing the
> list. Though not common, this is fairly reproducible for some SCSI
> Fibre Channel topologies; particularly so with FCoE configurations.
Really? What a mess :(
> The reboot thread calls device_shutdown() as part of system shutdown.
> device_shutdown() loops through devices_kset->list, shutting down each
> system device. But devices_kset->list isn't protected from other
> writers while device_shutdown() traverses the list.
Can't we just protect the list? What is wanting to write to the list
while shutdown is happening?
> One such secondary writer is the SCSI Fibre Channel workqueue. When
> fc_wq_N removes a device that device_shutdown() holds in its "devn"
> (list traversal iterator) variable, device_shutdown() stalls, chasing
> what is essentially a broken link.
>
> This is not a common occurrence. But FC SCSI devices associated with a
> link that has gone down cause a race between device_shutdown() running
> in reboot's process and scsi_remove_target() running in a SCSI FC
> workqueue (fc_wq_N).
>
> Network attached FC devices are particularly vulnerable because SysV
> init scripts shut network interfaces down before proceeding with the
> reboot request. So by the time reboot is called, the link to the FC
> devices is already down.
>
> When the link is down device_shutdown() stalls (in sd_shutdown() --
> which issues cache flush CDBs to what are, by that time, inaccessible
> devices). The stall ends when the fc rport timer expires. But the
> timer expiration also initiates fc_starget_delete() in the fc workqueue,
> causing the race with device_shutdown().
Can't you just not do this?
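
For illustration only, here is a minimal userspace sketch of how a reverse walk with a saved "next to visit" pointer can stall when another writer removes the saved node. It is not kernel code: struct list_head and the helpers are simplified stand-ins, walk_reverse() is a hypothetical name, and the second writer is simulated inline instead of running in a workqueue. It also assumes the removal leaves the node linked to itself, as list_del_init() does.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Like list_del_init(): unlink the node and leave it pointing at itself. */
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/*
 * Walk three nodes in reverse, keeping the next node to visit in devn.
 * If remove_saved is nonzero, the node held in devn is removed while
 * the last node is being visited (the simulated second writer).
 * Returns the number of nodes visited, or -1 if the walk never gets
 * back to the list head.
 */
int walk_reverse(int remove_saved)
{
	struct list_head head, dev[3];
	struct list_head *cur, *devn;
	int i, steps = 0;

	list_init(&head);
	for (i = 0; i < 3; i++)
		list_add_tail(&dev[i], &head);

	for (cur = head.prev, devn = cur->prev; cur != &head;
	     cur = devn, devn = cur->prev) {
		if (++steps > 100)
			return -1;	/* stuck on a self-linked node */
		if (remove_saved && cur == &dev[2])
			list_del_init(&dev[1]);	/* devn now points off-list */
	}
	return steps;
}
```

Calling walk_reverse(0) visits all three nodes. Calling walk_reverse(1) never reaches the head again, because the saved node links only to itself once it has been removed; the step limit is there only so the sketch terminates.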
> The attached patch detects and attempts to recover from the
> corruption. But this can hardly be considered a fix, as it does not
> address the race between device_shutdown() and scsi_remove_target().
I agree, this patch isn't ok. It should be handled in the scsi core, as
it looks like a scsi problem, not a driver core problem, right?
> Perhaps converting the list_for_each_entry_safe_reverse() to something
> like:
>
> 	while (!list_empty(&devices_kset->list)) {
> 		dev = list_last_entry(...);
> 		...
> 	}
>
> might be appropriate. But I have no idea if any devices don't fully
> remove themselves from the list when shut down.
That shouldn't really solve the problem, right?
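
To make the quoted loop concrete, here is a rough userspace sketch of one way it could look, under stated assumptions: a pthread mutex stands in for whatever lock would guard the list, struct device is a hypothetical two-field struct, and the loop detaches each entry itself before dropping the lock rather than relying on the device to remove itself. It does not deal with device lifetime (real code would need to hold a reference while the lock is dropped), so treat it as a sketch of the loop shape only, not a fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical device: only what the sketch needs. */
struct device {
	struct list_head entry;
	int shut_down;
};

#define to_device(p) \
	((struct device *)((char *)(p) - offsetof(struct device, entry)))

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink the node and leave it pointing at itself. */
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/* Simulated second writer: removes a device under the same lock. */
static void remove_device(struct device *dev)
{
	pthread_mutex_lock(&list_lock);
	list_del_init(&dev->entry);
	pthread_mutex_unlock(&list_lock);
}

/*
 * Shut devices down from the tail until the list is empty. No "next"
 * pointer is carried across iterations; the tail is re-read under the
 * lock each time. If victim is non-NULL it is removed by the simulated
 * second writer during the first shutdown. Returns the number of
 * devices shut down.
 */
int shutdown_all(struct list_head *head, struct device *victim)
{
	int count = 0;

	pthread_mutex_lock(&list_lock);
	while (head->prev != head) {
		struct device *dev = to_device(head->prev);

		list_del_init(&dev->entry);	/* detach before unlocking */
		pthread_mutex_unlock(&list_lock);

		dev->shut_down = 1;	/* stand-in for the shutdown callback */
		count++;
		if (victim) {
			remove_device(victim);
			victim = NULL;
		}

		pthread_mutex_lock(&list_lock);
	}
	pthread_mutex_unlock(&list_lock);
	return count;
}

/* Build a three-device list and remove the middle one mid-shutdown. */
int demo_shutdown(void)
{
	struct list_head head = { &head, &head };
	struct device dev[3];
	int i;

	for (i = 0; i < 3; i++) {
		dev[i].shut_down = 0;
		list_add_tail(&dev[i].entry, &head);
	}
	return shutdown_all(&head, &dev[1]);
}
```

In demo_shutdown() the loop terminates after shutting down two of the three devices: the one removed by the second writer is simply no longer on the list when the tail is re-read.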
> Does anyone have any guidance for what would make a more appropriate
> fix?
So the scsi core is trying to remove a device at the same time shutdown
is happening, right? So we need to protect the list somehow, maybe just
switch it over to use a klist which should handle this for us instead?
Can you try that?
thanks,
greg k-h