From: Greg KH <gregkh@suse.de>
To: Hugh Daschbach <hdasch@broadcom.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>,
Alan Stern <stern@rowland.harvard.edu>,
Jan Blunck <jblunck@suse.de>, David Vrabel <david.vrabel@csr.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: System reboot hangs due to race against devices_kset->list triggered by SCSI FC workqueue
Date: Tue, 2 Mar 2010 20:54:33 -0800
Message-ID: <20100303045433.GA27847@suse.de>
In-Reply-To: <233671224A0FED4688218FFDBED26E1A517AC38638@IRVEXCHCCR01.corp.ad.broadcom.com>
On Tue, Mar 02, 2010 at 04:47:01PM -0800, Hugh Daschbach wrote:
> The system may fail to reboot when the kernel's devices_kset->list gets
> written by another thread while device_shutdown() is traversing the
> list. Though not common, this is fairly reproducible for some SCSI
> Fibre Channel topologies; particularly so with FCoE configurations.
Really? What a mess :(
> The reboot thread calls device_shutdown() as part of system shutdown.
> device_shutdown() loops through devices_kset->list, shutting down each
> system device. But devices_kset->list isn't protected from other
> writers while device_shutdown() traverses the list.
Can't we just protect the list? What is wanting to write to the list
while shutdown is happening?
> One such secondary writer is the SCSI Fibre Channel workqueue. When
> fc_wq_N removes a device that device_shutdown() holds in its "devn"
> (list traversal iterator) variable, device_shutdown() stalls, chasing
> what is essentially a broken link.
>
> This is not a common occurrence. But FC SCSI devices associated with a
> link that has gone down cause a race between device_shutdown() running
> in reboot's process and scsi_remove_target() running in a SCSI FC
> workqueue (fc_wq_N).
>
> Network attached FC devices are particularly vulnerable because SysV
> init scripts shut network interfaces down before proceeding with the
> reboot request. So by the time reboot is called, the link to the FC
> devices is already down.
>
> When the link is down device_shutdown() stalls (in sd_shutdown() --
> which issues cache flush CDBs to what are, by that time, inaccessible
> devices). The stall ends when the fc rport timer expires. But the
> timer expiration also initiates fc_starget_delete() in the fc workqueue,
> causing the race with device_shutdown().
Can't you just not do this?
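
The stall described above, where device_shutdown() holds an
already-removed device in "devn", can be reproduced in miniature in
userspace. The following is a minimal single-threaded sketch, not
kernel code: struct node, del_init() and the other helpers are
hypothetical stand-ins for struct list_head and its helpers, and the
"other writer" is simulated inline rather than run from a real
workqueue. It assumes that removal leaves the node pointing at itself,
as list_del_init() does.

```c
#include <assert.h>

/* Minimal stand-in for the kernel's struct list_head (userspace sketch). */
struct node {
	struct node *prev, *next;
	int id;			/* 0 marks the list head */
};

static void node_init(struct node *n, int id)
{
	n->prev = n->next = n;
	n->id = id;
}

static void add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink the node and point it at itself, like list_del_init(). */
static void del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/*
 * Walk the list in reverse with a saved iterator (devn), the way a
 * *_safe_reverse() loop does.  While the walk is "shutting down" the
 * node whose id is trigger_id, a simulated second writer removes the
 * node held in devn.  Returns the number of steps taken, or -1 if the
 * walk exceeded max_steps (it would otherwise spin forever).
 */
static int walk_with_concurrent_removal(struct node *head, int trigger_id,
					int max_steps)
{
	struct node *dev, *devn;
	int steps = 0;

	for (dev = head->prev, devn = dev->prev; dev != head;
	     dev = devn, devn = dev->prev) {
		if (++steps > max_steps)
			return -1;
		if (dev->id == trigger_id && devn != head)
			del_init(devn);	/* the "other writer" */
	}
	return steps;
}

/* Build a four-node list and walk it; trigger_id 0 means no race. */
static int demo_stale_iterator(int trigger_id)
{
	struct node head, n[4];
	int i;

	node_init(&head, 0);
	for (i = 0; i < 4; i++) {
		node_init(&n[i], i + 1);
		add_tail(&head, &n[i]);
	}
	assert(head.prev == &n[3]);
	return walk_with_concurrent_removal(&head, trigger_id, 1000);
}
```

With no interference the walk visits all four nodes. Once the saved
node is removed underneath it, the walk lands on a node whose prev
pointer leads back to itself and never reaches the list head again.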
> The attached patch detects and attempts to recover from the
> corruption. But this can hardly be considered a fix, as it does not
> address the race between device_shutdown() and scsi_remove_target().
I agree, this patch isn't ok. It should be handled in the SCSI core, as
it looks like a SCSI problem, not a driver core problem, right?
> Perhaps converting the list_for_each_entry_safe_reverse() to something
> like
>
> while (!list_empty(&devices_kset->list)) {
>         dev = list_last_entry(...);
>         ...
> }
>
> might be appropriate. But I have no idea whether some devices don't
> fully remove themselves from the list when shut down.
That shouldn't really solve the problem, right?
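
For reference, the quoted while (!list_empty(...)) idea can be sketched
in userspace as below. This is a hedged illustration under stated
assumptions, not the proposed patch: the helper names are hypothetical,
the tail is read directly from head->prev instead of through the
elided list_last_entry(...) call, and the loop unlinks each entry
itself rather than relying on shutdown to remove it (which sidesteps
the open question in the quoted text). The second writer is again
simulated inline. In real concurrent code every list access would
still need to happen under the list's lock, with a reference held on
the device while the lock is dropped.

```c
#include <assert.h>

struct node {
	struct node *prev, *next;
	int id;			/* 0 marks the list head */
	int shut_down;
};

static void node_init(struct node *n, int id)
{
	n->prev = n->next = n;
	n->id = id;
	n->shut_down = 0;
}

static void add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static int list_is_empty(const struct node *head)
{
	return head->next == head;
}

/*
 * Drain loop: re-read the tail from the head on every pass, unlink it,
 * then shut it down.  No iterator is carried across the shutdown step,
 * so a removal by someone else cannot leave a stale pointer behind.
 * Returns the number of nodes this loop shut down.
 */
static int drain_with_concurrent_removal(struct node *head, int trigger_id)
{
	int shut = 0;

	while (!list_is_empty(head)) {
		struct node *dev = head->prev;

		del_init(dev);
		if (dev->id == trigger_id && !list_is_empty(head))
			del_init(head->prev);	/* the "other writer" */
		dev->shut_down = 1;
		shut++;
	}
	return shut;
}

/* Build a four-node list and drain it; trigger_id 0 means no race. */
static int demo_drain(int trigger_id)
{
	struct node head, n[4];
	int i;

	node_init(&head, 0);
	for (i = 0; i < 4; i++) {
		node_init(&n[i], i + 1);
		add_tail(&head, &n[i]);
	}
	return drain_with_concurrent_removal(&head, trigger_id);
}
```

The loop terminates in both cases. When the simulated writer removes a
node, the drain simply shuts down one fewer entry instead of spinning
on a self-linked node.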
> Does anyone have any guidance for what would make a more appropriate
> fix?
So the SCSI core is trying to remove a device at the same time shutdown
is happening, right? So we need to protect the list somehow. Maybe just
switch it over to a klist, which should handle this for us?
Can you try that?
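
The general idea behind a klist-style list, pinning the node under the
iterator with a reference so that removal is deferred until the
iterator has moved on, can be sketched in userspace as follows. This
is an illustration only and does not use or reproduce the kernel's
klist API: struct knode and the knode_*() helpers are hypothetical
names, and the locking a real implementation needs around every
operation is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct knode {
	struct knode *prev, *next;
	int refs;	/* one reference is owned by the list itself */
	int dead;	/* set once removal has been requested */
	char name;
};

static void knode_init_head(struct knode *h)
{
	h->prev = h->next = h;
	h->refs = 0;
	h->dead = 0;
	h->name = '\0';
}

static void knode_add_tail(struct knode *head, struct knode *n, char name)
{
	n->name = name;
	n->refs = 1;
	n->dead = 0;
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Drop one reference; the node is unlinked only when the last one goes. */
static void knode_put(struct knode *n)
{
	assert(n->refs > 0);
	if (--n->refs == 0) {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		n->prev = n->next = n;
	}
}

/* Request removal: mark the node dead and drop the list's reference. */
static void knode_del(struct knode *n)
{
	if (!n->dead) {
		n->dead = 1;
		knode_put(n);
	}
}

/*
 * Advance the iterator: pin the next live node, then unpin the current
 * one.  The current node stays linked until its successor has been
 * found, so its next pointer is always valid.
 */
static struct knode *knode_next(struct knode *head, struct knode *cur)
{
	struct knode *n = cur ? cur->next : head->next;

	while (n != head && n->dead)
		n = n->next;
	if (n != head)
		n->refs++;
	if (cur)
		knode_put(cur);
	return n != head ? n : NULL;
}

/* Walk A, B, C, D; while visiting B, remove both C and B itself. */
static void demo_pinned_walk(char *out)
{
	struct knode head, n[4];
	struct knode *cur = NULL;
	int i;

	knode_init_head(&head);
	for (i = 0; i < 4; i++)
		knode_add_tail(&head, &n[i], (char)('A' + i));

	while ((cur = knode_next(&head, cur)) != NULL) {
		if (cur->name == 'B') {
			knode_del(&n[2]);	/* remove the successor, C */
			knode_del(cur);		/* remove the visited node */
		}
		*out++ = cur->name;
	}
	*out = '\0';
}
```

Removing the successor takes effect at once, while removing the node
being visited is deferred until the iterator steps past it, so the
walk never follows a link out of an unlinked node.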
thanks,
greg k-h