* Bug in SCSI async probing
@ 2009-05-26 15:22 Alan Stern
2009-05-26 15:34 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Alan Stern @ 2009-05-26 15:22 UTC (permalink / raw)
To: James Bottomley, Arjan van de Ven
Cc: SCSI development list, Kernel development list
James & Arjan:
Am I missing something here? It looks like
fastboot: make scsi probes asynchronous
has introduced a bug in the sd probing code. AFAICT, there is now
nothing to prevent do_scan_async() from returning before
sd_probe_async() has run.
Doesn't this mean that there's nothing to prevent sd_remove() from
being called and trying to unregister the disk _before_
sd_probe_async() has managed to register it?
Alan Stern
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 15:22 Bug in SCSI async probing Alan Stern
@ 2009-05-26 15:34 ` James Bottomley
2009-05-26 18:34 ` Alan Stern
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2009-05-26 15:34 UTC (permalink / raw)
To: Alan Stern
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote:
> James & Arjan:
>
> Am I missing something here? It looks like
>
> fastboot: make scsi probes asynchronous
>
> has introduced a bug in the sd probing code. AFAICT, there is now
> nothing to prevent do_scan_async() from returning before
> sd_probe_async() has run.
True, but this isn't really a problem.
> Doesn't this mean that there's nothing to prevent sd_remove() from
> being called and trying to unregister the disk _before_
> sd_probe_async() has managed to register it?
Yes, we've been discussing this ... most of the removal functions now
need async_synchronize calls to mitigate this type of race.
James
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 15:34 ` James Bottomley
@ 2009-05-26 18:34 ` Alan Stern
2009-05-26 18:56 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Alan Stern @ 2009-05-26 18:34 UTC (permalink / raw)
To: James Bottomley
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 26 May 2009, James Bottomley wrote:
> On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote:
> > James & Arjan:
> >
> > Am I missing something here? It looks like
> >
> > fastboot: make scsi probes asynchronous
> >
> > has introduced a bug in the sd probing code. AFAICT, there is now
> > nothing to prevent do_scan_async() from returning before
> > sd_probe_async() has run.
>
> True, but this isn't really a problem.
Why not? I'd say an oops is a problem. :-)
> > Doesn't this mean that there's nothing to prevent sd_remove() from
> > being called and trying to unregister the disk _before_
> > sd_probe_async() has managed to register it?
>
> Yes, we've been discussing this ... most of the removal functions now
> need async_synchronize calls to mitigate this type of race.
Such as this?
Index: usb-2.6/drivers/scsi/scsi_scan.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_scan.c
+++ usb-2.6/drivers/scsi/scsi_scan.c
@@ -1866,6 +1866,12 @@ void scsi_forget_host(struct Scsi_Host *
struct scsi_device *sdev;
unsigned long flags;
+ /*
+ * Don't try to get rid of this host's devices until all the async
+ * probing is finished.
+ */
+ async_synchronize_full();
+
restart:
spin_lock_irqsave(shost->host_lock, flags);
list_for_each_entry(sdev, &shost->__devices, siblings) {
(Which reminds me... Are the calls in wait_scan_init() really enough?
wait_for_device_probe() does async_synchronize_full() and then
scsi_complete_async_scans() finishes the SCSI scanning. But if this
scanning involves calling sd_probe(), then more async work will be
queued. Maybe a second call to wait_for_device_probe() is needed.)
There's still more; the patch above isn't sufficient. What happens if
the "device_add(&sdkp->dev)" call in sd_probe_async() fails? Then in
sd_remove(), sdkp will be NULL and &sdkp->dev will be meaningless. The
device_del() call will crash and the actual scsi_disk structure will be
leaked. This could be fixed by moving the dev_set_drvdata() call from
the end of sd_probe_async() back into sd_probe(), but then we'd find
sd_remove trying to unregister a device which was never successfully
registered.
And why is it that the "out_free_index:" code in sd_probe() acquires
sd_index_lock but the corresponding code in sd_probe_async() doesn't?
Alan Stern
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 18:34 ` Alan Stern
@ 2009-05-26 18:56 ` James Bottomley
2009-05-26 19:48 ` Alan Stern
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2009-05-26 18:56 UTC (permalink / raw)
To: Alan Stern
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 2009-05-26 at 14:34 -0400, Alan Stern wrote:
> On Tue, 26 May 2009, James Bottomley wrote:
>
> > On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote:
> > > James & Arjan:
> > >
> > > Am I missing something here? It looks like
> > >
> > > fastboot: make scsi probes asynchronous
> > >
> > > has introduced a bug in the sd probing code. AFAICT, there is now
> > > nothing to prevent do_scan_async() from returning before
> > > sd_probe_async() has run.
> >
> > True, but this isn't really a problem.
>
> Why not? I'd say an oops is a problem. :-)
Details?... In theory the sd driver can be attached at any time and
nothing should be relying on it being there when the inquiry scan
finishes, so if there's a bug it would be exposed by async scanning, not
really caused by it.
> > > Doesn't this mean that there's nothing to prevent sd_remove() from
> > > being called and trying to unregister the disk _before_
> > > sd_probe_async() has managed to register it?
> >
> > Yes, we've been discussing this ... most of the removal functions now
> > need async_synchronize calls to mitigate this type of race.
>
> Such as this?
>
>
> Index: usb-2.6/drivers/scsi/scsi_scan.c
> ===================================================================
> --- usb-2.6.orig/drivers/scsi/scsi_scan.c
> +++ usb-2.6/drivers/scsi/scsi_scan.c
> @@ -1866,6 +1866,12 @@ void scsi_forget_host(struct Scsi_Host *
> struct scsi_device *sdev;
> unsigned long flags;
>
> + /*
> + * Don't try to get rid of this host's devices until all the async
> + * probing is finished.
> + */
> + async_synchronize_full();
No, scsi_complete_async_scans() here. There should be an
async_synchronize_full() in sd_remove.
> +
> restart:
> spin_lock_irqsave(shost->host_lock, flags);
> list_for_each_entry(sdev, &shost->__devices, siblings) {
>
>
>
> (Which reminds me... Are the calls in wait_scan_init() really enough?
> wait_for_device_probe() does async_synchronize_full() and then
> scsi_complete_async_scans() finishes the SCSI scanning. But if this
> scanning involves calling sd_probe(), then more async work will be
> queued. Maybe a second call to wait_for_device_probe() is needed.)
> There's still more; the patch above isn't sufficient. What happens if
> the "device_add(&sdkp->dev)" call in sd_probe_async() fails? Then in
> sd_remove(), sdkp will be NULL and &sdkp->dev will be meaningless. The
> device_del() call will crash and the actual scsi_disk structure will be
> leaked. This could be fixed by moving the dev_set_drvdata() call from
> the end of sd_probe_async() back into sd_probe(), but then we'd find
> sd_remove trying to unregister a device which was never successfully
> registered.
None of this really got reviewed through the SCSI list, so I'll let
Arjan answer.
> And why is it that the "out_free_index:" code in sd_probe() acquires
> sd_index_lock but the corresponding code in sd_probe_async() doesn't?
This one looks to be a mismerge between the async tree and the SCSI
tree.
James
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 18:56 ` James Bottomley
@ 2009-05-26 19:48 ` Alan Stern
2009-05-26 20:35 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Alan Stern @ 2009-05-26 19:48 UTC (permalink / raw)
To: James Bottomley
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 26 May 2009, James Bottomley wrote:
> Details?... In theory the sd driver can be attached at any time and
> nothing should be relying on it being there when the inquiry scan
> finishes, so if there's a bug it would be exposed by async scanning, not
> really caused by it.
Provided the async scanning is implemented correctly...
> > (Which reminds me... Are the calls in wait_scan_init() really enough?
> > wait_for_device_probe() does async_synchronize_full() and then
> > scsi_complete_async_scans() finishes the SCSI scanning. But if this
> > scanning involves calling sd_probe(), then more async work will be
> > queued. Maybe a second call to wait_for_device_probe() is needed.)
You didn't respond to this point.
> None of this really got reviewed through the SCSI list, so I'll let
> Arjan answer.
>
> > And why is it that the "out_free_index:" code in sd_probe() acquires
> > sd_index_lock but the corresponding code in sd_probe_async() doesn't?
>
> This one looks to be a mismerge between the async tree and the SCSI
> tree.
Well then, how does this patch look?
Alan Stern
Index: usb-2.6/drivers/scsi/sd.c
===================================================================
--- usb-2.6.orig/drivers/scsi/sd.c
+++ usb-2.6/drivers/scsi/sd.c
@@ -1948,7 +1948,6 @@ static void sd_probe_async(void *data, a
if (sdp->removable)
gd->flags |= GENHD_FL_REMOVABLE;
- dev_set_drvdata(dev, sdkp);
add_disk(gd);
sd_dif_config_host(sdkp);
@@ -1958,7 +1957,9 @@ static void sd_probe_async(void *data, a
return;
out_free_index:
+ spin_lock(&sd_index_lock);
ida_remove(&sd_index_ida, index);
+ spin_unlock(&sd_index_lock);
}
/**
@@ -2019,6 +2020,7 @@ static int sd_probe(struct device *dev)
if (error)
goto out_free_index;
+ dev_set_drvdata(dev, sdkp);
sdkp->device = sdp;
sdkp->driver = &sd_template;
sdkp->disk = gd;
@@ -2057,8 +2059,13 @@ static int sd_remove(struct device *dev)
{
struct scsi_disk *sdkp = dev_get_drvdata(dev);
- device_del(&sdkp->dev);
- del_gendisk(sdkp->disk);
+ /* Wait for sd_probe_async to finish */
+ async_synchronize_full();
+
+ if (device_is_registered(&sdkp->dev)) {
+ device_del(&sdkp->dev);
+ del_gendisk(sdkp->disk);
+ }
sd_shutdown(dev);
mutex_lock(&sd_ref_mutex);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 19:48 ` Alan Stern
@ 2009-05-26 20:35 ` James Bottomley
2009-05-26 21:04 ` Alan Stern
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2009-05-26 20:35 UTC (permalink / raw)
To: Alan Stern
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 2009-05-26 at 15:48 -0400, Alan Stern wrote:
> On Tue, 26 May 2009, James Bottomley wrote:
>
> > Details?... In theory the sd driver can be attached at any time and
> > nothing should be relying on it being there when the inquiry scan
> > finishes, so if there's a bug it would be exposed by async scanning, not
> > really caused by it.
>
> Provided the async scanning is implemented correctly...
>
> > > (Which reminds me... Are the calls in wait_scan_init() really enough?
> > > wait_for_device_probe() does async_synchronize_full() and then
> > > scsi_complete_async_scans() finishes the SCSI scanning. But if this
> > > scanning involves calling sd_probe(), then more async work will be
> > > queued. Maybe a second call to wait_for_device_probe() is needed.)
>
> You didn't respond to this point.
Well this was the response:
> > None of this really got reviewed through the SCSI list, so I'll let
> > Arjan answer.
> >
> > > And why is it that the "out_free_index:" code in sd_probe() acquires
> > > sd_index_lock but the corresponding code in sd_probe_async() doesn't?
> >
> > This one looks to be a mismerge between the async tree and the SCSI
> > tree.
>
> Well then, how does this patch look?
Well, it's adding complexity, the best fix is to let async only take
care of the pieces which can't fail, that way we don't need complex
error handling. The piece that slows probing isn't really the sysfs
appearances, it's the SCSI probing, and the last piece that needs error
handling is the device_add() for sysfs visibility, so that should be the
dividing line between sync and async.
This should restore the logical flow and fix all the error leg problems
(by eliminating the error legs).
James
---
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bcf3bd4..d74315d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1902,24 +1902,6 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
index = sdkp->index;
dev = &sdp->sdev_gendev;
- if (!sdp->request_queue->rq_timeout) {
- if (sdp->type != TYPE_MOD)
- blk_queue_rq_timeout(sdp->request_queue, SD_TIMEOUT);
- else
- blk_queue_rq_timeout(sdp->request_queue,
- SD_MOD_TIMEOUT);
- }
-
- device_initialize(&sdkp->dev);
- sdkp->dev.parent = &sdp->sdev_gendev;
- sdkp->dev.class = &sd_disk_class;
- dev_set_name(&sdkp->dev, dev_name(&sdp->sdev_gendev));
-
- if (device_add(&sdkp->dev))
- goto out_free_index;
-
- get_device(&sdp->sdev_gendev);
-
if (index < SD_MAX_DISKS) {
gd->major = sd_major((index & 0xf0) >> 4);
gd->first_minor = ((index & 0xf) << 4) | (index & 0xfff00);
@@ -1954,11 +1936,6 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
sdp->removable ? "removable " : "");
-
- return;
-
- out_free_index:
- ida_remove(&sd_index_ida, index);
}
/**
@@ -2026,6 +2003,24 @@ static int sd_probe(struct device *dev)
sdkp->openers = 0;
sdkp->previous_state = 1;
+ if (!sdp->request_queue->rq_timeout) {
+ if (sdp->type != TYPE_MOD)
+ blk_queue_rq_timeout(sdp->request_queue, SD_TIMEOUT);
+ else
+ blk_queue_rq_timeout(sdp->request_queue,
+ SD_MOD_TIMEOUT);
+ }
+
+ device_initialize(&sdkp->dev);
+ sdkp->dev.parent = &sdp->sdev_gendev;
+ sdkp->dev.class = &sd_disk_class;
+ dev_set_name(&sdkp->dev, dev_name(&sdp->sdev_gendev));
+
+ if (device_add(&sdkp->dev))
+ goto out_free_index;
+
+ get_device(&sdp->sdev_gendev);
+
async_schedule(sd_probe_async, sdkp);
return 0;
@@ -2057,6 +2052,7 @@ static int sd_remove(struct device *dev)
{
struct scsi_disk *sdkp = dev_get_drvdata(dev);
+ async_synchronize_full();
device_del(&sdkp->dev);
del_gendisk(sdkp->disk);
sd_shutdown(dev);
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 20:35 ` James Bottomley
@ 2009-05-26 21:04 ` Alan Stern
2009-05-26 21:23 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Alan Stern @ 2009-05-26 21:04 UTC (permalink / raw)
To: James Bottomley
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 26 May 2009, James Bottomley wrote:
> > > > (Which reminds me... Are the calls in wait_scan_init() really enough?
> > > > wait_for_device_probe() does async_synchronize_full() and then
> > > > scsi_complete_async_scans() finishes the SCSI scanning. But if this
> > > > scanning involves calling sd_probe(), then more async work will be
> > > > queued. Maybe a second call to wait_for_device_probe() is needed.)
> >
> > You didn't respond to this point.
>
> Well this was the response:
>
> > > None of this really got reviewed through the SCSI list, so I'll let
> > > Arjan answer.
Whoops, for some reason I had gotten the idea that you wrote
wait_scan_init().
> > Well then, how does this patch look?
>
> Well, it's adding complexity, the best fix is to let async only take
> care of the pieces which can't fail, that way we don't need complex
> error handling. The piece that slows probing isn't really the sysfs
> appearances, it's the SCSI probing, and the last piece that needs error
> handling is the device_add() for sysfs visibility, so that should be the
> dividing line between sync and async.
I had exactly the same thought, but it seemed less intrusive to keep
the dividing line where Arjan put it.
> This should restore the logical flow and fix all the error leg problems
> (by eliminating the error legs).
You forgot to move the dev_set_drvdata() call into the synchronous
part. Apart from that it looks fine. Should I submit it officially?
Also, I do still think that wait_scan_init() needs an extra call to
async_synchronize_full() at the end.
Alan Stern
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 21:04 ` Alan Stern
@ 2009-05-26 21:23 ` James Bottomley
2009-05-26 21:52 ` Alan Stern
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2009-05-26 21:23 UTC (permalink / raw)
To: Alan Stern
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 2009-05-26 at 17:04 -0400, Alan Stern wrote:
> On Tue, 26 May 2009, James Bottomley wrote:
>
> > > > > (Which reminds me... Are the calls in wait_scan_init() really enough?
> > > > > wait_for_device_probe() does async_synchronize_full() and then
> > > > > scsi_complete_async_scans() finishes the SCSI scanning. But if this
> > > > > scanning involves calling sd_probe(), then more async work will be
> > > > > queued. Maybe a second call to wait_for_device_probe() is needed.)
> > >
> > > You didn't respond to this point.
> >
> > Well this was the response:
> >
> > > > None of this really got reviewed through the SCSI list, so I'll let
> > > > Arjan answer.
>
> Whoops, for some reason I had gotten the idea that you wrote
> wait_scan_init().
>
> > > Well then, how does this patch look?
> >
> > Well, it's adding complexity, the best fix is to let async only take
> > care of the pieces which can't fail, that way we don't need complex
> > error handling. The piece that slows probing isn't really the sysfs
> > appearances, it's the SCSI probing, and the last piece that needs error
> > handling is the device_add() for sysfs visibility, so that should be the
> > dividing line between sync and async.
>
> I had exactly the same thought, but it seemed less intrusive to keep
> the dividing line where Arjan put it.
Not really ... the problem where it is becomes one of error handling.
It's better to complete all the inline error handling before becoming
async ... given the code state, that's provably correct (assuming the
original was).
> > This should restore the logical flow and fix all the error leg problems
> > (by eliminating the error legs).
>
> You forgot to move the dev_set_drvdata() call into the synchronous
> part. Apart from that it looks fine. Should I submit it officially?
Well, no ... I forgot to get the drive data after the async_synchronize.
Since this is an sd bug fix, I'll take it through the SCSI trees.
> Also, I do still think that wait_scan_init() needs an extra call to
> async_synchronize_full() at the end.
I need to understand it before commenting ... I'm slowly working my way
through the async patches.
James
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing
2009-05-26 21:23 ` James Bottomley
@ 2009-05-26 21:52 ` Alan Stern
0 siblings, 0 replies; 9+ messages in thread
From: Alan Stern @ 2009-05-26 21:52 UTC (permalink / raw)
To: James Bottomley
Cc: Arjan van de Ven, SCSI development list, Kernel development list
On Tue, 26 May 2009, James Bottomley wrote:
> > Also, I do still think that wait_scan_init() needs an extra call to
> > async_synchronize_full() at the end.
>
> I need to understand it before commenting ... I'm slowly working my way
> through the async patches.
There are two lists of queued asynchronous tasks: Arjan's list and the
SCSI scanning list. As things stand now, routines in each list are
capable of submitting new jobs for the other list. That is, an async
task may discover and register a new SCSI host (requiring scanning),
and the probe of a SCSI disk discovered during scanning will submit
a sd_probe_async task.
This makes it difficult to wait for both lists to become idle. Maybe
the whole idea behind scsi_wait_scan.c needs some rethinking...
Alan Stern
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-05-26 21:52 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-26 15:22 Bug in SCSI async probing Alan Stern
2009-05-26 15:34 ` James Bottomley
2009-05-26 18:34 ` Alan Stern
2009-05-26 18:56 ` James Bottomley
2009-05-26 19:48 ` Alan Stern
2009-05-26 20:35 ` James Bottomley
2009-05-26 21:04 ` Alan Stern
2009-05-26 21:23 ` James Bottomley
2009-05-26 21:52 ` Alan Stern
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox