* Bug in SCSI async probing @ 2009-05-26 15:22 Alan Stern 2009-05-26 15:34 ` James Bottomley 0 siblings, 1 reply; 9+ messages in thread From: Alan Stern @ 2009-05-26 15:22 UTC (permalink / raw) To: James Bottomley, Arjan van de Ven Cc: SCSI development list, Kernel development list James & Arjan: Am I missing something here? It looks like fastboot: make scsi probes asynchronous has introduced a bug in the sd probing code. AFAICT, there is now nothing to prevent do_scan_async() from returning before sd_probe_async() has run. Doesn't this mean that there's nothing to prevent sd_remove() from being called and trying to unregister the disk _before_ sd_probe_async() has managed to register it? Alan Stern ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 15:22 Bug in SCSI async probing Alan Stern @ 2009-05-26 15:34 ` James Bottomley 2009-05-26 18:34 ` Alan Stern 0 siblings, 1 reply; 9+ messages in thread From: James Bottomley @ 2009-05-26 15:34 UTC (permalink / raw) To: Alan Stern Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote: > James & Arjan: > > Am I missing something here? It looks like > > fastboot: make scsi probes asynchronous > > has introduced a bug in the sd probing code. AFAICT, there is now > nothing to prevent do_scan_async() from returning before > sd_probe_async() has run. True, but this isn't really a problem. > Doesn't this mean that there's nothing to prevent sd_remove() from > being called and trying to unregister the disk _before_ > sd_probe_async() has managed to register it? Yes, we've been discussing this ... most of the removal functions now need async_synchronize calls to mitigate this type of race. James ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 15:34 ` James Bottomley @ 2009-05-26 18:34 ` Alan Stern 2009-05-26 18:56 ` James Bottomley 0 siblings, 1 reply; 9+ messages in thread From: Alan Stern @ 2009-05-26 18:34 UTC (permalink / raw) To: James Bottomley Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 26 May 2009, James Bottomley wrote: > On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote: > > James & Arjan: > > > > Am I missing something here? It looks like > > > > fastboot: make scsi probes asynchronous > > > > has introduced a bug in the sd probing code. AFAICT, there is now > > nothing to prevent do_scan_async() from returning before > > sd_probe_async() has run. > > True, but this isn't really a problem. Why not? I'd say an oops is a problem. :-) > > Doesn't this mean that there's nothing to prevent sd_remove() from > > being called and trying to unregister the disk _before_ > > sd_probe_async() has managed to register it? > > Yes, we've been discussing this ... most of the removal functions now > need async_synchronize calls to mitigate this type of race. Such as this? Index: usb-2.6/drivers/scsi/scsi_scan.c =================================================================== --- usb-2.6.orig/drivers/scsi/scsi_scan.c +++ usb-2.6/drivers/scsi/scsi_scan.c @@ -1866,6 +1866,12 @@ void scsi_forget_host(struct Scsi_Host * struct scsi_device *sdev; unsigned long flags; + /* + * Don't try to get rid of this host's devices until all the async + * probing is finished. + */ + async_synchronize_full(); + restart: spin_lock_irqsave(shost->host_lock, flags); list_for_each_entry(sdev, &shost->__devices, siblings) { (Which reminds me... Are the calls in wait_scan_init() really enough? wait_for_device_probe() does async_synchronize_full() and then scsi_complete_async_scans() finishes the SCSI scanning. But if this scanning involves calling sd_probe(), then more async work will be queued. Maybe a second call to wait_for_device_probe() is needed.) There's still more; the patch above isn't sufficient. What happens if the "device_add(&sdkp->dev)" call in sd_probe_async() fails? Then in sd_remove(), sdkp will be NULL and &sdkp->dev will be meaningless. The device_del() call will crash and the actual scsi_disk structure will be leaked. This could be fixed by moving the dev_set_drvdata() call from the end of sd_probe_async() back into sd_probe(), but then we'd find sd_remove trying to unregister a device which was never successfully registered. And why is it that the "out_free_index:" code in sd_probe() acquires sd_index_lock but the corresponding code in sd_probe_async() doesn't? Alan Stern ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 18:34 ` Alan Stern @ 2009-05-26 18:56 ` James Bottomley 2009-05-26 19:48 ` Alan Stern 0 siblings, 1 reply; 9+ messages in thread From: James Bottomley @ 2009-05-26 18:56 UTC (permalink / raw) To: Alan Stern Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 2009-05-26 at 14:34 -0400, Alan Stern wrote: > On Tue, 26 May 2009, James Bottomley wrote: > > > On Tue, 2009-05-26 at 11:22 -0400, Alan Stern wrote: > > > James & Arjan: > > > > > > Am I missing something here? It looks like > > > > > > fastboot: make scsi probes asynchronous > > > > > > has introduced a bug in the sd probing code. AFAICT, there is now > > > nothing to prevent do_scan_async() from returning before > > > sd_probe_async() has run. > > > > True, but this isn't really a problem. > > Why not? I'd say an oops is a problem. :-) Details?... In theory the sd driver can be attached at any time and nothing should be relying on it being there when the inquiry scan finishes, so if there's a bug it would be exposed by async scanning, not really caused by it. > > > Doesn't this mean that there's nothing to prevent sd_remove() from > > > being called and trying to unregister the disk _before_ > > > sd_probe_async() has managed to register it? > > > > Yes, we've been discussing this ... most of the removal functions now > > need async_synchronize calls to mitigate this type of race. > > Such as this? > > > Index: usb-2.6/drivers/scsi/scsi_scan.c > =================================================================== > --- usb-2.6.orig/drivers/scsi/scsi_scan.c > +++ usb-2.6/drivers/scsi/scsi_scan.c > @@ -1866,6 +1866,12 @@ void scsi_forget_host(struct Scsi_Host * > struct scsi_device *sdev; > unsigned long flags; > > + /* > + * Don't try to get rid of this host's devices until all the async > + * probing is finished. > + */ > + async_synchronize_full(); No, scsi_complete_async_scans() here. There should be an async_synchronize_full() in sd_remove. > + > restart: > spin_lock_irqsave(shost->host_lock, flags); > list_for_each_entry(sdev, &shost->__devices, siblings) { > > > > (Which reminds me... Are the calls in wait_scan_init() really enough? > wait_for_device_probe() does async_synchronize_full() and then > scsi_complete_async_scans() finishes the SCSI scanning. But if this > scanning involves calling sd_probe(), then more async work will be > queued. Maybe a second call to wait_for_device_probe() is needed.) > There's still more; the patch above isn't sufficient. What happens if > the "device_add(&sdkp->dev)" call in sd_probe_async() fails? Then in > sd_remove(), sdkp will be NULL and &sdkp->dev will be meaningless. The > device_del() call will crash and the actual scsi_disk structure will be > leaked. This could be fixed by moving the dev_set_drvdata() call from > the end of sd_probe_async() back into sd_probe(), but then we'd find > sd_remove trying to unregister a device which was never successfully > registered. None of this really got reviewed through the SCSI list, so I'll let Arjan answer. > And why is it that the "out_free_index:" code in sd_probe() acquires > sd_index_lock but the corresponding code in sd_probe_async() doesn't? This one looks to be a mismerge between the async tree and the SCSI tree. James ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 18:56 ` James Bottomley @ 2009-05-26 19:48 ` Alan Stern 2009-05-26 20:35 ` James Bottomley 0 siblings, 1 reply; 9+ messages in thread From: Alan Stern @ 2009-05-26 19:48 UTC (permalink / raw) To: James Bottomley Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 26 May 2009, James Bottomley wrote: > Details?... In theory the sd driver can be attached at any time and > nothing should be relying on it being there when the inquiry scan > finishes, so if there's a bug it would be exposed by async scanning, not > really caused by it. Provided the async scanning is implemented correctly... > > (Which reminds me... Are the calls in wait_scan_init() really enough? > > wait_for_device_probe() does async_synchronize_full() and then > > scsi_complete_async_scans() finishes the SCSI scanning. But if this > > scanning involves calling sd_probe(), then more async work will be > > queued. Maybe a second call to wait_for_device_probe() is needed.) You didn't respond to this point. > None of this really got reviewed through the SCSI list, so I'll let > Arjan answer. > > > And why is it that the "out_free_index:" code in sd_probe() acquires > > sd_index_lock but the corresponding code in sd_probe_async() doesn't? > > This one looks to be a mismerge between the async tree and the SCSI > tree. Well then, how does this patch look? Alan Stern Index: usb-2.6/drivers/scsi/sd.c =================================================================== --- usb-2.6.orig/drivers/scsi/sd.c +++ usb-2.6/drivers/scsi/sd.c @@ -1948,7 +1948,6 @@ static void sd_probe_async(void *data, a if (sdp->removable) gd->flags |= GENHD_FL_REMOVABLE; - dev_set_drvdata(dev, sdkp); add_disk(gd); sd_dif_config_host(sdkp); @@ -1958,7 +1957,9 @@ static void sd_probe_async(void *data, a return; out_free_index: + spin_lock(&sd_index_lock); ida_remove(&sd_index_ida, index); + spin_unlock(&sd_index_lock); } /** @@ -2019,6 +2020,7 @@ static int sd_probe(struct device *dev) if (error) goto out_free_index; + dev_set_drvdata(dev, sdkp); sdkp->device = sdp; sdkp->driver = &sd_template; sdkp->disk = gd; @@ -2057,8 +2059,13 @@ static int sd_remove(struct device *dev) { struct scsi_disk *sdkp = dev_get_drvdata(dev); - device_del(&sdkp->dev); - del_gendisk(sdkp->disk); + /* Wait for sd_probe_async to finish */ + async_synchronize_full(); + + if (device_is_registered(&sdkp->dev)) { + device_del(&sdkp->dev); + del_gendisk(sdkp->disk); + } sd_shutdown(dev); mutex_lock(&sd_ref_mutex); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 19:48 ` Alan Stern @ 2009-05-26 20:35 ` James Bottomley 2009-05-26 21:04 ` Alan Stern 0 siblings, 1 reply; 9+ messages in thread From: James Bottomley @ 2009-05-26 20:35 UTC (permalink / raw) To: Alan Stern Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 2009-05-26 at 15:48 -0400, Alan Stern wrote: > On Tue, 26 May 2009, James Bottomley wrote: > > > Details?... In theory the sd driver can be attached at any time and > > nothing should be relying on it being there when the inquiry scan > > finishes, so if there's a bug it would be exposed by async scanning, not > > really caused by it. > > Provided the async scanning is implemented correctly... > > > > (Which reminds me... Are the calls in wait_scan_init() really enough? > > > wait_for_device_probe() does async_synchronize_full() and then > > > scsi_complete_async_scans() finishes the SCSI scanning. But if this > > > scanning involves calling sd_probe(), then more async work will be > > > queued. Maybe a second call to wait_for_device_probe() is needed.) > > You didn't respond to this point. Well this was the response: > > None of this really got reviewed through the SCSI list, so I'll let > > Arjan answer. > > > > > And why is it that the "out_free_index:" code in sd_probe() acquires > > > sd_index_lock but the corresponding code in sd_probe_async() doesn't? > > > > This one looks to be a mismerge between the async tree and the SCSI > > tree. > > Well then, how does this patch look? Well, it's adding complexity, the best fix is to let async only take care of the pieces which can't fail, that way we don't need complex error handling. The piece that slows probing isn't really the sysfs appearances, it's the SCSI probing, and the last piece that needs error handling is the device_add() for sysfs visibility, so that should be the dividing line between sync and async. This should restore the logical flow and fix all the error leg problems (by eliminating the error legs). James --- diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index bcf3bd4..d74315d 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1902,24 +1902,6 @@ static void sd_probe_async(void *data, async_cookie_t cookie) index = sdkp->index; dev = &sdp->sdev_gendev; - if (!sdp->request_queue->rq_timeout) { - if (sdp->type != TYPE_MOD) - blk_queue_rq_timeout(sdp->request_queue, SD_TIMEOUT); - else - blk_queue_rq_timeout(sdp->request_queue, - SD_MOD_TIMEOUT); - } - - device_initialize(&sdkp->dev); - sdkp->dev.parent = &sdp->sdev_gendev; - sdkp->dev.class = &sd_disk_class; - dev_set_name(&sdkp->dev, dev_name(&sdp->sdev_gendev)); - - if (device_add(&sdkp->dev)) - goto out_free_index; - - get_device(&sdp->sdev_gendev); - if (index < SD_MAX_DISKS) { gd->major = sd_major((index & 0xf0) >> 4); gd->first_minor = ((index & 0xf) << 4) | (index & 0xfff00); @@ -1954,11 +1936,6 @@ static void sd_probe_async(void *data, async_cookie_t cookie) sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n", sdp->removable ? "removable " : ""); - - return; - - out_free_index: - ida_remove(&sd_index_ida, index); } /** @@ -2026,6 +2003,24 @@ static int sd_probe(struct device *dev) sdkp->openers = 0; sdkp->previous_state = 1; + if (!sdp->request_queue->rq_timeout) { + if (sdp->type != TYPE_MOD) + blk_queue_rq_timeout(sdp->request_queue, SD_TIMEOUT); + else + blk_queue_rq_timeout(sdp->request_queue, + SD_MOD_TIMEOUT); + } + + device_initialize(&sdkp->dev); + sdkp->dev.parent = &sdp->sdev_gendev; + sdkp->dev.class = &sd_disk_class; + dev_set_name(&sdkp->dev, dev_name(&sdp->sdev_gendev)); + + if (device_add(&sdkp->dev)) + goto out_free_index; + + get_device(&sdp->sdev_gendev); + async_schedule(sd_probe_async, sdkp); return 0; @@ -2057,6 +2052,7 @@ static int sd_remove(struct device *dev) { struct scsi_disk *sdkp = dev_get_drvdata(dev); + async_synchronize_full(); device_del(&sdkp->dev); del_gendisk(sdkp->disk); sd_shutdown(dev); ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 20:35 ` James Bottomley @ 2009-05-26 21:04 ` Alan Stern 2009-05-26 21:23 ` James Bottomley 0 siblings, 1 reply; 9+ messages in thread From: Alan Stern @ 2009-05-26 21:04 UTC (permalink / raw) To: James Bottomley Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 26 May 2009, James Bottomley wrote: > > > > (Which reminds me... Are the calls in wait_scan_init() really enough? > > > > wait_for_device_probe() does async_synchronize_full() and then > > > > scsi_complete_async_scans() finishes the SCSI scanning. But if this > > > > scanning involves calling sd_probe(), then more async work will be > > > > queued. Maybe a second call to wait_for_device_probe() is needed.) > > > > You didn't respond to this point. > > Well this was the response: > > > > None of this really got reviewed through the SCSI list, so I'll let > > > Arjan answer. Whoops, for some reason I had gotten the idea that you wrote wait_scan_init(). > > Well then, how does this patch look? > > Well, it's adding complexity, the best fix is to let async only take > care of the pieces which can't fail, that way we don't need complex > error handling. The piece that slows probing isn't really the sysfs > appearances, it's the SCSI probing, and the last piece that needs error > handling is the device_add() for sysfs visibility, so that should be the > dividing line between sync and async. I had exactly the same thought, but it seemed less intrusive to keep the dividing line where Arjan put it. > This should restore the logical flow and fix all the error leg problems > (by eliminating the error legs). You forgot to move the dev_set_drvdata() call into the synchronous part. Apart from that it looks fine. Should I submit it officially? Also, I do still think that wait_scan_init() needs an extra call to async_synchronize_full() at the end. Alan Stern ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 21:04 ` Alan Stern @ 2009-05-26 21:23 ` James Bottomley 2009-05-26 21:52 ` Alan Stern 0 siblings, 1 reply; 9+ messages in thread From: James Bottomley @ 2009-05-26 21:23 UTC (permalink / raw) To: Alan Stern Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 2009-05-26 at 17:04 -0400, Alan Stern wrote: > On Tue, 26 May 2009, James Bottomley wrote: > > > > > > (Which reminds me... Are the calls in wait_scan_init() really enough? > > > > > wait_for_device_probe() does async_synchronize_full() and then > > > > > scsi_complete_async_scans() finishes the SCSI scanning. But if this > > > > > scanning involves calling sd_probe(), then more async work will be > > > > > queued. Maybe a second call to wait_for_device_probe() is needed.) > > > > > > You didn't respond to this point. > > > > Well this was the response: > > > > > > None of this really got reviewed through the SCSI list, so I'll let > > > > Arjan answer. > > Whoops, for some reason I had gotten the idea that you wrote > wait_scan_init(). > > > > Well then, how does this patch look? > > > > Well, it's adding complexity, the best fix is to let async only take > > care of the pieces which can't fail, that way we don't need complex > > error handling. The piece that slows probing isn't really the sysfs > > appearances, it's the SCSI probing, and the last piece that needs error > > handling is the device_add() for sysfs visibility, so that should be the > > dividing line between sync and async. > > I had exactly the same thought, but it seemed less intrusive to keep > the dividing line where Arjan put it. Not really ... the problem where it is becomes one of error handling. It's better to complete all the inline error handling before becoming async ... given the code state, that's provably correct (assuming the original was). > > This should restore the logical flow and fix all the error leg problems > > (by eliminating the error legs). > > You forgot to move the dev_set_drvdata() call into the synchronous > part. Apart from that it looks fine. Should I submit it officially? Well, no ... I forgot to get the drive data after the async_synchronize. Since this is an sd bug fix, I'll take it through the SCSI trees. > Also, I do still think that wait_scan_init() needs an extra call to > async_synchronize_full() at the end. I need to understand it before commenting ... I'm slowly working my way through the async patches. James ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Bug in SCSI async probing 2009-05-26 21:23 ` James Bottomley @ 2009-05-26 21:52 ` Alan Stern 0 siblings, 0 replies; 9+ messages in thread From: Alan Stern @ 2009-05-26 21:52 UTC (permalink / raw) To: James Bottomley Cc: Arjan van de Ven, SCSI development list, Kernel development list On Tue, 26 May 2009, James Bottomley wrote: > > Also, I do still think that wait_scan_init() needs an extra call to > > async_synchronize_full() at the end. > > I need to understand it before commenting ... I'm slowly working my way > through the async patches. There are two lists of queued asynchronous tasks: Arjan's list and the SCSI scanning list. As things stand now, routines in each list are capable of submitting new jobs for the other list. That is, an async task may discover and register a new SCSI host (requiring scanning), and the probe of a SCSI disk discovered during scanning will submit a sd_probe_async task. This makes it difficult to wait for both lists to become idle. Maybe the whole idea behind scsi_wait_scan.c needs some rethinking... Alan Stern ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-05-26 21:52 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-05-26 15:22 Bug in SCSI async probing Alan Stern 2009-05-26 15:34 ` James Bottomley 2009-05-26 18:34 ` Alan Stern 2009-05-26 18:56 ` James Bottomley 2009-05-26 19:48 ` Alan Stern 2009-05-26 20:35 ` James Bottomley 2009-05-26 21:04 ` Alan Stern 2009-05-26 21:23 ` James Bottomley 2009-05-26 21:52 ` Alan Stern
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox