[PATCH] sym53_8xx_2: fixes two bugs related to chip reset

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
@ 2008-01-09 22:59 Krzysztof Helt
  2008-01-09 23:51 ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Krzysztof Helt @ 2008-01-09 22:59 UTC (permalink / raw)
  To: linux-scsi, James Bottomley; +Cc: matthew

From: Krzysztof Helt <krzysztof.h1@wp.pl>

This patch fixes two bugs pointed by James Bottomley:

 1. the if (!sym_data->io_reset).  That variable is only ever filled
    by a stack based completion.  If we find it non empty it means
    this code has been entered twice and we have a severe problem,
    so that should just become a BUG_ON(!sym_data->io_reset).
 2. sym_data->io_reset should be set to NULL before the routine is
    exited otherwise the PCI recovery code could end up completing
    what will be a bogus pointer into the stack.

Big thanks to James Bottomley for help with the patch.

Signed-off-by: Krzysztof Helt <krzysztof.h1@w.pl>

---
I do not know if I understood correctly all James' tips.

diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
--- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
+++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-09 22:22:30.000000000 +0100
@@ -609,22 +609,22 @@ static int sym_eh_handler(int op, char *
 	 */
 #define WAIT_FOR_PCI_RECOVERY	35
 	if (pci_channel_offline(pdev)) {
-		struct completion *io_reset;
 		int finished_reset = 0;
 		init_completion(&eh_done);
 		spin_lock_irq(shost->host_lock);
 		/* Make sure we didn't race */
 		if (pci_channel_offline(pdev)) {
-			if (!sym_data->io_reset)
-				sym_data->io_reset = &eh_done;
-			io_reset = sym_data->io_reset;
+			BUG_ON(!sym_data->io_reset);
+			sym_data->io_reset = &eh_done;
 		} else {
 			finished_reset = 1;
 		}
 		spin_unlock_irq(shost->host_lock);
 		if (!finished_reset)
-			finished_reset = wait_for_completion_timeout(io_reset,
+			finished_reset = wait_for_completion_timeout
+						(sym_data->io_reset,
 						WAIT_FOR_PCI_RECOVERY*HZ);
+		sym_data->io_reset = NULL;
 		if (!finished_reset)
 			return SCSI_FAILED;
 	}
@@ -1879,7 +1879,6 @@ static void sym2_io_resume(struct pci_de
 	spin_lock_irq(shost->host_lock);
 	if (sym_data->io_reset)
 		complete_all(sym_data->io_reset);
-	sym_data->io_reset = NULL;
 	spin_unlock_irq(shost->host_lock);
 }
 


----------------------------------------------------------------------
Sprawdz, ktore komorki sa najmodniejsze!
Kliknij >>> http://link.interia.pl/f1cd4


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-09 22:59 [PATCH] sym53_8xx_2: fixes two bugs related to chip reset Krzysztof Helt
@ 2008-01-09 23:51 ` James Bottomley
  2008-01-10 22:31   ` Krzysztof Helt
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2008-01-09 23:51 UTC (permalink / raw)
  To: Krzysztof Helt; +Cc: linux-scsi, matthew


On Wed, 2008-01-09 at 23:59 +0100, Krzysztof Helt wrote:
> From: Krzysztof Helt <krzysztof.h1@wp.pl>
> 
> This patch fixes two bugs pointed by James Bottomley:
> 
>  1. the if (!sym_data->io_reset).  That variable is only ever filled
>     by a stack based completion.  If we find it non empty it means
>     this code has been entered twice and we have a severe problem,
>     so that should just become a BUG_ON(!sym_data->io_reset).
>  2. sym_data->io_reset should be set to NULL before the routine is
>     exited otherwise the PCI recovery code could end up completing
>     what will be a bogus pointer into the stack.
> 
> Big thanks to James Bottomley for help with the patch.
> 
> Signed-off-by: Krzysztof Helt <krzysztof.h1@w.pl>

Well done .. there's actually just one problem remaining:

> ---
> I do not know if I understood correctly all James' tips.
> 
> diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
> --- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
> +++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-09 22:22:30.000000000 +0100
> @@ -609,22 +609,22 @@ static int sym_eh_handler(int op, char *
>  	 */
>  #define WAIT_FOR_PCI_RECOVERY	35
>  	if (pci_channel_offline(pdev)) {
> -		struct completion *io_reset;
>  		int finished_reset = 0;
>  		init_completion(&eh_done);
>  		spin_lock_irq(shost->host_lock);
>  		/* Make sure we didn't race */
>  		if (pci_channel_offline(pdev)) {
> -			if (!sym_data->io_reset)
> -				sym_data->io_reset = &eh_done;
> -			io_reset = sym_data->io_reset;
> +			BUG_ON(!sym_data->io_reset);
> +			sym_data->io_reset = &eh_done;
>  		} else {
>  			finished_reset = 1;
>  		}
>  		spin_unlock_irq(shost->host_lock);
>  		if (!finished_reset)
> -			finished_reset = wait_for_completion_timeout(io_reset,
> +			finished_reset = wait_for_completion_timeout
> +						(sym_data->io_reset,
>  						WAIT_FOR_PCI_RECOVERY*HZ);
> +		sym_data->io_reset = NULL;

This has to be cleared under the host_lock to forestall the (tiny) race
where the pci recovery code checks the value of sym_data->io_reset, we
change it to null and then the pci recovery code completes a NULL
pointer.

Other than this one problem, the code looks fine.

James



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-09 23:51 ` James Bottomley
@ 2008-01-10 22:31   ` Krzysztof Helt
  2008-01-10 22:47     ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Krzysztof Helt @ 2008-01-10 22:31 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi, matthew

On Wed, 09 Jan 2008 17:51:43 -0600
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> > diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
> > --- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
> > +++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-09 22:22:30.000000000 +0100
> > @@ -609,22 +609,22 @@ static int sym_eh_handler(int op, char *
> >  	 */
> >  #define WAIT_FOR_PCI_RECOVERY	35
> >  	if (pci_channel_offline(pdev)) {
> > -		struct completion *io_reset;
> >  		int finished_reset = 0;
> >  		init_completion(&eh_done);
> >  		spin_lock_irq(shost->host_lock);
> >  		/* Make sure we didn't race */
> >  		if (pci_channel_offline(pdev)) {
> > -			if (!sym_data->io_reset)
> > -				sym_data->io_reset = &eh_done;
> > -			io_reset = sym_data->io_reset;
> > +			BUG_ON(!sym_data->io_reset);
> > +			sym_data->io_reset = &eh_done;
> >  		} else {
> >  			finished_reset = 1;
> >  		}
> >  		spin_unlock_irq(shost->host_lock);
> >  		if (!finished_reset)
> > -			finished_reset = wait_for_completion_timeout(io_reset,
> > +			finished_reset = wait_for_completion_timeout
> > +						(sym_data->io_reset,
> >  						WAIT_FOR_PCI_RECOVERY*HZ);
> > +		sym_data->io_reset = NULL;
> 
> This has to be cleared under the host_lock to forestall the (tiny) race
> where the pci recovery code checks the value of sym_data->io_reset, we
> change it to null and then the pci recovery code completes a NULL
> pointer.
> 

It is impossible as the io_reset value is not NULL before and during wait
completion. The case above would happen only if one thread checked the
io_reset value (under lock) and it was NULL and before setting it (inside
locked section) another thread checked the io_reset value (still NULL 
and also inside locked section = impossible). Otherwise, the BUG_ON() 
kicked in (the value is already not NULL).

Another case is if you consider changing the io_reset value after the
locked section but before wait_for_completion_timeout(). In this case,
putting spinlock around the io_reset clearing does not change anything.
There is still a chance of race if the io_reset is cleared after one thread
leaves the locked section then another one clearing the io_reset (under lock)
then completion by the first thread happens with NULL pointer.

Am I right? I understand that you asked for change like this:
+  		spin_lock_irq(shost->host_lock);
+		sym_data->io_reset = NULL;
+  		spin_unlock_irq(shost->host_lock);

Maybe, better solution is to clear the io_reset field inside the 
sym2_io_resume() (as it was done)? It would be cleared only
after the completion and before it is cleared the BUG_ON() guards
against race. The 2nd race case described above is impossible 
in such solution.

However, another version of the patch is needed as the BUG_ON condition
should be BUG_ON(sym_data->io_reset) and not 
BUG_ON(!sym_data->io_reset).

So, I wait for your opinion to the lock issue. If you still consider version with
the change presented above better, I'll add it.

Kind regards,
Krzysztof

----------------------------------------------------------------------
Rozdajemy nagrody! 

Sprawdz >> http://link.interia.pl/f1cbf

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-10 22:31   ` Krzysztof Helt
@ 2008-01-10 22:47     ` James Bottomley
  2008-01-11 20:50       ` Krzysztof Helt
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2008-01-10 22:47 UTC (permalink / raw)
  To: Krzysztof Helt; +Cc: linux-scsi, matthew


On Thu, 2008-01-10 at 23:31 +0100, Krzysztof Helt wrote:
> On Wed, 09 Jan 2008 17:51:43 -0600
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > > diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
> > > --- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
> > > +++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-09 22:22:30.000000000 +0100
> > > @@ -609,22 +609,22 @@ static int sym_eh_handler(int op, char *
> > >  	 */
> > >  #define WAIT_FOR_PCI_RECOVERY	35
> > >  	if (pci_channel_offline(pdev)) {
> > > -		struct completion *io_reset;
> > >  		int finished_reset = 0;
> > >  		init_completion(&eh_done);
> > >  		spin_lock_irq(shost->host_lock);
> > >  		/* Make sure we didn't race */
> > >  		if (pci_channel_offline(pdev)) {
> > > -			if (!sym_data->io_reset)
> > > -				sym_data->io_reset = &eh_done;
> > > -			io_reset = sym_data->io_reset;
> > > +			BUG_ON(!sym_data->io_reset);
> > > +			sym_data->io_reset = &eh_done;
> > >  		} else {
> > >  			finished_reset = 1;
> > >  		}
> > >  		spin_unlock_irq(shost->host_lock);
> > >  		if (!finished_reset)
> > > -			finished_reset = wait_for_completion_timeout(io_reset,
> > > +			finished_reset = wait_for_completion_timeout
> > > +						(sym_data->io_reset,
> > >  						WAIT_FOR_PCI_RECOVERY*HZ);
> > > +		sym_data->io_reset = NULL;
> > 
> > This has to be cleared under the host_lock to forestall the (tiny) race
> > where the pci recovery code checks the value of sym_data->io_reset, we
> > change it to null and then the pci recovery code completes a NULL
> > pointer.
> > 
> 
> It is impossible as the io_reset value is not NULL before and during wait
> completion. The case above would happen only if one thread checked the
> io_reset value (under lock) and it was NULL and before setting it (inside
> locked section) another thread checked the io_reset value (still NULL 
> and also inside locked section = impossible). Otherwise, the BUG_ON() 
> kicked in (the value is already not NULL).

Er, no.  Let me be clearer about the sequence

sym2_io_resume() is racing to complete sym_data->io_reset with
sym_eh_handler()s timeout.  Because you don't set it to NULL under the
host_lock, you can get the sequence

sym2_io_resume() acquires the host lock and checks the value of
sym_data->io_reset and finds it to be not NULL.

sym_eh_handler() NULLs sym_data->io_resume *without* acquiring the host
lock. probably because the wait__for_completion_timeout() times out.

sym2_io_resume() calls complete() on sym_data->io_resume which is now a
NULL pointer and boom.

We never get multiple threads into sym_eh_handler() because it's single
threaded from the error handler thread.

James



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-10 22:47     ` James Bottomley
@ 2008-01-11 20:50       ` Krzysztof Helt
  2008-01-11 20:52         ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Krzysztof Helt @ 2008-01-11 20:50 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi, matthew

From: Krzysztof Helt <krzysztof.h1@wp.pl>

This patch fixes two bugs pointed by James Bottomley:

 1. the if (!sym_data->io_reset).  That variable is only ever filled
    by a stack based completion.  If we find it non empty it means
    this code has been entered twice and we have a severe problem,
    so that should just become a BUG_ON(!sym_data->io_reset).
 2. sym_data->io_reset should be set to NULL before the routine is
    exited otherwise the PCI recovery code could end up completing
    what will be a bogus pointer into the stack.

Big thanks to James Bottomley for help with the patch.

Signed-off-by: Krzysztof Helt <krzysztof.h1@w.pl>

---
This is the second version with clearing io_reset value
under lock. I haven't got that it was a race between resume
and handler functions.

diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
--- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
+++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-10 21:26:32.000000000 +0100
@@ -609,22 +609,24 @@ static int sym_eh_handler(int op, char *
 	 */
 #define WAIT_FOR_PCI_RECOVERY	35
 	if (pci_channel_offline(pdev)) {
-		struct completion *io_reset;
 		int finished_reset = 0;
 		init_completion(&eh_done);
 		spin_lock_irq(shost->host_lock);
 		/* Make sure we didn't race */
 		if (pci_channel_offline(pdev)) {
-			if (!sym_data->io_reset)
-				sym_data->io_reset = &eh_done;
-			io_reset = sym_data->io_reset;
+			BUG_ON(!sym_data->io_reset);
+			sym_data->io_reset = &eh_done;
 		} else {
 			finished_reset = 1;
 		}
 		spin_unlock_irq(shost->host_lock);
 		if (!finished_reset)
-			finished_reset = wait_for_completion_timeout(io_reset,
+			finished_reset = wait_for_completion_timeout
+						(sym_data->io_reset,
 						WAIT_FOR_PCI_RECOVERY*HZ);
+		spin_lock_irq(shost->host_lock);
+		sym_data->io_reset = NULL;
+		spin_unlock_irq(shost->host_lock);
 		if (!finished_reset)
 			return SCSI_FAILED;
 	}
@@ -1879,7 +1881,6 @@ static void sym2_io_resume(struct pci_de
 	spin_lock_irq(shost->host_lock);
 	if (sym_data->io_reset)
 		complete_all(sym_data->io_reset);
-	sym_data->io_reset = NULL;
 	spin_unlock_irq(shost->host_lock);
 }
 

----------------------------------------------------------------------
Rozdajemy nagrody! 

Sprawdz >> http://link.interia.pl/f1cbf



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-11 20:50       ` Krzysztof Helt
@ 2008-01-11 20:52         ` Matthew Wilcox
  2008-01-12  4:11           ` Krzysztof Helt
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2008-01-11 20:52 UTC (permalink / raw)
  To: Krzysztof Helt; +Cc: James Bottomley, linux-scsi

On Fri, Jan 11, 2008 at 09:50:46PM +0100, Krzysztof Helt wrote:
> +			BUG_ON(!sym_data->io_reset);
> +			sym_data->io_reset = &eh_done;

Isn't that BUG_ON still the wrong sense?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sym53_8xx_2: fixes two bugs related to chip reset
  2008-01-11 20:52         ` Matthew Wilcox
@ 2008-01-12  4:11           ` Krzysztof Helt
  0 siblings, 0 replies; 7+ messages in thread
From: Krzysztof Helt @ 2008-01-12  4:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: James Bottomley, linux-scsi

On Fri, 11 Jan 2008 13:52:29 -0700
Matthew Wilcox <matthew@wil.cx> wrote:

> On Fri, Jan 11, 2008 at 09:50:46PM +0100, Krzysztof Helt wrote:
> > +			BUG_ON(!sym_data->io_reset);
> > +			sym_data->io_reset = &eh_done;
> 
> Isn't that BUG_ON still the wrong sense?
> 

Unfortunately yes. 

Here is the 3rd version of the patch. Thank you for your patience.

---

From: Krzysztof Helt <krzysztof.h1@wp.pl>

This patch fixes two bugs pointed by James Bottomley:

 1. the if (!sym_data->io_reset).  That variable is only ever filled
    by a stack based completion.  If we find it non empty it means
    this code has been entered twice and we have a severe problem,
    so that should just become a BUG_ON(!sym_data->io_reset).
 2. sym_data->io_reset should be set to NULL before the routine is
    exited otherwise the PCI recovery code could end up completing
    what will be a bogus pointer into the stack.

Big thanks to James Bottomley for help with the patch.

Signed-off-by: Krzysztof Helt <krzysztof.h1@w.pl>

---

diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c
--- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-12-23 20:39:44.000000000 +0100
+++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c	2008-01-10 21:26:32.000000000 +0100
@@ -609,22 +609,24 @@ static int sym_eh_handler(int op, char *
 	 */
 #define WAIT_FOR_PCI_RECOVERY	35
 	if (pci_channel_offline(pdev)) {
-		struct completion *io_reset;
 		int finished_reset = 0;
 		init_completion(&eh_done);
 		spin_lock_irq(shost->host_lock);
 		/* Make sure we didn't race */
 		if (pci_channel_offline(pdev)) {
-			if (!sym_data->io_reset)
-				sym_data->io_reset = &eh_done;
-			io_reset = sym_data->io_reset;
+			BUG_ON(sym_data->io_reset);
+			sym_data->io_reset = &eh_done;
 		} else {
 			finished_reset = 1;
 		}
 		spin_unlock_irq(shost->host_lock);
 		if (!finished_reset)
-			finished_reset = wait_for_completion_timeout(io_reset,
+			finished_reset = wait_for_completion_timeout
+						(sym_data->io_reset,
 						WAIT_FOR_PCI_RECOVERY*HZ);
+		spin_lock_irq(shost->host_lock);
+		sym_data->io_reset = NULL;
+		spin_unlock_irq(shost->host_lock);
 		if (!finished_reset)
 			return SCSI_FAILED;
 	}
@@ -1879,7 +1881,6 @@ static void sym2_io_resume(struct pci_de
 	spin_lock_irq(shost->host_lock);
 	if (sym_data->io_reset)
 		complete_all(sym_data->io_reset);
-	sym_data->io_reset = NULL;
 	spin_unlock_irq(shost->host_lock);
 }
 

----------------------------------------------------------------------
Rozdajemy nagrody! 

Sprawdz >> http://link.interia.pl/f1cbf



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-01-12  4:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-09 22:59 [PATCH] sym53_8xx_2: fixes two bugs related to chip reset Krzysztof Helt
2008-01-09 23:51 ` James Bottomley
2008-01-10 22:31   ` Krzysztof Helt
2008-01-10 22:47     ` James Bottomley
2008-01-11 20:50       ` Krzysztof Helt
2008-01-11 20:52         ` Matthew Wilcox
2008-01-12  4:11           ` Krzysztof Helt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox