Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-10  8:15 Jean-Baptiste Vignaud
  2007-08-10  8:37 ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-10  8:15 UTC (permalink / raw)
  To: jarkao2
  Cc: marcin.slusarz, mingo, tglx, torvalds, linux-kernel, shemminger,
	linux-net, netdev, akpm, alan

> So, we still have to wait for the exact explanation...
> 
> Thanks very much Marcin!
> 
> I think, there is this one possible for your testing yet?:
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
> 
> If it's not a great problem it would be interesting to try this with
> different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> tested very similar thing in 2.6.23-rc2 with 1000(?) already).
> 
> Jean-Baptiste: you can skip/break testing of this 'experimental'

ok

I was still testing on -rc2:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

For me after 1day 20hours, the network is still up, with more than 1To 
of network traffic. HZ was 1000, i restart with HZ=100.

Jb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  8:15 [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
@ 2007-08-10  8:37 ` Jarek Poplawski
  2007-08-10  8:48   ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-10  8:37 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: marcin.slusarz, mingo, tglx, torvalds, linux-kernel, shemminger,
	linux-net, netdev, akpm, alan

On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
...
> I was still testing on -rc2:
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
> 
> For me after 1day 20hours, the network is still up, with more than 1To
> of network traffic. HZ was 1000, i restart with HZ=100.

For me it's enough too but Thomas seems to doubt.

You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND
prepared too. So, if this is not a great problem maybe you could try
this first. Tomorrow Thomas may send something, so this 100HZ could
wait yet, I hope?

Many thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  8:37 ` Jarek Poplawski
@ 2007-08-10  8:48   ` Ingo Molnar
  2007-08-10  9:03     ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2007-08-10  8:48 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Jean-Baptiste Vignaud, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
> ...
> > I was still testing on -rc2:
> > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > Date: Wed, 8 Aug 2007 13:00:37 +0200
> > 
> > For me after 1day 20hours, the network is still up, with more than 
> > 1To of network traffic. HZ was 1000, i restart with HZ=100.
> 
> For me it's enough too but Thomas seems to doubt.

seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
and we never doubted that. I think you might have misunderstood what 
Thomas said and meant, so please just state your opinion unambiguously 
so that we can fix any mis-communication :)

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  8:48   ` Ingo Molnar
@ 2007-08-10  9:03     ` Jarek Poplawski
  2007-08-10  9:08       ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-10  9:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean-Baptiste Vignaud, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan

On Fri, Aug 10, 2007 at 10:48:41AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
> > ...
> > > I was still testing on -rc2:
> > > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > > Date: Wed, 8 Aug 2007 13:00:37 +0200
> > > 
> > > For me after 1day 20hours, the network is still up, with more than 
> > > 1To of network traffic. HZ was 1000, i restart with HZ=100.
> > 
> > For me it's enough too but Thomas seems to doubt.
> 
> seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
> and we never doubted that. I think you might have misunderstood what 
> Thomas said and meant, so please just state your opinion unambiguously 
> so that we can fix any mis-communication :)
> 
> 	Ingo
> 


On 25-07-2007 02:19, Thomas Gleixner wrote:
...
> Actually we only need the resend for edge type interrupts. Level type
> interrupts come back once enable_irq() re-enables the interrupt line.
>

On 10-08-2007 10:05, Thomas Gleixner wrote:
...
> But suppressing the resend is not fixing the driver problem. The problem
> can show up with spurious interrupts and with interrupts on a shared PCI
> interrupt line at any time. It just might take weeks instead of minutes.

Maybe I miss something but it's not the same!

So, should Jean-Baptiste or Marcin test this for weeks or it's enough?

Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  9:03     ` Jarek Poplawski
@ 2007-08-10  9:08       ` Ingo Molnar
  2007-08-10  9:19         ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2007-08-10  9:08 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Jean-Baptiste Vignaud, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> On 10-08-2007 10:05, Thomas Gleixner wrote:
> ...
> > But suppressing the resend is not fixing the driver problem. The 
> > problem can show up with spurious interrupts and with interrupts on 
> > a shared PCI interrupt line at any time. It just might take weeks 
> > instead of minutes.
> 
> Maybe I miss something but it's not the same!

_now_ i finally understand what you probably meant: because sw-resend 
worked and hw-resend didnt, it's hw-resend that is causing the breakage, 
not any driver or irqflow bug - correct?

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  9:08       ` Ingo Molnar
@ 2007-08-10  9:19         ` Jarek Poplawski
  2007-08-10  9:38           ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-10  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jean-Baptiste Vignaud, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan

On Fri, Aug 10, 2007 at 11:08:33AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > On 10-08-2007 10:05, Thomas Gleixner wrote:
> > ...
> > > But suppressing the resend is not fixing the driver problem. The 
> > > problem can show up with spurious interrupts and with interrupts on 
> > > a shared PCI interrupt line at any time. It just might take weeks 
> > > instead of minutes.
> > 
> > Maybe I miss something but it's not the same!
> 
> _now_ i finally understand what you probably meant: because sw-resend 
> worked and hw-resend didnt, it's hw-resend that is causing the breakage, 
> not any driver or irqflow bug - correct?

All correct! There was also checked a possibility it can be not
hw itself, but wrong way of handling after hw (acking too late). This
was false idea (or bad implementation), so it looks like hw vs lapic
problem.

Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  9:19         ` Jarek Poplawski
@ 2007-08-10  9:38           ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2007-08-10  9:38 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Jean-Baptiste Vignaud, marcin.slusarz, tglx, torvalds,
	linux-kernel, shemminger, linux-net, netdev, akpm, alan

* Jarek Poplawski <jarkao2@o2.pl> wrote:

> All correct! There was also checked a possibility it can be not hw 
> itself, but wrong way of handling after hw (acking too late). This was 
> false idea (or bad implementation), so it looks like hw vs lapic 
> problem.

i think the problem is that local APIC 'self vectors' might be 
edge-triggered by default. I'm not exactly sure whether passing in 
APIC_INT_LEVELTRIG to send_IPI_self() will truly be interpreted by the 
local APIC into any external IO-APIC ACK sequence (the local APIC might 
just treat self-vectors as always-edge) - and it might also be that the 
pure act of mixing self-triggered vectors with level-triggered external 
irqs sometimes confuses the IO-APIC <-> local-APIC messaging. One more 
test of the patch below will tell us a bit more about this part of the 
story.

	Ingo

Index: linux/arch/i386/kernel/io_apic.c
===================================================================
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -735,7 +735,8 @@ void fastcall send_IPI_self(int vector)
 	 * Wait for idle.
 	 */
 	apic_wait_icr_idle();
-	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
+	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL |
+		APIC_INT_LEVELTRIG;
 	/*
 	 * Send the IPI. The write to APIC_ICR fires this off.
 	 */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-08-10  8:41 Jean-Baptiste Vignaud
  0 siblings, 0 replies; 12+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-08-10  8:41 UTC (permalink / raw)
  To: jarkao2
  Cc: marcin.slusarz, mingo, tglx, torvalds, linux-kernel, shemminger,
	linux-net, netdev, akpm, alan

> For me it's enough too but Thomas seems to doubt.
> 
> You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND
> prepared too. So, if this is not a great problem maybe you could try
> this first. Tomorrow Thomas may send something, so this 100HZ could
> wait yet, I hope?

Ok, i'll test 2.6.23-rc1 with HARDIRQS_SW_RESEND first.

Jb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-07-31 13:20 Jarek Poplawski
  2007-08-06  7:00 ` Marcin Ślusarz
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-07-31 13:20 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Mon, Jul 30, 2007 at 09:29:38AM +0200, Marcin Ślusarz wrote:
...
> ps: I retested all patches posted in this thread on top of 2.6.22.1
> and behavior from 2.6.21.3 didn't changed. My next tests will be on
> 2.6.22.x only.

Marcin,

I see you're quite busy, but if after testing this next Ingo's patch
you are alive yet, maybe you could try one more "idea"? No patch this
time, but if you could try this after adding boot option "noirqdebug"
(I'd like to be sure it's not about timinig after all).

Cheers & thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-31 13:20 Jarek Poplawski
@ 2007-08-06  7:00 ` Marcin Ślusarz
  2007-08-06  7:03   ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  7:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> Marcin,
>
> I see you're quite busy, but if after testing this next Ingo's patch
> you are alive yet, maybe you could try one more "idea"? No patch this
> time, but if you could try this after adding boot option "noirqdebug"
> (I'd like to be sure it's not about timinig after all).
It didn't change anything. Network card still timed out.

Marcin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:00 ` Marcin Ślusarz
@ 2007-08-06  7:03   ` Ingo Molnar
  2007-08-07  7:46     ` Marcin Ślusarz
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2007-08-06  7:03 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> 2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> > Marcin,
> >
> > I see you're quite busy, but if after testing this next Ingo's patch
> > you are alive yet, maybe you could try one more "idea"? No patch this
> > time, but if you could try this after adding boot option "noirqdebug"
> > (I'd like to be sure it's not about timinig after all).
> It didn't change anything. Network card still timed out.

please try Jarek's second patch too - there was a missing unmask.

	Ingo

-------------->
Subject: genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski <jarkao2@o2.pl>

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need this retriggers too much I think
this "don't mask interrupts by default" idea should be rethinked:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;

 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);

-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;

 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str

 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:03   ` Ingo Molnar
@ 2007-08-07  7:46     ` Marcin Ślusarz
  2007-08-07  8:23       ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Marcin Ślusarz @ 2007-08-07  7:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/6, Ingo Molnar <mingo@elte.hu>:
> (..)
> please try Jarek's second patch too - there was a missing unmask.
>
>         Ingo
>
> -------------->
> Subject: genirq: fix simple and fasteoi irq handlers
> From: Jarek Poplawski <jarkao2@o2.pl>
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
>
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
>
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
>
> PS 2:
> After rethinking, IMHO, there are two most probable scenarios here:
>
> 1. After hw resend there could be a conflict between retriggered
> edge type irq and the next level type one: e.g. if this level type
> irq (io_apic is enabled then) is triggered while retriggered irq is
> serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
> the next such levels are triggered and looping, so probably kind of
> flood in io_apic until this retriggered edge service has ended.
> 2. There is something wrong with ioapic_retrigger_irq (less probable
> because this should be probably seen with 'normal' edge retriggers,
> but on the other hand, they could be less common).
>
> So, if there is #1, this fixed patch should work.
>
> But, since level types don't need this retriggers too much I think
> this "don't mask interrupts by default" idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
>
> So, IMHO, there should be at least possibility to turn this off for
> level types in config (it should be a visible option, so people could
> find & try this before writing for help or changing a network card).
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
>
> ---
>
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-05 21:49:46.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out_unlock;
>         kstat_cpu(cpu).irqs[irq]++;
>
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
>                 desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out_unlock:
>         spin_unlock(&desc->lock);
>  }
> @@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out;
> -
>         desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
>         kstat_cpu(cpu).irqs[irq]++;
>
>         /*
> -        * If its disabled or no action available
> +        * If it's running, disabled or no action available
>          * then mask it and get out of here:
>          */
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 desc->status |= IRQ_PENDING;
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
> @@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out:
>         desc->chip->eoi(irq);
>
>
Network card still locks up (tested on 2.6.22.1). I had to upload more
data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
might be a coincidence...

Marcin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  7:46     ` Marcin Ślusarz
@ 2007-08-07  8:23       ` Jarek Poplawski
       [not found]         ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-07  8:23 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> 2007/8/6, Ingo Molnar <mingo@elte.hu>:
> > (..)
> > please try Jarek's second patch too - there was a missing unmask.
> >
> >         Ingo
> >
> > -------------->
> > Subject: genirq: fix simple and fasteoi irq handlers
> > From: Jarek Poplawski <jarkao2@o2.pl>
...
> Network card still locks up (tested on 2.6.22.1). I had to upload more
> data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> might be a coincidence...

Thanks! It's a good news after all - it would be really strange why
this place doesn't hit more people (it seems there is some safety
elsewhere for this).

BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
(with a warning), had no problems with a similar load?

So, once more, I would suspect hw retrigger code. Ingo, IMHO, this
patch for testing HARDIRQS_SW_RESEND could be reworked, so that
desc->chip->retrigger() is done only for eadges and the tasklet only
for levels. BTW, I think this current warning in the "temporary" is
is too early - we don't know if after this the ->retrigger() will
take place.

Regards,
Jarek P.

PS: Marcin, if you need a break in this testing let us know!
I think the main idea of this bug is known enough.

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>]

* Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]         ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
@ 2007-08-07  9:52           ` Jarek Poplawski
  2007-08-07 12:13             ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-07  9:52 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > might be a coincidence...
> >
> > Thanks! It's a good news after all - it would be really strange why
> > this place doesn't hit more people (it seems there is some safety
> > elsewhere for this).
> >
> > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > (with a warning), had no problems with a similar load?
> I always tested on 500-600 MB "dataset"
> 
> > PS: Marcin, if you need a break in this testing let us know!
> No, i don't need a break. I'll have more time in next weeks.

Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
if Ingo doesn't prepare something for you.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  9:52           ` Jarek Poplawski
@ 2007-08-07 12:13             ` Jarek Poplawski
  2007-08-08 11:09               ` Marcin Ślusarz
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-07 12:13 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > > might be a coincidence...
> > >
> > > Thanks! It's a good news after all - it would be really strange why
> > > this place doesn't hit more people (it seems there is some safety
> > > elsewhere for this).
> > >
> > > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > > (with a warning), had no problems with a similar load?
> > I always tested on 500-600 MB "dataset"
> > 
> > > PS: Marcin, if you need a break in this testing let us know!
> > No, i don't need a break. I'll have more time in next weeks.
> 
> Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> if Ingo doesn't prepare something for you.

So, the let's try this idea yet: modified Ingo's "x86: activate
HARDIRQS_SW_RESEND" patch.
(Don't forget about make oldconfig before make.)
For testing only.

Cheers,
Jarek P.

PS: alas there was not even time for "compile checking"...

---

diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
--- 2.6.22.1-/arch/i386/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/i386/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
--- 2.6.22.1-/arch/x86_64/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/x86_64/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
--- 2.6.22.1-/kernel/irq/manage.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/manage.c	2007-08-07 13:13:03.000000000 +0200
@@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 
diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
--- 2.6.22.1-/kernel/irq/resend.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/resend.c	2007-08-07 13:57:54.000000000 +0200
@@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
-		if (!desc->chip || !desc->chip->retrigger ||
-					!desc->chip->retrigger(irq)) {
+		if (desc->handle_irq == handle_edge_irq) {
+			if (desc->chip->retrigger)
+				desc->chip->retrigger(irq);
+			return;
+		}
 #ifdef CONFIG_HARDIRQS_SW_RESEND
-			/* Set it pending and activate the softirq: */
-			set_bit(irq, irqs_resend);
-			tasklet_schedule(&resend_tasklet);
+		WARN_ON_ONCE(1);
+		/* Set it pending and activate the softirq: */
+		set_bit(irq, irqs_resend);
+		tasklet_schedule(&resend_tasklet);
 #endif
-		}
 	}
 }

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:13             ` Jarek Poplawski
@ 2007-08-08 11:09               ` Marcin Ślusarz
  2007-08-08 11:42                 ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Marcin Ślusarz @ 2007-08-08 11:09 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> So, the let's try this idea yet: modified Ingo's "x86: activate
> HARDIRQS_SW_RESEND" patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
>
> Cheers,
> Jarek P.
>
> PS: alas there was not even time for "compile checking"...
>
> ---
>
> diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  config X86_SMP
>         bool
>         depends on SMP && !X86_VOYAGER
> diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
>                 desc->depth--;
>         }
>         spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +       /*
> +        * Do a bh disable/enable pair to trigger any pending
> +        * irq resend logic:
> +        */
> +       local_bh_disable();
> +       local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);
>
> diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
>          */
>         desc->chip->enable(irq);
>
> +       /*
> +        * Temporary hack to figure out more about the problem, which
> +        * is causing the ancient network cards to die.
> +        */
> +
>         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
>                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
>
> -               if (!desc->chip || !desc->chip->retrigger ||
> -                                       !desc->chip->retrigger(irq)) {
> +               if (desc->handle_irq == handle_edge_irq) {
> +                       if (desc->chip->retrigger)
> +                               desc->chip->retrigger(irq);
> +                       return;
> +               }
>  #ifdef CONFIG_HARDIRQS_SW_RESEND
> -                       /* Set it pending and activate the softirq: */
> -                       set_bit(irq, irqs_resend);
> -                       tasklet_schedule(&resend_tasklet);
> +               WARN_ON_ONCE(1);
> +               /* Set it pending and activate the softirq: */
> +               set_bit(irq, irqs_resend);
> +               tasklet_schedule(&resend_tasklet);
>  #endif
> -               }
>         }
>  }
>
Works fine with:
WARNING: at kernel/irq/resend.c:79 check_irq_resend()

Call Trace:
 [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
 [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
 [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
 [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
 [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
 [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
 [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
 [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
 [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
 [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
 [<ffffffff8054f5c0>] dst_output+0x0/0x10
 [<ffffffff80552aff>] ip_output+0x19f/0x340
 [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
 [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
 [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
 [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
 [<ffffffff80577735>] inet_sendmsg+0x45/0x80
 [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
 [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
 [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
 [<ffffffff802868e9>] vfs_write+0x159/0x170
 [<ffffffff80286ef0>] sys_write+0x50/0x90
 [<ffffffff802097fe>] system_call+0x7e/0x83

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:09               ` Marcin Ślusarz
@ 2007-08-08 11:42                 ` Jarek Poplawski
  2007-08-09  9:19                   ` [patch (testing)] " Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:42 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

Read below please:

On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > So, the let's try this idea yet: modified Ingo's "x86: activate
> > HARDIRQS_SW_RESEND" patch.
> > (Don't forget about make oldconfig before make.)
> > For testing only.
> >
> > Cheers,
> > Jarek P.
> >
> > PS: alas there was not even time for "compile checking"...
> >
> > ---
> >
> > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  config X86_SMP
> >         bool
> >         depends on SMP && !X86_VOYAGER
> > diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> > --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> > @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  menu "Power management options"
> >
> >  source kernel/power/Kconfig
> > diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> > --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> > @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
> >                 desc->depth--;
> >         }
> >         spin_unlock_irqrestore(&desc->lock, flags);
> > +#ifdef CONFIG_HARDIRQS_SW_RESEND
> > +       /*
> > +        * Do a bh disable/enable pair to trigger any pending
> > +        * irq resend logic:
> > +        */
> > +       local_bh_disable();
> > +       local_bh_enable();
> > +#endif
> >  }
> >  EXPORT_SYMBOL(enable_irq);
> >
> > diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> > --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> > @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
> >          */
> >         desc->chip->enable(irq);
> >
> > +       /*
> > +        * Temporary hack to figure out more about the problem, which
> > +        * is causing the ancient network cards to die.
> > +        */
> > +
> >         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
> >                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
> >
> > -               if (!desc->chip || !desc->chip->retrigger ||
> > -                                       !desc->chip->retrigger(irq)) {
> > +               if (desc->handle_irq == handle_edge_irq) {
> > +                       if (desc->chip->retrigger)
> > +                               desc->chip->retrigger(irq);
> > +                       return;
> > +               }
> >  #ifdef CONFIG_HARDIRQS_SW_RESEND
> > -                       /* Set it pending and activate the softirq: */
> > -                       set_bit(irq, irqs_resend);
> > -                       tasklet_schedule(&resend_tasklet);
> > +               WARN_ON_ONCE(1);
> > +               /* Set it pending and activate the softirq: */
> > +               set_bit(irq, irqs_resend);
> > +               tasklet_schedule(&resend_tasklet);
> >  #endif
> > -               }
> >         }
> >  }
> >
> Works fine with:

Very nice! It would be about time this kernel should start behave...

> WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> 
> Call Trace:
>  [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
>  [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
>  [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
>  [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
>  [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
>  [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
>  [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
>  [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
>  [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
>  [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
>  [<ffffffff8054f5c0>] dst_output+0x0/0x10
>  [<ffffffff80552aff>] ip_output+0x19f/0x340
>  [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
>  [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
>  [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
>  [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
>  [<ffffffff80577735>] inet_sendmsg+0x45/0x80
>  [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
>  [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
>  [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
>  [<ffffffff802868e9>] vfs_write+0x159/0x170
>  [<ffffffff80286ef0>] sys_write+0x50/0x90
>  [<ffffffff802097fe>] system_call+0x7e/0x83
> 

So, it looks like x86_64 io_apic's IPI code was unused too long...
I hope it's a piece of cake for Ingo now...

Thanks very much Marcin!

If it's possible for you and Jean-Baptiste, try this today patch
with -rc2, and maybe once more this one patch (-rc1 or older).

Regards,
Jarek P. 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:42                 ` Jarek Poplawski
@ 2007-08-09  9:19                   ` Jarek Poplawski
       [not found]                     ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-09  9:19 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
> Read below please:
> 
> On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > > So, the let's try this idea yet: modified Ingo's "x86: activate
> > > HARDIRQS_SW_RESEND" patch.
> > > (Don't forget about make oldconfig before make.)
> > > For testing only.
...
> > > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> > >         depends on GENERIC_HARDIRQS && SMP
> > >         default y
> > >
> > > +config HARDIRQS_SW_RESEND
...
> > Works fine with:
> 
> Very nice! It would be about time this kernel should start behave...
> 
> > WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> > 
> > Call Trace:
...
> So, it looks like x86_64 io_apic's IPI code was unused too long...
> I hope it's a piece of cake for Ingo now...

So, we know now it's almost definitely something about lapic and IPIs
but, maybe it's not this code to blame...

Here is one more patch to check the possibility it's about the way
the resend edge type irqs are handled by level type handlers:
so, let's check if acking isn't too late...

Marcin and Jean-Baptiste: I would be very glad, as usual! And no need
to hurry; I think we know enough to fix this for you, but maybe this
test could explain if there are errors in lapics or only bad handling.

Many thanks,
Jarek P. 

PS: this patch is very experimental, and only intended for testing.
It should be applied to clean 2.6.23-rc1 or a bit older (eg. 2.6.22)
(so 2.6.23-rc2 or any patches from this thread shouldn't be around) 

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-08 20:49:07.000000000 +0200
@@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
 	unsigned int cpu = smp_processor_id();
 	struct irqaction *action;
 	irqreturn_t action_ret;
+	int edge = 0;
 
 	spin_lock(&desc->lock);
 
 	if (unlikely(desc->status & IRQ_INPROGRESS))
 		goto out;
 
+	if ((desc->status & (IRQ_PENDING | IRQ_REPLAY)) ==
+							IRQ_REPLAY) {
+		desc->chip->ack(irq);
+		edge = 1;
+	}
+
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
@@ -421,7 +428,8 @@ handle_fasteoi_irq(unsigned int irq, str
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
 out:
-	desc->chip->eoi(irq);
+	if (!edge)
+		desc->chip->eoi(irq);
 
 	spin_unlock(&desc->lock);
 }
diff -Nurp 2.6.23-rc1-/kernel/irq/resend.c 2.6.23-rc1/kernel/irq/resend.c
--- 2.6.23-rc1-/kernel/irq/resend.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/resend.c	2007-08-08 20:44:14.000000000 +0200
@@ -57,14 +57,10 @@ void check_irq_resend(struct irq_desc *d
 {
 	unsigned int status = desc->status;
 
-	/*
-	 * Make sure the interrupt is enabled, before resending it:
-	 */
-	desc->chip->enable(irq);
-
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
+		WARN_ON_ONCE(1);
 		if (!desc->chip || !desc->chip->retrigger ||
 					!desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND
@@ -74,4 +70,5 @@ void check_irq_resend(struct irq_desc *d
 #endif
 		}
 	}
+	desc->chip->enable(irq);
 }

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>]

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]                     ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
@ 2007-08-10  7:10                       ` Jarek Poplawski
  2007-08-10 10:43                         ` Marcin Ślusarz
  0 siblings, 1 reply; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-10  7:10 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote:
> 2007/8/9, Jarek Poplawski <jarkao2@o2.pl>:
...
> > diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> > --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-08 20:49:07.000000000 +0200
> > @@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
> >         unsigned int cpu = smp_processor_id();
> >         struct irqaction *action;
> >         irqreturn_t action_ret;
> > +       int edge = 0;
> >
...
> NETDEV WATCHDOG: eth0: transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=351.
> eth0: Resetting the 8390 t=4295229000...<6>NETDEV WATCHDOG: eth0:
> transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=718.
> eth0: Resetting the 8390 t=4295230000...<6>NETDEV WATCHDOG: eth0:
> transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=874.
> etc...

So, we still have to wait for the exact explanation...

Thanks very much Marcin!

I think, there is this one possible for your testing yet?:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

If it's not a great problem it would be interesting to try this with
different CONFIG_HZ too e.g. you could start with 100 (I guess, you
tested very similar thing in 2.6.23-rc2 with 1000(?) already).

Jean-Baptiste: you can skip/break testing of this 'experimental'
patch, too.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  7:10                       ` Jarek Poplawski
@ 2007-08-10 10:43                         ` Marcin Ślusarz
  2007-08-10 11:37                           ` Jarek Poplawski
  0 siblings, 1 reply; 12+ messages in thread
From: Marcin Ślusarz @ 2007-08-10 10:43 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/10, Jarek Poplawski <jarkao2@o2.pl>:
> (..)
> I think, there is this one possible for your testing yet?:
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
I think I already tested this patch, but this thread is sooo big and I
can't find my response...

> If it's not a great problem it would be interesting to try this with
> different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> tested very similar thing in 2.6.23-rc2 with 1000(?) already).
My all tests were done on 2.6.22.1

Marcin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10 10:43                         ` Marcin Ślusarz
@ 2007-08-10 11:37                           ` Jarek Poplawski
  0 siblings, 0 replies; 12+ messages in thread
From: Jarek Poplawski @ 2007-08-10 11:37 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote:
> 2007/8/10, Jarek Poplawski <jarkao2@o2.pl>:
> > (..)
> > I think, there is this one possible for your testing yet?:
> > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > Date: Wed, 8 Aug 2007 13:00:37 +0200
> I think I already tested this patch, but this thread is sooo big and I
> can't find my response...

I think it was very similar Ingo's patch, which after your testing
is in 2.6.23-rc2 now. I've moved return for level type irqs a little
later, and it works a new way only for "x86_64".

But, I think, now this patch is less important: if you find some time,
try mostly new Ingo's or Thomas' patches (latest possible versions).

> 
> > If it's not a great problem it would be interesting to try this with
> > different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> > tested very similar thing in 2.6.23-rc2 with 1000(?) already).
> My all tests were done on 2.6.22.1

Fine! If it's not said otherwise this version should be appropriate for
most of these patches. But, please, send your current configs on next
posibility (dmesg, /proc/interrupts and .config).

Bye (till monday),
Jarek P.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-08-10 11:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-10  8:15 [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
2007-08-10  8:37 ` Jarek Poplawski
2007-08-10  8:48   ` Ingo Molnar
2007-08-10  9:03     ` Jarek Poplawski
2007-08-10  9:08       ` Ingo Molnar
2007-08-10  9:19         ` Jarek Poplawski
2007-08-10  9:38           ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2007-08-10  8:41 Jean-Baptiste Vignaud
2007-07-31 13:20 Jarek Poplawski
2007-08-06  7:00 ` Marcin Ślusarz
2007-08-06  7:03   ` Ingo Molnar
2007-08-07  7:46     ` Marcin Ślusarz
2007-08-07  8:23       ` Jarek Poplawski
     [not found]         ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
2007-08-07  9:52           ` Jarek Poplawski
2007-08-07 12:13             ` Jarek Poplawski
2007-08-08 11:09               ` Marcin Ślusarz
2007-08-08 11:42                 ` Jarek Poplawski
2007-08-09  9:19                   ` [patch (testing)] " Jarek Poplawski
     [not found]                     ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
2007-08-10  7:10                       ` Jarek Poplawski
2007-08-10 10:43                         ` Marcin Ślusarz
2007-08-10 11:37                           ` Jarek Poplawski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).