* Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-16 9:59 Andrew Morton
2005-05-16 10:41 ` Jian Jun He
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Andrew Morton @ 2005-05-16 9:59 UTC
To: netdev; +Cc: hejianj, linuxppc64-dev, Anton Blanchard
Might be a bug in the e100 driver, might not be.
I assume this is the
BUG_ON(skb->list != NULL);
in __kfree_skb(), although the line number is off-by-one, and the
.__kfree_skb+0x188/0x240 would tend to contradict that. Anton, can you
help work out where we went splat please?
tx timeouts are fairly rare events, so this might not be a recently-added
bug.
Do we know if it is repeatable?
Begin forwarded message:
Date: Mon, 16 May 2005 02:44:04 -0700
From: bugme-daemon@osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
http://bugme.osdl.org/show_bug.cgi?id=4628
Summary: Test server hang while running rhr (network) test on
RHEL4 with kernel 2.6.12-rc1-mm4
Kernel Version: 2.6.12-rc1 with mm4 patch
Status: NEW
Severity: normal
Owner: anton@samba.org
Submitter: hejianj@cn.ibm.com
CC: hanwenb@cn.ibm.com, mridge@us.ibm.com, rende@cn.ibm.com, wangjs@cn.ibm.com
Distribution:
RHEL4 with kernel 2.6.12-rc1-mm4
Hardware Environment:
IBM OpenPower (CHRP IBM,9124-720)
Software Environment:
RHEL4
RHR: rhr2-rhel4-1.0-14a.noarch.rpm
Problem Description:
The test server hangs while running the rhr (network) test on RHEL4 with kernel
2.6.12-rc1-mm4.
Steps to reproduce:
1. Download the 2.6.12-rc1 kernel and the 2.6.12-rc1-mm4 patch from kernel.org,
then build the kernel on the OpenPower 720.
2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install it on
the test machine.
3. Configure and run the rhr test by invoking redhat-ready.
Additional information:
Here is the backtrace from xmon.
3:mon> e
cpu 0x3: Vector: 700 (Program Check) at [c00000000ffe7920]
pc: c00000000029632c: .__kfree_skb+0x188/0x240
lr: c000000000296328: .__kfree_skb+0x184/0x240
sp: c00000000ffe7ba0
msr: 8000000000029032
current = 0xc000000107f94040
paca = 0xc000000000431c00
pid = 0, comm = swapper
kernel BUG in __kfree_skb at net/core/skbuff.c:282!
3:mon> t
[c00000000ffe7c40] d0000000000ebac4 .e100_rx_clean_list+0xa0/0x144 [e100]
[c00000000ffe7ce0] d0000000000ed6dc .e100_tx_timeout+0x7c/0xb0 [e100]
[c00000000ffe7d70] c0000000002b87bc .dev_watchdog+0xc8/0x154
[c00000000ffe7e00] c00000000006d6b4 .run_timer_softirq+0x180/0x298
[c00000000ffe7ed0] c0000000000667d8 .__do_softirq+0xdc/0x1b8
[c00000000ffe7f90] c000000000014bf0 .call_do_softirq+0x14/0x24
[c000000086b43860] c0000000000102c4 .do_softirq+0x98/0xac
[c000000086b438f0] c0000000000669cc .irq_exit+0x70/0x8c
[c000000086b43970] c000000000011fb8 .timer_interrupt+0x398/0x47c
[c000000086b43a90] c00000000000a2b4 decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c000000000010554 .dedicated_idle+0x114/0x280
[c000000086b43e80] c0000000000108c8 .cpu_idle+0x3c/0x54
[c000000086b43f00] c00000000003cc8c .start_secondary+0x108/0x148
[c000000086b43f90] c00000000000bd84 .enable_64b_mode+0x0/0x28
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 9:59 Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4 Andrew Morton
@ 2005-05-16 10:41 ` Jian Jun He
2005-05-16 11:00 ` Herbert Xu
2005-05-24 18:36 ` Ganesh Venkatesan
2 siblings, 0 replies; 24+ messages in thread
From: Jian Jun He @ 2005-05-16 10:41 UTC
To: Andrew Morton
Cc: Anton Blanchard, linuxppc64-dev, netdev, Dang En Ren,
Jia Sen Wang, Lei CDL Wang
This is a reproducible defect.
At first I could not believe that the server would hang, but I reran the rhr
test and the server hung again, so I captured the backtrace from xmon.
BTW, the e100 driver version is 3.3.6-k2.
To Andrew:
I am re-sending this mail with the CC list included. Thanks.
Best Regards!
Jian Jun He
CSDL, Beijing
Email: hejianj@cn.ibm.com
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 9:59 Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4 Andrew Morton
2005-05-16 10:41 ` Jian Jun He
@ 2005-05-16 11:00 ` Herbert Xu
2005-05-16 17:43 ` Ganesh Venkatesan
2005-05-24 18:36 ` Ganesh Venkatesan
2 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2005-05-16 11:00 UTC
To: Andrew Morton; +Cc: netdev, hejianj, linuxppc64-dev, anton, jgarzik
Andrew Morton <akpm@osdl.org> wrote:
>
> Might be a bug in the e100 driver, might not be.
>
> I assume this is the
>
> BUG_ON(skb->list != NULL);
It certainly is a bug in e100.
e100_tx_timeout -> e100_down -> e100_rx_clean_list
is racing against
e100_poll -> e100_rx_clean -> e100_rx_indicate
e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
while it's being processed e100_rx_clean_list comes along and
frees it.
From a quick check similar problems may exist in other drivers that
have lockless ->poll() functions with RX rings.
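As a rough illustration of the interleaving described above, the following user-space C sketch models it deterministically. It is not e100 code: `struct buf`, `struct ring`, `poll_begin`, `poll_end`, `clean_list` and `simulate_interleaving` are hypothetical names, and the `queued` flag merely stands in for the condition that the `BUG_ON(skb->list != NULL)` check tests.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

/* Hypothetical stand-in for an skb; only the state needed to show the bug. */
struct buf {
	int queued;	/* set while the poll path is still processing the buffer */
	int freed;	/* set once the cleanup path has released the buffer */
};

struct ring {
	struct buf storage[RING_SIZE];
	struct buf *slot[RING_SIZE];	/* NULL once a slot has been detached */
};

static void ring_init(struct ring *r)
{
	for (size_t i = 0; i < RING_SIZE; i++) {
		r->storage[i].queued = 0;
		r->storage[i].freed = 0;
		r->slot[i] = &r->storage[i];
	}
}

/* Poll path, first half: start processing a buffer. The ring slot still
 * points at it, which is the window the race needs. */
static struct buf *poll_begin(struct ring *r, size_t i)
{
	struct buf *b = r->slot[i];

	assert(b != NULL);
	b->queued = 1;
	return b;
}

/* Poll path, second half: processing is done, detach the slot. */
static void poll_end(struct ring *r, size_t i)
{
	r->slot[i]->queued = 0;
	r->slot[i] = NULL;
}

/* Timeout path: free everything the ring still references. Returns how many
 * buffers were freed while still queued, i.e. how often a BUG_ON-style
 * check would have fired. */
static int clean_list(struct ring *r)
{
	int violations = 0;

	for (size_t i = 0; i < RING_SIZE; i++) {
		struct buf *b = r->slot[i];

		if (b == NULL)
			continue;
		if (b->queued)
			violations++;
		b->freed = 1;
		r->slot[i] = NULL;
	}
	return violations;
}

/* racy != 0: cleanup runs between poll_begin() and poll_end().
 * racy == 0: the poll finishes before cleanup starts. */
int simulate_interleaving(int racy)
{
	struct ring r;

	ring_init(&r);
	poll_begin(&r, 0);
	if (racy)
		return clean_list(&r);
	poll_end(&r, 0);
	return clean_list(&r);
}
```

Under these assumptions the racy ordering frees one buffer that is still being processed, while the ordered variant frees none.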
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 11:00 ` Herbert Xu
@ 2005-05-16 17:43 ` Ganesh Venkatesan
2005-05-16 21:29 ` Herbert Xu
[not found] ` <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>
0 siblings, 2 replies; 24+ messages in thread
From: Ganesh Venkatesan @ 2005-05-16 17:43 UTC
To: Herbert Xu; +Cc: Andrew Morton, netdev, hejianj, linuxppc64-dev, anton, jgarzik
Jian:
Could you try the e100 from
http://prdownloads.sourceforge.net/e1000/e100-3.4.8.tar.gz?download?
This (e100 3.4.8) has a fix for the problem you've encountered.
Specifically this driver uses netif_poll_{enable|disable} to avoid the
race.
static int e100_up(struct nic *nic)
{
@@ -1688,13 +1753,18 @@ static int e100_up(struct nic *nic)
 	if((err = e100_hw_init(nic)))
 		goto err_clean_cbs;
 	e100_set_multicast_list(nic->netdev);
-	e100_start_receiver(nic);
+	e100_start_receiver(nic, 0);
 	mod_timer(&nic->watchdog, jiffies);
 	if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ,
 		nic->netdev->name, nic->netdev)))
 		goto err_no_irq;
-	e100_enable_irq(nic);
 	netif_wake_queue(nic->netdev);
+#ifdef CONFIG_E100_NAPI
+	netif_poll_enable(nic->netdev);
+	/* enable ints _after_ enabling poll, preventing a race between
+	 * disable ints+schedule */
+#endif
+	e100_enable_irq(nic);
 	return 0;
 err_no_irq:
@@ -1708,11 +1778,15 @@ err_rx_clean_list:
 static void e100_down(struct nic *nic)
 {
+#ifdef CONFIG_E100_NAPI
+	/* wait here for poll to complete */
+	netif_poll_disable(nic->netdev);
+#endif
+	netif_stop_queue(nic->netdev);
 	e100_hw_reset(nic);
 	free_irq(nic->pdev->irq, nic->netdev);
 	del_timer_sync(&nic->watchdog);
 	netif_carrier_off(nic->netdev);
-	netif_stop_queue(nic->netdev);
 	e100_clean_cbs(nic);
 	e100_rx_clean_list(nic);
ganesh.
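The ordering that the patch above aims for (stop new polls and let a running one finish before the RX list is cleaned) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct nic_model` and all function names are hypothetical, the single-threaded `poll_disable()` "waits" by simply completing the in-flight poll, and nothing here models whether the caller is allowed to sleep.

```c
#include <assert.h>

/* Hypothetical model of the driver state relevant to the race. */
struct nic_model {
	int poll_enabled;	/* new polls may start */
	int poll_in_flight;	/* a poll is currently using an RX buffer */
};

/* Returns 1 if a poll started, 0 if polling is disabled. */
int poll_start(struct nic_model *n)
{
	if (!n->poll_enabled)
		return 0;
	n->poll_in_flight = 1;
	return 1;
}

/* Completes the poll that is currently running. */
void poll_finish(struct nic_model *n)
{
	assert(n->poll_in_flight);
	n->poll_in_flight = 0;
}

/* Stand-in for a "disable poll and wait" primitive: blocks new polls and,
 * in this single-threaded model, waits by completing the running one. */
void poll_disable(struct nic_model *n)
{
	n->poll_enabled = 0;
	if (n->poll_in_flight)
		poll_finish(n);
}

/* Returns 0 if the RX list could be cleaned safely, -1 if a poll still
 * held a buffer (the situation that leads to the BUG). */
int rx_clean_list(const struct nic_model *n)
{
	return n->poll_in_flight ? -1 : 0;
}

/* quiesce_first selects between the old and the new ordering. */
int nic_down(struct nic_model *n, int quiesce_first)
{
	if (quiesce_first)
		poll_disable(n);
	return rx_clean_list(n);
}
```

With a poll in flight, `nic_down(&n, 0)` reports the unsafe cleanup and `nic_down(&n, 1)` does not; afterwards `poll_start()` refuses to start a new poll.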
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 17:43 ` Ganesh Venkatesan
@ 2005-05-16 21:29 ` Herbert Xu
2005-05-16 21:58 ` Jeff Garzik
[not found] ` <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>
1 sibling, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2005-05-16 21:29 UTC
To: Ganesh Venkatesan
Cc: Andrew Morton, netdev, hejianj, linuxppc64-dev, anton, jgarzik
On Mon, May 16, 2005 at 10:43:02AM -0700, Ganesh Venkatesan wrote:
>
> @@ -1708,11 +1778,15 @@ err_rx_clean_list:
>
> static void e100_down(struct nic *nic)
> {
> +#ifdef CONFIG_E100_NAPI
> + /* wait here for poll to complete */
> + netif_poll_disable(nic->netdev);
> +#endif
Sorry, you can't do that here since you're in softirq context and
netif_poll_disable may sleep.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 21:29 ` Herbert Xu
@ 2005-05-16 21:58 ` Jeff Garzik
0 siblings, 0 replies; 24+ messages in thread
From: Jeff Garzik @ 2005-05-16 21:58 UTC
To: Herbert Xu
Cc: Ganesh Venkatesan, Andrew Morton, netdev, hejianj, linuxppc64-dev,
anton
Herbert Xu wrote:
> On Mon, May 16, 2005 at 10:43:02AM -0700, Ganesh Venkatesan wrote:
>
>>@@ -1708,11 +1778,15 @@ err_rx_clean_list:
>>
>> static void e100_down(struct nic *nic)
>> {
>>+#ifdef CONFIG_E100_NAPI
>>+ /* wait here for poll to complete */
>>+ netif_poll_disable(nic->netdev);
>>+#endif
>
>
> Sorry, you can't do that here since you're in softirq context and
> netif_poll_disable may sleep.
I think the intention is that e100_down() may sleep, from looking at all
the callsites.
Only e100_tx_timeout() calls it in a context that prevents sleep.
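One general way to reconcile the two points above is to have the timeout handler only queue the reset and let a worker run it later in a context where sleeping is allowed. The user-space C sketch below models that idea; it is not the e100 driver's code and not necessarily the fix that was adopted. All names (`current_ctx`, `quiesce_may_sleep`, `defer_work`, `reset_device`, `tx_timeout`, `run_pending_work`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical execution contexts: only process context may sleep. */
enum ctx { CTX_PROCESS, CTX_SOFTIRQ };
static enum ctx current_ctx = CTX_PROCESS;

/* Models a call that may sleep: 0 on success, -1 if sleeping is forbidden. */
static int quiesce_may_sleep(void)
{
	return current_ctx == CTX_PROCESS ? 0 : -1;
}

typedef int (*work_fn)(void);

#define MAX_WORK 8
static work_fn pending[MAX_WORK];
static size_t n_pending;

/* Queue a function to be run later by run_pending_work(). */
static int defer_work(work_fn fn)
{
	if (n_pending == MAX_WORK)
		return -1;
	pending[n_pending++] = fn;
	return 0;
}

/* Stand-in for a down/up cycle that needs the sleeping quiesce step. */
static int reset_device(void)
{
	return quiesce_may_sleep();
}

/* Timeout handler as invoked from a timer (softirq in this model).
 * defer == 0: reset inline, which fails because sleeping is not allowed.
 * defer != 0: only queue the reset for later. */
int tx_timeout(int defer)
{
	enum ctx saved = current_ctx;
	int rc;

	current_ctx = CTX_SOFTIRQ;
	rc = defer ? defer_work(reset_device) : reset_device();
	current_ctx = saved;
	return rc;
}

/* Worker: runs queued items in process context, returns the failure count. */
int run_pending_work(void)
{
	int failures = 0;

	assert(current_ctx == CTX_PROCESS);
	for (size_t i = 0; i < n_pending; i++)
		if (pending[i]() != 0)
			failures++;
	n_pending = 0;
	return failures;
}
```

In this model the inline reset fails, while the deferred reset is accepted by the handler and then succeeds when the worker runs it.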
Jeff
[parent not found: <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>]
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
[not found] ` <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>
@ 2005-05-26 7:38 ` Andrew Morton
2005-05-26 7:53 ` Jeff Garzik
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2005-05-26 7:38 UTC
To: Jian Jun He, john.ronciak, ganesh.venkatesan, jesse.brandeburg
Cc: ganesh.venkatesan, anton, herbert, jgarzik, linuxppc64-dev,
netdev, rende, wangjs, cdlwangl
Jian Jun He <hejianj@cn.ibm.com> wrote:
>
> I downloaded e100-3.4.8 and installed it on the test machines (both client and
> server), but the server still hangs while running the rhr (network) test. :(
e100 is one of those drivers which we'd rather like to have working properly.
Can we please confirm that a) this bug is not fixed in 2.6.12-rc5 and b)
nobody has seen a patch which fixes it?
For reference:
Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Might be a bug in the e100 driver, might not be.
> >
> > I assume this is the
> >
> > BUG_ON(skb->list != NULL);
>
> It certainly is a bug in e100.
>
> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>
> is racing against
>
> e100_poll -> e100_rx_clean -> e100_rx_indicate
>
> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
> while it's being processed e100_rx_clean_list comes along and
> frees it.
>
> From a quick check similar problems may exist in other drivers that
> have lockless ->poll() functions with RX rings.
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 7:38 ` Andrew Morton
@ 2005-05-26 7:53 ` Jeff Garzik
0 siblings, 0 replies; 24+ messages in thread
From: Jeff Garzik @ 2005-05-26 7:53 UTC
To: Andrew Morton
Cc: Jian Jun He, john.ronciak, ganesh.venkatesan, jesse.brandeburg,
ganesh.venkatesan, anton, herbert, linuxppc64-dev, netdev, rende,
wangjs, cdlwangl
Andrew Morton wrote:
> Jian Jun He <hejianj@cn.ibm.com> wrote:
>
>> I downloaded e100-3.4.8 and installed it on the test machines (both client and
>> server), but the server still hangs while running the rhr (network) test. :(
>
>
> e100 is one of those drivers which we'd rather like to have working properly.
>
> Can we please confirm that a) this bug is not fixed in 2.6.12-rc5 and b)
> nobody has seen a patch which fixes it?
2.6.12-rc5-git1 should have an e100 update in it, too...
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-16 9:59 Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4 Andrew Morton
2005-05-16 10:41 ` Jian Jun He
2005-05-16 11:00 ` Herbert Xu
@ 2005-05-24 18:36 ` Ganesh Venkatesan
2005-05-25 3:21 ` Jian Jun He
2 siblings, 1 reply; 24+ messages in thread
From: Ganesh Venkatesan @ 2005-05-24 18:36 UTC
To: Andrew Morton; +Cc: netdev, hejianj, linuxppc64-dev, Anton Blanchard
Could you tell me what the rhr test is? Is this an IBM internal test tool?
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-24 18:36 ` Ganesh Venkatesan
@ 2005-05-25 3:21 ` Jian Jun He
0 siblings, 0 replies; 24+ messages in thread
From: Jian Jun He @ 2005-05-25 3:21 UTC
To: Ganesh Venkatesan
Cc: Andrew Morton, Anton Blanchard, linuxppc64-dev, netdev,
Dang En Ren, Lei CDL Wang, Jia Sen Wang
"rhr" is a "Red Hat Certification Testing Suite". You can download it from
redhat.com.
Best Regards!
Jian Jun He
CSDL, Beijing
Email: hejianj@cn.ibm.com
* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-26 13:00 Venkatesan, Ganesh
2005-05-26 16:09 ` Jian Jun He
0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-26 13:00 UTC
To: Andrew Morton, Jian Jun He, Ronciak, John, Brandeburg, Jesse
Cc: ganesh.venkatesan, anton, herbert, jgarzik, linuxppc64-dev,
netdev, rende, wangjs, cdlwangl
Jian:
We need more information on the test you run to get this hang. We have
not seen a hang similar to the one you describe. We have a p-series
machine in our lab and all we need is details on the test to run.
Thanks,
Ganesh.
* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 13:00 Venkatesan, Ganesh
@ 2005-05-26 16:09 ` Jian Jun He
2005-05-26 20:31 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Jian Jun He @ 2005-05-26 16:09 UTC
To: Venkatesan, Ganesh
Cc: Andrew Morton, anton, Dang En Ren, ganesh.venkatesan, herbert,
Brandeburg, Jesse, jgarzik, Jia Sen Wang, Ronciak, John,
Lei CDL Wang, linuxppc64-dev, netdev
Hello Ganesh,
Here is the detailed information about the problem. I will verify the defect
on 2.6.12-rc5 with the mm1 patch.
Thank you for your attention.
---------------------------------------
Distribution:
RHEL4 with kernel 2.6.12-rc1-mm4
Hardware Environment:
IBM OpenPower (CHRP IBM,9124-720)
Software Environment:
RHEL4
RHR: rhr2-rhel4-1.0-14a.noarch.rpm
Problem Description:
The test server hangs while running the rhr (network) test on RHEL4 with kernel
2.6.12-rc1-mm4.
Steps to reproduce:
1. Download the 2.6.12-rc1 kernel and the 2.6.12-rc1-mm4 patch from kernel.org,
then build the kernel on the OpenPower 720.
2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install it on
the test machine.
3. Configure and run the rhr test by invoking redhat-ready.
Additional information:
Here is the backtrace from xmon.
3:mon> e
cpu 0x3: Vector: 700 (Program Check) at [c00000000ffe7920]
pc: c00000000029632c: .__kfree_skb+0x188/0x240
lr: c000000000296328: .__kfree_skb+0x184/0x240
sp: c00000000ffe7ba0
msr: 8000000000029032
current = 0xc000000107f94040
paca = 0xc000000000431c00
pid = 0, comm = swapper
kernel BUG in __kfree_skb at net/core/skbuff.c:282!
3:mon> t
[c00000000ffe7c40] d0000000000ebac4 .e100_rx_clean_list+0xa0/0x144 [e100]
[c00000000ffe7ce0] d0000000000ed6dc .e100_tx_timeout+0x7c/0xb0 [e100]
[c00000000ffe7d70] c0000000002b87bc .dev_watchdog+0xc8/0x154
[c00000000ffe7e00] c00000000006d6b4 .run_timer_softirq+0x180/0x298
[c00000000ffe7ed0] c0000000000667d8 .__do_softirq+0xdc/0x1b8
[c00000000ffe7f90] c000000000014bf0 .call_do_softirq+0x14/0x24
[c000000086b43860] c0000000000102c4 .do_softirq+0x98/0xac
[c000000086b438f0] c0000000000669cc .irq_exit+0x70/0x8c
[c000000086b43970] c000000000011fb8 .timer_interrupt+0x398/0x47c
[c000000086b43a90] c00000000000a2b4 decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c000000000010554 .dedicated_idle+0x114/0x280
[c000000086b43e80] c0000000000108c8 .cpu_idle+0x3c/0x54
[c000000086b43f00] c00000000003cc8c .start_secondary+0x108/0x148
[c000000086b43f90] c00000000000bd84 .enable_64b_mode+0x0/0x28
Best Regards!
Jian Jun He
CSDL, Beijing
Email: hejianj@cn.ibm.com
"Venkatesan,
Ganesh"
<ganesh.venkatesa To
n@intel.com> "Andrew Morton" <akpm@osdl.org>,
Jian Jun He/China/Contr/IBM@IBMCN,
2005-05-26 21:00 "Ronciak, John"
<john.ronciak@intel.com>,
"Brandeburg, Jesse"
<jesse.brandeburg@intel.com>
cc
<ganesh.venkatesan@gmail.com>,
<anton@samba.org>,
<herbert@gondor.apana.org.au>,
<jgarzik@pobox.com>,
<linuxppc64-dev@lists.linuxppc.org.
sgi.com>, <netdev@oss.sgi.com>,
Dang En Ren/China/IBM@IBMCN, Jia
Sen Wang/China/IBM@IBMCN, Lei CDL
Wang/China/Contr/IBM@IBMCN
Subject
RE: Fw: [Bugme-new] [Bug 4628] New:
Test server hang while running rhr
(network) test on RHEL4 with kernel
2.6.12-rc1-mm4
Jian:
We need more information on the test you run to get this hang. We have
not seen a hang similar to the one you describe. We have a p-series
machine in our lab and all we need is details on the test to run.
Thanks,
Ganesh.
>-----Original Message-----
>From: Andrew Morton [mailto:akpm@osdl.org]
>Sent: Thursday, May 26, 2005 12:38 AM
>To: Jian Jun He; Ronciak, John; Venkatesan, Ganesh; Brandeburg, Jesse
>Cc: ganesh.venkatesan@gmail.com; anton@samba.org;
>herbert@gondor.apana.org.au; jgarzik@pobox.com; linuxppc64-
>dev@lists.linuxppc.org.sgi.com; netdev@oss.sgi.com; rende@cn.ibm.com;
>wangjs@cn.ibm.com; cdlwangl@cn.ibm.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while
>running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>Jian Jun He <hejianj@cn.ibm.com> wrote:
>>
>> I download e100-3.4.8 and installed in the test machine (both client
>> and server). But the server still hang while running rhr (network)
>> test. :(
>
>e100 is one of those drivers which we'd rather like to have working
>properly.
>
>Can we please confirm that a) this bug is not fixed in 2.6.12-rc5 and
>b) nobody has seen a patch which fixes it?
>
>
>
>For reference:
>
>Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> Andrew Morton <akpm@osdl.org> wrote:
>> >
>> > Might be a bug in the e100 driver, might not be.
>> >
>> > I assume this is the
>> >
>> > BUG_ON(skb->list != NULL);
>>
>> It certainly is a bug in e100.
>>
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>>
>> is racing against
>>
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>>
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>>
>> >From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 16:09 ` Jian Jun He
@ 2005-05-26 20:31 ` Andrew Morton
2005-05-27 6:18 ` Jian Jun He
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2005-05-26 20:31 UTC (permalink / raw)
To: Jian Jun He
Cc: ganesh.venkatesan, anton, rende, ganesh.venkatesan, herbert,
jesse.brandeburg, jgarzik, wangjs, john.ronciak, cdlwangl,
linuxppc64-dev, netdev
Jian Jun He <hejianj@cn.ibm.com> wrote:
>
> 2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install
> it on
> the test machine.
> 3. Configure and run the rhr test via invoking redhat-ready.
This is the problematic bit.
- Please provide a full URL which can be used to obtain rhr.
rhn.redhat.com is subscription-based.
- Please describe the hardware setup - surely the test requires at least
two machines. How are they configured?
- Provide an exact transcript of the commands which are to be used. Is
it just
redhat-ready
with no arguments?
All that being said, we already have a quite specific diagnosis via code
inspection, from Herbert:
Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Might be a bug in the e100 driver, might not be.
> >
> > I assume this is the
> >
> > BUG_ON(skb->list != NULL);
>
> It certainly is a bug in e100.
>
> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>
> is racing against
>
> e100_poll -> e100_rx_clean -> e100_rx_indicate
>
> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
> while it's being processed e100_rx_clean_list comes along and
> frees it.
>
> From a quick check similar problems may exist in other drivers that
> have lockless ->poll() functions with RX rings.
Do the e100 maintainers agree with this diagnosis? If so then more testing
isn't required at this stage - the next step is to fix the above bug, no?
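The hazard in Herbert's diagnosis is that two paths treat the same RX buffer as theirs: the poll path is still processing it while the teardown path frees everything the ring references. The following is a minimal single-threaded userspace sketch of the ownership idea only. It is not e100 code, and every name in it (`rx_ring`, `ring_take`, `ring_clean_list`) is hypothetical. It shows that a buffer detached from its slot before processing is no longer visible to the cleanup pass. In concurrent code the detach itself would still need to be serialised against the cleanup.

```c
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4

/* Hypothetical stand-in for an RX ring: each slot owns at most one buffer. */
struct rx_ring {
    void *slot[RING_SIZE];
};

/* Fill every slot with a freshly allocated buffer; returns 0 on success. */
int ring_fill(struct rx_ring *r, size_t len)
{
    for (int i = 0; i < RING_SIZE; i++)
        r->slot[i] = NULL;
    for (int i = 0; i < RING_SIZE; i++) {
        r->slot[i] = malloc(len);
        if (!r->slot[i])
            return -1;  /* caller can still run ring_clean_list() */
    }
    return 0;
}

/* Poll path: detach the buffer from slot i. The ring no longer owns it,
 * so the caller is responsible for freeing it after processing. */
void *ring_take(struct rx_ring *r, int i)
{
    void *buf = r->slot[i];
    r->slot[i] = NULL;
    return buf;
}

/* Teardown path: free whatever the ring still owns; returns the count. */
int ring_clean_list(struct rx_ring *r)
{
    int freed = 0;
    for (int i = 0; i < RING_SIZE; i++) {
        if (r->slot[i]) {
            free(r->slot[i]);
            r->slot[i] = NULL;
            freed++;
        }
    }
    return freed;
}
```

With one buffer taken by the poll path, `ring_clean_list()` frees only the remaining three, and the taken buffer is freed exactly once by its new owner.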
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 20:31 ` Andrew Morton
@ 2005-05-27 6:18 ` Jian Jun He
2005-05-27 8:21 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Jian Jun He @ 2005-05-27 6:18 UTC (permalink / raw)
To: Andrew Morton
Cc: anton, Dang En Ren, ganesh.venkatesan, ganesh.venkatesan, herbert,
jesse.brandeburg, jgarzik, Jia Sen Wang, john.ronciak,
Lei CDL Wang, linuxppc64-dev, netdev
[-- Attachment #1.1.1: Type: text/plain, Size: 4505 bytes --]
Hello,
I verified the problem on 2.6.12-rc5 with the mm1 patch. The test server works
fine during the test procedure.
So I will close this defect in bugme.
Thanks to all for your attention to this defect.
To Andrew Morton:
1) If you are registered at rhn.redhat.com, you can search for the package
"rhr2" and download it.
You can also download rhr2 from the following link:
http://people.redhat.com/rlandry/rhr2/test/1.0-17beta/rhr2-1.0-17beta.noarch.rpm
2) The attachments are the conf files that I used for the rhr2 test.
(See attached file: hardware.conf)(See attached file: rhr.conf)(See
attached file: system.conf)(See attached file: tests.conf)
3) Invoking redhat-ready with no arguments is correct.
Best Regards!
Jian Jun He
CSDL, Beijing
Email: hejianj@cn.ibm.com
From: Andrew Morton <akpm@osdl.org>
Date: 2005-05-27 04:31
To: Jian Jun He/China/Contr/IBM@IBMCN
Cc: ganesh.venkatesan@intel.com, anton@samba.org, Dang En Ren/China/IBM@IBMCN,
    ganesh.venkatesan@gmail.com, herbert@gondor.apana.org.au,
    jesse.brandeburg@intel.com, jgarzik@pobox.com, Jia Sen Wang/China/IBM@IBMCN,
    john.ronciak@intel.com, Lei CDL Wang/China/Contr/IBM@IBMCN,
    linuxppc64-dev@lists.linuxppc.org.sgi.com, netdev@oss.sgi.com
Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr
    (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
Jian Jun He <hejianj@cn.ibm.com> wrote:
>
> 2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and
> install it on
> the test machine.
> 3. Configure and run the rhr test via invoking redhat-ready.
This is the problematic bit.
- Please provide a full URL which can be used to obtain rhr.
rhn.redhat.com is subscription-based.
- Please describe the hardware setup - surely the test requires at least
two machines. How are they configured?
- Provide an exact transcript of the commands which are to be used. Is
it just
redhat-ready
with no arguments?
All that being said, we already have a quite specific diagnosis via code
inspection, from Herbert:
Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Might be a bug in the e100 driver, might not be.
> >
> > I assume this is the
> >
> > BUG_ON(skb->list != NULL);
>
> It certainly is a bug in e100.
>
> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>
> is racing against
>
> e100_poll -> e100_rx_clean -> e100_rx_indicate
>
> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
> while it's being processed e100_rx_clean_list comes along and
> frees it.
>
> From a quick check similar problems may exist in other drivers that
> have lockless ->poll() functions with RX rings.
Do the e100 maintainers agree with this diagnosis? If so then more testing
isn't required at this stage - the next step is to fix the above bug, no?
[-- Attachment #2: hardware.conf --]
[-- Type: application/octet-stream, Size: 1672 bytes --]
#
# hardware.conf - generated on Fri May 27 00:41:46 EDT 2005
#
# DESCRIPTION: This file contains the hardware class followed by any options
# required by that class tests; usually to include the device or number of
# devices to tests.
#
# CHANGES: The largest change is that this file is now generated instead of
# manually populated. With any luck this should avoid the need for a person
# to figure it out. The format is largely intact (compared with rhr.conf)
# with two exceptions; the requirement to list the same device multiple times
# has been lifted (you may now assume the tests will do the right thing) and
# the test specific configuration data is now in /etc/rhr/tests.conf; more akin
# to rhr-crush and its /etc/rhr/kernel.
# NETWORK if1 [...ifN]
# Requires a server and remote user as configured in /etc/rhr/tests.conf.
# The remote user will need ab and mount privileges; the server is expected
# to NOT be in a production environment as no password logins are enabled
# during testing.
#
# *NOTE: On s390/s390x you may need to manually create the network scripts.
# See the RELEASE-NOTES for details.
NETWORK eth0(100)
# STORAGE - list devices; includes both fixed disks, removable, and MTD's.
STORAGE sdc
# CORE (no parameters)
# While there no parameters here; a kernel source RPM package is required for
# testing. The package location is configured in /etc/rhr/tests.conf and is
# shared by MEMORY
CORE
# MEMORY (no parameters)
# While there no parameters here; a kernel source RPM package is required for
# testing. The package location is configured in /etc/rhr/tests.conf and is
# shared by CORE.
MEMORY
[-- Attachment #3: rhr.conf --]
[-- Type: application/octet-stream, Size: 509 bytes --]
#
# rhr.conf - generated Sun Aug 10 23:04:45 EDT 2003
#
# Description: Allows for the use of alternate locations for the scratch and
# result files. This prevents the need for bind-mounts and any other voodoo
# that could adversely affect the tests. Defaults have been coded into the
# application so configuration is not required. To override these values,
# uncomment the setting line and modify as needed.
# runtime temporary files.
SCRATCH=/tmp/rhr
# results package location.
RESULTS=/redhatready
[-- Attachment #4: system.conf --]
[-- Type: application/octet-stream, Size: 226 bytes --]
#
# system.conf - generated on Fri May 27 00:41:35 EDT 2005
#
# DESCRIPTION: Contains the make/model/release combination for HCL listing
# and results package creation.
MAKE="make"
MODEL="IBM,9124-720"
RHRELEASE="Nahant"
[-- Attachment #5: tests.conf --]
[-- Type: application/octet-stream, Size: 1253 bytes --]
#
# tests.conf - test specific configuration file
#
#
# CORE
#
# SRPM - Required kernel source rpm package; if not set, CORE will attempt
# to locate and use the first kernel-*.src.rpm file it finds in /tmp.
#
SRPM="/tmp/kernel-2.6.9-1.906.2.1_EL.patchtest.15.src.rpm"
#
# CDROM
#
# ISO - Used for cd-r and cd-rw testing; cdrom testing requires a physical CD.
# We're not specific on what has to be on the iso; however it should be
# of a reasonable size. Generally we recommend disc 1 of the install set.
# Like SRPM above; if not set we'll settle for what we can find in /tmp.
#
# NOTE: IF cd-r(w) and dvd-r(w) are to be run in the same pass, each iso
# must be specified below.
#
#CDISO="/tmp/taroon-i386-as-disc1.iso"
#DVDISO="/tmp/taroon-i386-as-disc1.iso"
#
# NETWORK:
#
# RHRUSER - The user used to log into the remote machine; should have
# an ssh shell account and permissions to execute ab (apache bench),
# and mount file systems. The default user is root.
#
RHRUSER="root"
#
# SERVER - The remote machine to be contacted; should contain the RHRUSER and
# have apache bench (ab) and be able to mount NFS; defaults to
# 192.168.0.1.
#
SERVER="9.3.189.61"
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-27 6:18 ` Jian Jun He
@ 2005-05-27 8:21 ` Andrew Morton
2005-05-27 10:12 ` Jian Jun He
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2005-05-27 8:21 UTC (permalink / raw)
To: Jian Jun He
Cc: anton, rende, ganesh.venkatesan, ganesh.venkatesan, herbert,
jesse.brandeburg, jgarzik, wangjs, john.ronciak, cdlwangl,
linuxppc64-dev, netdev
Jian Jun He <hejianj@cn.ibm.com> wrote:
>
> I verified the problem on 2.6.12-rc5 with mm1 patch. The test server works
> find during the test procedure.
Great, thanks.
> So I will close this defect in bugme.
Well we should verify that the e100 patches which Linus merged a few hours
ago did contain the fix. So please test 2.6.12-rc5-git2, which should be
at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots in ~8 hours.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-27 8:21 ` Andrew Morton
@ 2005-05-27 10:12 ` Jian Jun He
0 siblings, 0 replies; 24+ messages in thread
From: Jian Jun He @ 2005-05-27 10:12 UTC (permalink / raw)
To: Andrew Morton
Cc: anton, Dang En Ren, ganesh.venkatesan, ganesh.venkatesan, herbert,
jesse.brandeburg, jgarzik, Jia Sen Wang, john.ronciak,
Lei CDL Wang, linuxppc64-dev, netdev
[-- Attachment #1.1: Type: text/plain, Size: 2735 bytes --]
I will verify it on 2.6.12-rc5-git2 when it is ready.
Best Regards!
Jian Jun He
CSDL, Beijing
Email: hejianj@cn.ibm.com
From: Andrew Morton <akpm@osdl.org>
Date: 2005-05-27 16:21
To: Jian Jun He/China/Contr/IBM@IBMCN
Cc: anton@samba.org, Dang En Ren/China/IBM@IBMCN, ganesh.venkatesan@gmail.com,
    ganesh.venkatesan@intel.com, herbert@gondor.apana.org.au,
    jesse.brandeburg@intel.com, jgarzik@pobox.com, Jia Sen Wang/China/IBM@IBMCN,
    john.ronciak@intel.com, Lei CDL Wang/China/Contr/IBM@IBMCN,
    linuxppc64-dev@lists.linuxppc.org.sgi.com, netdev@oss.sgi.com
Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr
    (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
Jian Jun He <hejianj@cn.ibm.com> wrote:
>
> I verified the problem on 2.6.12-rc5 with mm1 patch. The test server
> works fine during the test procedure.
Great, thanks.
> So I will close this defect in bugme.
Well we should verify that the e100 patches which Linus merged a few hours
ago did contain the fix. So please test 2.6.12-rc5-git2, which should be
at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots in ~8 hours.
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-26 20:41 Venkatesan, Ganesh
2005-05-26 21:34 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-26 20:41 UTC (permalink / raw)
To: Andrew Morton, Jian Jun He
Cc: anton, rende, ganesh.venkatesan, herbert, Brandeburg, Jesse,
jgarzik, wangjs, Ronciak, John, cdlwangl, linuxppc64-dev, netdev
Andrew:
I already responded to this analysis before. In any case, here it is:
Later versions of e100 (3.4.8 for instance) include a call to
netif_poll_disable in e100_down. This call is supposed to wait, and when it
returns we are guaranteed that e100_poll will no longer be called. In
addition, if there happens to be an interrupt, our call to
netif_rx_schedule() will not add our poll routine to the poll-list since
poll is disabled. So this race can never happen.
Ganesh.
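The guarantee described above can be modelled in userspace with a single "scheduled" bit that the interrupt path and the teardown path both try to claim. This is a sketch under stated assumptions, not the kernel implementation: the names `poll_dev`, `poll_schedule_prep`, `poll_complete`, `poll_disable` and `poll_enable` are hypothetical, and the busy-wait stands in for whatever waiting the real helper does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one bit guards the poll routine. It is set while
 * poll is scheduled or running, and also while poll is disabled. */
struct poll_dev {
    atomic_flag rx_sched;
};

/* Interrupt path: true only if the caller won the right to queue poll. */
bool poll_schedule_prep(struct poll_dev *d)
{
    return !atomic_flag_test_and_set(&d->rx_sched);
}

/* End of a poll run: release the bit so a later interrupt can schedule. */
void poll_complete(struct poll_dev *d)
{
    atomic_flag_clear(&d->rx_sched);
}

/* Teardown path: wait until no poll is scheduled or running, then keep
 * the bit held so that poll_schedule_prep() fails from now on. */
void poll_disable(struct poll_dev *d)
{
    while (atomic_flag_test_and_set(&d->rx_sched))
        ;  /* a real implementation would wait here rather than spin */
}

/* Re-open the gate once the device is ready to receive again. */
void poll_enable(struct poll_dev *d)
{
    atomic_flag_clear(&d->rx_sched);
}
```

Because the disable step may have to wait for a running poll to finish, the context it is called from matters, which is the point taken up in the replies.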
>-----Original Message-----
>From: Andrew Morton [mailto:akpm@osdl.org]
>Sent: Thursday, May 26, 2005 1:31 PM
>To: Jian Jun He
>Cc: Venkatesan, Ganesh; anton@samba.org; rende@cn.ibm.com;
>ganesh.venkatesan@gmail.com; herbert@gondor.apana.org.au; Brandeburg,
>Jesse; jgarzik@pobox.com; wangjs@cn.ibm.com; Ronciak, John;
>cdlwangl@cn.ibm.com; linuxppc64-dev@lists.linuxppc.org.sgi.com;
>netdev@oss.sgi.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while
>running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>Jian Jun He <hejianj@cn.ibm.com> wrote:
>>
>> 2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and
>install
>> it on
>> the test machine.
>> 3. Configure and run the rhr test via invoking redhat-ready.
>
>This is the problematic bit.
>
>- Please provide a full URL which can be used to obtain rhr.
> rhn.redhat.com is subscription-based.
>
>- Please describe the hardware setup - surely the test requires at
> least two machines. How are they configured?
>
>- Provide an exact transcript of the commands which are to be used. Is
> it just
>
> redhat-ready
>
> with no arguments?
>
>
>
>All that begin said, we already have a quite specific diagnosis via
>code inspection, from Herbert:
>
>
>Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> Andrew Morton <akpm@osdl.org> wrote:
>> >
>> > Might be a bug in the e100 driver, might not be.
>> >
>> > I assume this is the
>> >
>> > BUG_ON(skb->list != NULL);
>>
>> It certainly is a bug in e100.
>>
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>>
>> is racing against
>>
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>>
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>>
>> From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
>
>Do the e100 maintainers agree with this diagnosis? If so then more
>testing isn't required at this stage - the next step is to fix the above
>bug, no?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 20:41 Venkatesan, Ganesh
@ 2005-05-26 21:34 ` Herbert Xu
2005-05-26 23:08 ` Ganesh Venkatesan
0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2005-05-26 21:34 UTC (permalink / raw)
To: Venkatesan, Ganesh
Cc: Andrew Morton, Jian Jun He, anton, rende, ganesh.venkatesan,
Brandeburg, Jesse, jgarzik, wangjs, Ronciak, John, cdlwangl,
linuxppc64-dev, netdev
On Thu, May 26, 2005 at 01:41:53PM -0700, Venkatesan, Ganesh wrote:
>
> I already responded to this analysis before. In any case, here it is:
>
> Later versions of e100 (3.4.8 for instance) includes a call to
> netif_poll_disable in e100_down. This is supposed to wait and when it
As I said last time, this is broken since the code path in question
starts from tx_timeout which is called in softirq context. You'll
need to schedule a work struct at least.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 21:34 ` Herbert Xu
@ 2005-05-26 23:08 ` Ganesh Venkatesan
2005-05-27 0:11 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Ganesh Venkatesan @ 2005-05-26 23:08 UTC (permalink / raw)
To: Herbert Xu
Cc: Venkatesan, Ganesh, Andrew Morton, Jian Jun He, anton, rende,
Brandeburg, Jesse, jgarzik, wangjs, Ronciak, John, cdlwangl,
linuxppc64-dev, netdev
Herbert:
I do not get it. Bear with my ignorance.
e100_tx_timeout does not call e100_down directly.
e100_down is called from e100_tx_timeout_task, which is invoked as a result
of schedule_work. Are you saying that it would still not have the right
context to call netif_disable_poll()?
thanks,
ganesh.
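The pattern under discussion, a timeout handler that only queues work and leaves the heavy reset to a worker that runs later in a context where waiting is allowed, can be sketched in userspace as follows. This is a single-threaded model for illustration only. The types and functions (`work`, `work_queue`, `wq_schedule`, `wq_run_pending`, `nic`) are hypothetical and are not the kernel workqueue API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical deferred-work item. */
struct work {
    void (*fn)(void *);
    void *arg;
    bool pending;
};

/* Hypothetical fixed-size queue of work items. */
struct work_queue {
    struct work *item[MAX_WORK];
    size_t count;
};

/* Queue w unless it is already pending. Never blocks, so it is the only
 * thing the timeout handler needs to do. Returns true if newly queued. */
bool wq_schedule(struct work_queue *q, struct work *w)
{
    if (w->pending || q->count == MAX_WORK)
        return false;
    w->pending = true;
    q->item[q->count++] = w;
    return true;
}

/* Worker context: run everything queued; returns the number of items run. */
size_t wq_run_pending(struct work_queue *q)
{
    size_t ran = 0;
    for (size_t i = 0; i < q->count; i++) {
        struct work *w = q->item[i];
        w->pending = false;
        w->fn(w->arg);
        ran++;
    }
    q->count = 0;
    return ran;
}

/* Hypothetical device with a deferred reset task. */
struct nic {
    int resets;
    struct work tx_timeout_task;
    struct work_queue *wq;
};

/* Runs later in worker context, where a slow reset is acceptable. */
void nic_tx_timeout_task(void *arg)
{
    struct nic *n = arg;
    n->resets++;
}

/* Timeout handler: does no reset work itself, only schedules the task. */
void nic_tx_timeout(struct nic *n)
{
    wq_schedule(n->wq, &n->tx_timeout_task);
}
```

Two timeouts in quick succession queue the task once, and no reset happens until the worker runs.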
On 5/26/05, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, May 26, 2005 at 01:41:53PM -0700, Venkatesan, Ganesh wrote:
> >
> > I already responded to this analysis before. In any case, here it is:
> >
> > Later versions of e100 (3.4.8 for instance) includes a call to
> > netif_poll_disable in e100_down. This is supposed to wait and when it
>
> As I said last time, this is broken since the code path in question
> starts from tx_timeout which is called in softirq context. You'll
> need to schedule a work struct at least.
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-26 23:08 ` Ganesh Venkatesan
@ 2005-05-27 0:11 ` Herbert Xu
2005-05-27 0:20 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2005-05-27 0:11 UTC (permalink / raw)
To: Ganesh Venkatesan
Cc: Venkatesan, Ganesh, Andrew Morton, Jian Jun He, anton, rende,
Brandeburg, Jesse, jgarzik, wangjs, Ronciak, John, cdlwangl,
linuxppc64-dev, netdev
On Thu, May 26, 2005 at 04:08:17PM -0700, Ganesh Venkatesan wrote:
>
> I do not get it. Bear with my ignorance.
>
> e100_tx_timeout does not call e100_down.
>
> It is called from e100_tx_timeout_task which is invoked as a result of
> schedule_work. Are you saying that it would still not have the right
> context to call netif_disable_poll()?
Sorry, I was not aware that you've already added the sched work
in the driver. With that it should work correctly.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-27 0:11 ` Herbert Xu
@ 2005-05-27 0:20 ` Herbert Xu
0 siblings, 0 replies; 24+ messages in thread
From: Herbert Xu @ 2005-05-27 0:20 UTC (permalink / raw)
To: Ganesh Venkatesan
Cc: Venkatesan, Ganesh, Andrew Morton, Jian Jun He, anton, rende,
Brandeburg, Jesse, jgarzik, wangjs, Ronciak, John, cdlwangl,
linuxppc64-dev, netdev
On Fri, May 27, 2005 at 10:11:23AM +1000, herbert wrote:
>
> Sorry, I was not aware that you've already added the sched work
> in the driver. With that it should work correctly.
BTW, I was just checking out your sched work code. There needs
to be synchronisation between the tx_timeout task and the stop
method. Otherwise they can race against each other or worse
tx_timeout could keep running even after stop has finished.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-27 0:28 Venkatesan, Ganesh
2005-05-27 1:26 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-27 0:28 UTC (permalink / raw)
To: Herbert Xu, Ganesh Venkatesan
Cc: Andrew Morton, Jian Jun He, anton, rende, Brandeburg, Jesse,
jgarzik, wangjs, Ronciak, John, cdlwangl, linuxppc64-dev, netdev
Would adding flush_sheduled_tasks() to e100_down, do it?
Ganesh.
>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Thursday, May 26, 2005 5:21 PM
>To: Ganesh Venkatesan
>Cc: Venkatesan, Ganesh; Andrew Morton; Jian Jun He; anton@samba.org;
>rende@cn.ibm.com; Brandeburg, Jesse; jgarzik@pobox.com;
>wangjs@cn.ibm.com; Ronciak, John; cdlwangl@cn.ibm.com;
>linuxppc64-dev@lists.linuxppc.org.sgi.com; netdev@oss.sgi.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while
>running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>On Fri, May 27, 2005 at 10:11:23AM +1000, herbert wrote:
>>
>> Sorry, I was not aware that you've already added the sched work
>> in the driver. With that it should work correctly.
>
>BTW, I was just checking out your sched work code. There needs
>to be synchronisation between the tx_timeout task and the stop
>method. Otherwise they can race against each other or worse
>tx_timeout could keep running even after stop has finished.
>
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-27 0:28 Venkatesan, Ganesh
@ 2005-05-27 1:26 ` Herbert Xu
2005-05-27 1:44 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2005-05-27 1:26 UTC (permalink / raw)
To: Venkatesan, Ganesh
Cc: Ganesh Venkatesan, Andrew Morton, Jian Jun He, anton, rende,
Brandeburg, Jesse, jgarzik, wangjs, Ronciak, John, cdlwangl,
linuxppc64-dev, netdev
On Thu, May 26, 2005 at 05:28:55PM -0700, Venkatesan, Ganesh wrote:
> Would adding flush_sheduled_tasks() to e100_down, do it?
Even though it might be OK it's probably not a good idea to have
the task flush itself while it's running.
If you put it in the e100_close function then it should be fine.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
2005-05-27 1:26 ` Herbert Xu
@ 2005-05-27 1:44 ` Andrew Morton
0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2005-05-27 1:44 UTC (permalink / raw)
To: Herbert Xu
Cc: ganesh.venkatesan, ganesh.venkatesan, hejianj, anton, rende,
jesse.brandeburg, jgarzik, wangjs, john.ronciak, cdlwangl,
linuxppc64-dev, netdev
Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, May 26, 2005 at 05:28:55PM -0700, Venkatesan, Ganesh wrote:
> > Would adding flush_sheduled_tasks() to e100_down, do it?
>
> Even though it might be OK it's probably not a good idea to have
> the task flush itself while it's running.
flush_workqueue() handles that case.
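The ordering concern in this sub-thread is that work queued by the timeout path must not run after the stop method has returned. A small single-threaded userspace model can illustrate it. Every name here (`dev`, `dev_schedule_reset`, `dev_run_work`, `dev_close`, `dev_close_unsynchronized`) is hypothetical, and the explicit call to `dev_run_work()` inside `dev_close()` merely stands in for flushing a work queue. A real driver would also have to stop new work from being queued between the flush and the stop, which this model does not attempt to show.

```c
#include <stdbool.h>

/* Hypothetical device state; none of these names come from the driver. */
struct dev {
    bool up;            /* cleared by close */
    bool work_pending;  /* a deferred reset has been queued */
    int resets;         /* resets performed while the device was up */
    int late_runs;      /* deferred work that ran after close */
};

/* Timeout path: queue a reset, but only while the device is up. */
void dev_schedule_reset(struct dev *d)
{
    if (d->up)
        d->work_pending = true;
}

/* Worker: run the queued reset, if any, and record when it ran. */
void dev_run_work(struct dev *d)
{
    if (!d->work_pending)
        return;
    d->work_pending = false;
    if (d->up)
        d->resets++;
    else
        d->late_runs++;
}

/* Close with synchronisation: drain queued work, then stop the device. */
void dev_close(struct dev *d)
{
    dev_run_work(d);    /* stands in for flushing the work queue */
    d->up = false;
}

/* Close without synchronisation: queued work survives the close. */
void dev_close_unsynchronized(struct dev *d)
{
    d->up = false;
}
```

In the synchronised variant nothing is left to run once close returns. In the unsynchronised variant the worker later runs against a stopped device, which the model counts in `late_runs`.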
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2005-05-27 10:12 UTC | newest]
Thread overview: 24+ messages
2005-05-16 9:59 Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4 Andrew Morton
2005-05-16 10:41 ` Jian Jun He
2005-05-16 11:00 ` Herbert Xu
2005-05-16 17:43 ` Ganesh Venkatesan
2005-05-16 21:29 ` Herbert Xu
2005-05-16 21:58 ` Jeff Garzik
[not found] ` <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>
2005-05-26 7:38 ` Andrew Morton
2005-05-26 7:53 ` Jeff Garzik
2005-05-24 18:36 ` Ganesh Venkatesan
2005-05-25 3:21 ` Jian Jun He
-- strict thread matches above, loose matches on Subject: below --
2005-05-26 13:00 Venkatesan, Ganesh
2005-05-26 16:09 ` Jian Jun He
2005-05-26 20:31 ` Andrew Morton
2005-05-27 6:18 ` Jian Jun He
2005-05-27 8:21 ` Andrew Morton
2005-05-27 10:12 ` Jian Jun He
2005-05-26 20:41 Venkatesan, Ganesh
2005-05-26 21:34 ` Herbert Xu
2005-05-26 23:08 ` Ganesh Venkatesan
2005-05-27 0:11 ` Herbert Xu
2005-05-27 0:20 ` Herbert Xu
2005-05-27 0:28 Venkatesan, Ganesh
2005-05-27 1:26 ` Herbert Xu
2005-05-27 1:44 ` Andrew Morton