netdev.vger.kernel.org archive mirror
* Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-16  9:59 Andrew Morton
  2005-05-16 10:41 ` Jian Jun He
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Andrew Morton @ 2005-05-16  9:59 UTC (permalink / raw)
  To: netdev; +Cc: hejianj, linuxppc64-dev, Anton Blanchard


Might be a bug in the e100 driver, might not be.

I assume this is the

	BUG_ON(skb->list != NULL);

in __kfree_skb(), although the line number is off-by-one, and the
.__kfree_skb+0x188/0x240 would tend to contradict that.  Anton, can you
help work out where we went splat please?

tx timeouts are fairly rare events, so this might not be a recently-added
bug.

Do we know if it is repeatable?



Begin forwarded message:

Date: Mon, 16 May 2005 02:44:04 -0700
From: bugme-daemon@osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4


http://bugme.osdl.org/show_bug.cgi?id=4628

           Summary: Test server hang while running rhr (network) test on
                    RHEL4 with kernel 2.6.12-rc1-mm4
    Kernel Version: 2.6.12-rc1 with mm4 patch
            Status: NEW
          Severity: normal
             Owner: anton@samba.org
         Submitter: hejianj@cn.ibm.com
                CC: hanwenb@cn.ibm.com,mridge@us.ibm.com,rende@cn.ibm.com,
                    wangjs@cn.ibm.com


Distribution:
RHEL4 with kernel 2.6.12-rc1-mm4

Hardware Environment:
IBM OpenPower( CHRP IBM,9124-720 )

Software Environment:
RHEL4
RHR: rhr2-rhel4-1.0-14a.noarch.rpm

Problem Description:
The test server hangs while running the rhr (network) test on RHEL4 with
kernel 2.6.12-rc1-mm4.

Steps to reproduce:
1. Download kernel 2.6.12-rc1 and 2.6.12-rc1-mm4 patch from kernel.org, then 
build the kernel on OpenPower 720
2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install it on 
the test machine.
3. Configure and run the rhr test by invoking redhat-ready.

Additional information:
Here is the backtrace from xmon.

3:mon> e
cpu 0x3: Vector: 700 (Program Check) at [c00000000ffe7920]
    pc: c00000000029632c: .__kfree_skb+0x188/0x240
    lr: c000000000296328: .__kfree_skb+0x184/0x240
    sp: c00000000ffe7ba0
   msr: 8000000000029032
  current = 0xc000000107f94040
  paca    = 0xc000000000431c00
    pid   = 0, comm = swapper
kernel BUG in __kfree_skb at net/core/skbuff.c:282!

3:mon> t
[c00000000ffe7c40] d0000000000ebac4 .e100_rx_clean_list+0xa0/0x144 [e100]
[c00000000ffe7ce0] d0000000000ed6dc .e100_tx_timeout+0x7c/0xb0 [e100]
[c00000000ffe7d70] c0000000002b87bc .dev_watchdog+0xc8/0x154
[c00000000ffe7e00] c00000000006d6b4 .run_timer_softirq+0x180/0x298
[c00000000ffe7ed0] c0000000000667d8 .__do_softirq+0xdc/0x1b8
[c00000000ffe7f90] c000000000014bf0 .call_do_softirq+0x14/0x24
[c000000086b43860] c0000000000102c4 .do_softirq+0x98/0xac
[c000000086b438f0] c0000000000669cc .irq_exit+0x70/0x8c
[c000000086b43970] c000000000011fb8 .timer_interrupt+0x398/0x47c
[c000000086b43a90] c00000000000a2b4 decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c000000000010554 .dedicated_idle+0x114/0x280
[c000000086b43e80] c0000000000108c8 .cpu_idle+0x3c/0x54
[c000000086b43f00] c00000000003cc8c .start_secondary+0x108/0x148
[c000000086b43f90] c00000000000bd84 .enable_64b_mode+0x0/0x28

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-26 13:00 Venkatesan, Ganesh
  2005-05-26 16:09 ` Jian Jun He
  0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-26 13:00 UTC (permalink / raw)
  To: Andrew Morton, Jian Jun He, Ronciak, John, Brandeburg, Jesse
  Cc: ganesh.venkatesan, anton, herbert, jgarzik, linuxppc64-dev,
	netdev, rende, wangjs, cdlwangl

Jian:

We need more information on the test you run to get this hang. We have
not seen a hang similar to the one you describe. We have a p-series
machine in our lab and all we need is details on the test to run.

Thanks,
Ganesh.

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@osdl.org]
>Sent: Thursday, May 26, 2005 12:38 AM
>To: Jian Jun He; Ronciak, John; Venkatesan, Ganesh; Brandeburg, Jesse
>Cc: ganesh.venkatesan@gmail.com; anton@samba.org;
>herbert@gondor.apana.org.au; jgarzik@pobox.com;
>linuxppc64-dev@lists.linuxppc.org.sgi.com; netdev@oss.sgi.com;
>rende@cn.ibm.com; wangjs@cn.ibm.com; cdlwangl@cn.ibm.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running
>rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>Jian Jun He <hejianj@cn.ibm.com> wrote:
>>
>> I downloaded e100-3.4.8 and installed it on the test machines (both
>> client and server). But the server still hangs while running the rhr
>> (network) test. :(
>
>e100 is one of those drivers which we'd rather like to have working
>properly.
>
>Can we please confirm that a) this bug is not fixed in 2.6.12-rc5 and
>b) nobody has seen a patch which fixes it?
>
>
>
>For reference:
>
>Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> Andrew Morton <akpm@osdl.org> wrote:
>> >
>> > Might be a bug in the e100 driver, might not be.
>> >
>> > I assume this is the
>> >
>> >        BUG_ON(skb->list != NULL);
>>
>> It certainly is a bug in e100.
>>
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>>
>> is racing against
>>
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>>
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>>
>> From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
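
The race described in the quoted diagnosis can be replayed deterministically in a small userspace model. This is only a sketch: the `model_*` names are hypothetical and are not e100 or kernel symbols, and it assumes the ring slot still points at the skb while the poll path is processing it, which is what the diagnosis implies but this page does not show in code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

/* Stand-in for struct sk_buff: only the fields the check needs. */
struct model_skb {
    void *list;   /* non-NULL while the buffer sits on some queue */
    bool  freed;
};

/* Stand-in for one RX ring slot. */
struct model_rx {
    struct model_skb *skb;   /* NULL once the driver has let go of it */
};

/* Mirrors the quoted BUG_ON(skb->list != NULL): refuse to free a queued skb. */
static int model_kfree_skb(struct model_skb *skb)
{
    if (skb->list != NULL)
        return -1;           /* the kernel would BUG() here */
    skb->freed = true;
    return 0;
}

/* Poll path, first half: the stack queues the skb somewhere. */
static void model_rx_indicate_begin(struct model_rx *rx, void *queue)
{
    rx->skb->list = queue;
}

/* Poll path, second half: only now does the ring drop its pointer. */
static void model_rx_indicate_end(struct model_rx *rx)
{
    rx->skb = NULL;
}

/* Teardown path: free whatever the ring still points at.
 * Returns how many frees tripped the list check. */
static int model_rx_clean_list(struct model_rx *ring, size_t n)
{
    int bugs = 0;

    for (size_t i = 0; i < n; i++) {
        if (ring[i].skb == NULL)
            continue;
        if (model_kfree_skb(ring[i].skb) != 0)
            bugs++;
        ring[i].skb = NULL;
    }
    return bugs;
}

/* Replays one interleaving; teardown either cuts into the poll path
 * between its two halves or runs after it has finished. */
static int model_replay(bool teardown_interrupts_poll)
{
    struct model_skb skbs[RING_SIZE] = {0};
    struct model_rx ring[RING_SIZE];
    int queue_token;   /* only its address is used, as a fake queue */

    for (size_t i = 0; i < RING_SIZE; i++)
        ring[i].skb = &skbs[i];

    model_rx_indicate_begin(&ring[0], &queue_token);
    if (!teardown_interrupts_poll)
        model_rx_indicate_end(&ring[0]);

    return model_rx_clean_list(ring, RING_SIZE);
}
```

Calling `model_replay(true)` makes the teardown walk find a buffer that is already queued, which is the condition the BUG_ON guards against; `model_replay(false)` lets the poll path finish first and every free succeeds.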

* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-26 20:41 Venkatesan, Ganesh
  2005-05-26 21:34 ` Herbert Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-26 20:41 UTC (permalink / raw)
  To: Andrew Morton, Jian Jun He
  Cc: anton, rende, ganesh.venkatesan, herbert, Brandeburg, Jesse,
	jgarzik, wangjs, Ronciak, John, cdlwangl, linuxppc64-dev, netdev

Andrew:

I already responded to this analysis before. In any case, here it is:

Later versions of e100 (3.4.8 for instance) include a call to
netif_poll_disable in e100_down. This is supposed to wait and when it
returns we are guaranteed that e100_poll will no longer be called. In
addition, if there happens to be an interrupt, our call to
netif_rx_schedule() will not add our poll routine to the poll-list since
poll is disabled. So this race can never happen.

Ganesh.
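
The exclusion argued for above can be sketched as a single "scheduled" flag that both the interrupt path and the disable path try to take. This is a userspace model under that assumption, not the kernel's code: the `model_*` names are hypothetical, and the real `netif_poll_disable` sleeps and retries where this model exposes a single attempt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One flag stands for "poll is scheduled or has been disabled". */
struct model_dev {
    atomic_flag rx_sched;
};

static void model_dev_init(struct model_dev *dev)
{
    atomic_flag_clear(&dev->rx_sched);
}

/* Interrupt path: true only if the caller won the right to put the
 * poll routine on the poll list. */
static bool model_rx_schedule_prep(struct model_dev *dev)
{
    return !atomic_flag_test_and_set(&dev->rx_sched);
}

/* Poll routine finished its work and took itself off the list. */
static void model_rx_complete(struct model_dev *dev)
{
    atomic_flag_clear(&dev->rx_sched);
}

/* One attempt at disabling poll. A real implementation would sleep and
 * retry until this succeeds, so it cannot be used where sleeping is
 * not allowed. */
static bool model_poll_disable_try(struct model_dev *dev)
{
    return !atomic_flag_test_and_set(&dev->rx_sched);
}

/* Re-allow scheduling once teardown is done. */
static void model_poll_enable(struct model_dev *dev)
{
    atomic_flag_clear(&dev->rx_sched);
}
```

While a poll is in flight, `model_poll_disable_try` fails and the caller has to wait; once it succeeds, `model_rx_schedule_prep` fails for any later interrupt until `model_poll_enable` is called.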

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@osdl.org]
>Sent: Thursday, May 26, 2005 1:31 PM
>To: Jian Jun He
>Cc: Venkatesan, Ganesh; anton@samba.org; rende@cn.ibm.com;
>ganesh.venkatesan@gmail.com; herbert@gondor.apana.org.au; Brandeburg,
>Jesse; jgarzik@pobox.com; wangjs@cn.ibm.com; Ronciak, John;
>cdlwangl@cn.ibm.com; linuxppc64-dev@lists.linuxppc.org.sgi.com;
>netdev@oss.sgi.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running
>rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>Jian Jun He <hejianj@cn.ibm.com> wrote:
>>
>>  2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install
>>  it on the test machine.
>>  3. Configure and run the rhr test via invoking redhat-ready.
>
>This is the problematic bit.
>
>- Please provide a full URL which can be used to obtain rhr.
>  rhn.redhat.com is subscription-based.
>
>- Please describe the hardware setup - surely the test requires at least
>  two machines.  How are they configured?
>
>- Provide an exact transcript of the commands which are to be used.  Is
>  it just
>
>	redhat-ready
>
>  with no arguments?
>
>
>
>All that being said, we already have a quite specific diagnosis via code
>inspection, from Herbert:
>
>
>Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> Andrew Morton <akpm@osdl.org> wrote:
>> >
>> > Might be a bug in the e100 driver, might not be.
>> >
>> > I assume this is the
>> >
>> >        BUG_ON(skb->list != NULL);
>>
>> It certainly is a bug in e100.
>>
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>>
>> is racing against
>>
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>>
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>>
>> From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
>
>Do the e100 maintainers agree with this diagnosis?  If so then more testing
>isn't required at this stage - the next step is to fix the above bug, no?

* RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
@ 2005-05-27  0:28 Venkatesan, Ganesh
  2005-05-27  1:26 ` Herbert Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Venkatesan, Ganesh @ 2005-05-27  0:28 UTC (permalink / raw)
  To: Herbert Xu, Ganesh Venkatesan
  Cc: Andrew Morton, Jian Jun He, anton, rende, Brandeburg, Jesse,
	jgarzik, wangjs, Ronciak, John, cdlwangl, linuxppc64-dev, netdev

Would adding flush_scheduled_tasks() to e100_down do it?

Ganesh.

>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Thursday, May 26, 2005 5:21 PM
>To: Ganesh Venkatesan
>Cc: Venkatesan, Ganesh; Andrew Morton; Jian Jun He; anton@samba.org;
>rende@cn.ibm.com; Brandeburg, Jesse; jgarzik@pobox.com; wangjs@cn.ibm.com;
>Ronciak, John; cdlwangl@cn.ibm.com;
>linuxppc64-dev@lists.linuxppc.org.sgi.com; netdev@oss.sgi.com
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running
>rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>On Fri, May 27, 2005 at 10:11:23AM +1000, herbert wrote:
>>
>> Sorry, I was not aware that you've already added the sched work
>> in the driver.  With that it should work correctly.
>
>BTW, I was just checking out your sched work code.  There needs
>to be synchronisation between the tx_timeout task and the stop
>method.  Otherwise they can race against each other or worse
>tx_timeout could keep running even after stop has finished.
>
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Thread overview: 24+ messages
2005-05-16  9:59 Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4 Andrew Morton
2005-05-16 10:41 ` Jian Jun He
2005-05-16 11:00 ` Herbert Xu
2005-05-16 17:43   ` Ganesh Venkatesan
2005-05-16 21:29     ` Herbert Xu
2005-05-16 21:58       ` Jeff Garzik
     [not found]     ` <OFB1F7DBFD.6A6514AD-ON48257004.0038A154-48257004.0038E08A@cn.ibm.com>
2005-05-26  7:38       ` Andrew Morton
2005-05-26  7:53         ` Jeff Garzik
2005-05-24 18:36 ` Ganesh Venkatesan
2005-05-25  3:21   ` Jian Jun He
  -- strict thread matches above, loose matches on Subject: below --
2005-05-26 13:00 Venkatesan, Ganesh
2005-05-26 16:09 ` Jian Jun He
2005-05-26 20:31   ` Andrew Morton
2005-05-27  6:18     ` Jian Jun He
2005-05-27  8:21       ` Andrew Morton
2005-05-27 10:12         ` Jian Jun He
2005-05-26 20:41 Venkatesan, Ganesh
2005-05-26 21:34 ` Herbert Xu
2005-05-26 23:08   ` Ganesh Venkatesan
2005-05-27  0:11     ` Herbert Xu
2005-05-27  0:20       ` Herbert Xu
2005-05-27  0:28 Venkatesan, Ganesh
2005-05-27  1:26 ` Herbert Xu
2005-05-27  1:44   ` Andrew Morton
