From: Tina Yang <tina.yang@oracle.com>
To: Matt Mackall <mpm@selenic.com>
Cc: netdev@vger.kernel.org
Subject: Re: netconsole problems
Date: Thu, 04 Oct 2007 18:22:06 -0700
Message-ID: <470591BE.9020704@oracle.com>
In-Reply-To: <20071005002754.GH19691@waste.org>

Matt Mackall wrote:
> On Thu, Oct 04, 2007 at 10:59:38AM -0700, Tina Yang wrote:
>   
>> We recently run into a few problems with netconsole
>> in at least 2.6.9, 2.6.18 and 2.6.23.  It either panicked
>> at netdevice.h:890 or hung the system, and sometimes depending
>> on which NIC we are using, the following console message,
>> e1000:
>>      "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
>> tg3:
>>      "NETDEV WATCHDOG: eth4: transmit timed out"
>>      "tg3: eth4: transmit timed out, resetting"
>>
>> The postmortem vmcore analysis indicated race between normal
>> network stack (net_rx_action) and netpoll, and disabling the
>> following code segment cures all the problems.
>>     
>
> That doesn't tell us much. Can you provide any more details? Like the
> call chains on both sides?
>   
I've filed a bug with details:
http://bugzilla.kernel.org/show_bug.cgi?id=9124

Basically, on 2.6.9, tg3_poll called from net_rx_action panicked
because __LINK_STATE_RX_SCHED was not set, and the net_device in the
vmcore showed that the device was not on any of the per_cpu poll_lists
at the time. On 2.6.18 it was the same crash, but the net_device showed
the device on one poll_list. The discrepancy between the two crashes
can be explained as follows:

  1) netpoll on cpu0 called dev->poll(), which removed the dev from
     the list and enabled the interrupt
  2) net_rx_action on cpu1 called dev->poll() again and panicked on
     removing the dev from the list
  3) an interrupt was delivered to, say, cpu2, and scheduled the
     device again

Because of the race, couldn't more than one cpu end up handling an
interrupt (hw or soft) from the same device at the same time?
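
To make the sequence above concrete, here is a minimal single-threaded
replay of steps 1-3 in plain C. It is only a sketch, not kernel code:
struct sim_dev, sim_irq(), sim_poll() and sim_replay() are hypothetical
names, the device is assumed to start out scheduled on cpu1's list, and
the real dev->poll() does far more than this model.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct net_device: only the two facts the
 * vmcore analysis looked at. */
struct sim_dev {
    bool rx_sched;  /* models __LINK_STATE_RX_SCHED */
    int  on_list;   /* cpu whose poll_list holds the device, or -1 */
};

/* Models an interrupt scheduling the device onto this cpu's poll_list. */
static void sim_irq(struct sim_dev *d, int cpu)
{
    if (!d->rx_sched) {
        d->rx_sched = true;
        d->on_list = cpu;
    }
}

/* Models dev->poll() completing: it expects the device to be scheduled,
 * then takes it off the list and re-enables interrupts. Returns false
 * at the point where the real code would hit its BUG check. */
static bool sim_poll(struct sim_dev *d)
{
    if (!d->rx_sched)
        return false;
    d->rx_sched = false;
    d->on_list = -1;
    return true;
}

/* Replays the three steps. late_irq selects whether the step 3
 * interrupt arrives before the state is inspected. Returns true when
 * the first poll succeeds and the second one fails. */
static bool sim_replay(bool late_irq, struct sim_dev *d)
{
    d->rx_sched = false;
    d->on_list = -1;
    sim_irq(d, 1);               /* assumed start: scheduled on cpu1 */

    bool first = sim_poll(d);    /* step 1: netpoll on cpu0 */
    bool second = sim_poll(d);   /* step 2: net_rx_action on cpu1 */
    if (late_irq)
        sim_irq(d, 2);           /* step 3: interrupt lands on cpu2 */

    return first && !second;
}
```

Calling sim_replay(false, &d) leaves the device on no list, which
matches the 2.6.9 vmcore, while sim_replay(true, &d) leaves it on cpu2's
list with the scheduled bit set, which matches the 2.6.18 one.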
     
>  
>   
>> netpoll.c
>>    178         /* Process pending work on NIC */
>>    179         np->dev->poll_controller(np->dev);
>>    180         if (np->dev->poll)
>>    181                 poll_napi(np);
>>     
>
> There are a couple different places this gets called, and for
> different reasons. If we have a -large- netconsole dump (like
> sysrq-t), we'll swallow up all of our SKB pool and may get stuck waiting
> for the NIC to send them (because it's waiting to hand packets back to
> the kernel and has no free buffers for outgoing packets).
>
>   
But won't the softirq process and free them? The problem is that the
poll_list is in a per_cpu structure, so it shouldn't be manipulated by
another cpu, in this case the one where netpoll is running.
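
As an illustration of that constraint, the sketch below uses a C11
atomic flag so that only one context at a time may run the poll step
for a device. This is an assumption for discussion, not how netpoll or
net_rx_action work in any of the kernels mentioned here; struct
poll_guard and both helpers are hypothetical names, and the extra
atomic operation is a cost on the poll path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device guard; initialise with ATOMIC_FLAG_INIT. */
struct poll_guard {
    atomic_flag busy;
};

/* Returns true if the caller now owns the poll step, false if some
 * other context already holds it and the caller must back off. */
static bool poll_guard_try_enter(struct poll_guard *g)
{
    return !atomic_flag_test_and_set_explicit(&g->busy,
                                              memory_order_acquire);
}

/* Releases ownership so the next caller can enter. */
static void poll_guard_exit(struct poll_guard *g)
{
    atomic_flag_clear_explicit(&g->busy, memory_order_release);
}
```

With a guard declared as struct poll_guard g = { ATOMIC_FLAG_INIT };
the first caller of poll_guard_try_enter(&g) gets true, and a second
caller gets false until poll_guard_exit(&g) has run.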
>> Big or small, there seems to be several race windows in the code,
>> and fixing them probably has consequence on overall system performance.
>>     
>
> Yes, the networking layer goes to great lengths to avoid having any
> locking in its fast paths and we don't want to undo any of that
> effort.
>
>   
>> Maybe this code should only run when the machine is single-threaded ?
>>     
>
> In the not-very-distant future, such machines will be extremely rare.
>
>   
I meant only in special cases, such as crash mode.


