From: "Kok, Auke" <auke-jan.h.kok@intel.com>
To: Bernd Schubert <bernd-schubert@gmx.de>
Cc: netdev@vger.kernel.org
Subject: Re: e1000: Detected Tx Unit Hang
Date: Tue, 19 Feb 2008 08:47:05 -0800
Message-ID: <47BB0809.7060509@intel.com>
In-Reply-To: <200802160126.43069.bernd-schubert@gmx.de>

Bernd Schubert wrote:
> On Saturday 16 February 2008, Kok, Auke wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> I can't login to one of our servers and just got this in an ipmi sol
>>> session:
>>>
>>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.209183]   Tx Queue             <0>
>>> [18169.209184]   TDH                  <e3>
>>> [18169.209185]   TDT                  <e3>
>>> [18169.209186]   next_to_use          <e3>
>>> [18169.209187]   next_to_clean        <bd>
>>> [18169.209188] buffer_info[next_to_clean]
>>> [18169.209189]   time_stamp           <10043e4d2>
>>> [18169.209190]   next_to_watch        <be>
>>> [18169.209191]   jiffies              <10043e6f6>
>>> [18169.209192]   next_to_watch.status <1>
>>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.256979]   Tx Queue             <0>
>>> [18169.256980]   TDH                  <de>
>>> [18169.256982]   TDT                  <de>
>>> [18169.256983]   next_to_use          <de>
>>> [18169.256984]   next_to_clean        <bc>
>>> [18169.256985] buffer_info[next_to_clean]
>>> [18169.256986]   time_stamp           <10043e511>
>>> [18169.256987]   next_to_watch        <bd>
>>> [18169.256988]   jiffies              <10043e701>
>>> [18169.256989]   next_to_watch.status <1>
>>>
>>> This is with 2.6.22.18. Is there any chance to recover the system? For
>>> some reasons I would prefer not to reboot now.
>> If that's all you have then it was a false alarm. There should be a 'netdev
>> timeout - link reset' following those messages. Can you send some more
>> context on those messages?
> 
> All I presently know is that there are 20 servers and login doesn't work any 
> more - sysrq+t does show me it hangs in fuse, which is accessing the 
> underlying nfs (we are using unionfs-fuse). While I checked the sysrq-t 
> output suddenly these e1000 messages appeared.
> Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which 
> 2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone  
> mis-configured the switch/network environment today. 
> Hmm, now that I think about the last part, there already had been other 
> networking problems today, which were supposed to be fixed several hours ago. 
> Seems they didn't fix it properly.
> 
>> in real tx hang cases, the hardware is reset within 2 seconds, and
>> everything continues as normal.
> 
> Thanks, this gives me hope I don't need to reboot the servers (reboot would 
> mean I would need to start 60 md-raid rebuilds...).

My first thought after reading this e-mail is that the tx-hang message is just a
symptom of your system not responding or spinning on a lock all the time. These TX
hang issues normally do not interfere with system operation at all, and unless
you have continuous TX resets you would be able to log on perfectly fine.
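
As a rough illustration of how to read the dump quoted above, the sketch below
computes how many descriptors were still waiting to be cleaned and how long the
oldest one had been pending. The values are copied from the log; the ring size
(256) and the HZ value passed to stalled_seconds() are assumptions, since
neither appears in the report, and the helper names are made up for this
example rather than taken from the driver.

```python
# Values copied from the "Detected Tx Unit Hang" dump quoted above (hex).
DUMPS = {
    "eth0": {"next_to_use": 0xE3, "next_to_clean": 0xBD,
             "time_stamp": 0x10043E4D2, "jiffies": 0x10043E6F6},
    "eth2": {"next_to_use": 0xDE, "next_to_clean": 0xBC,
             "time_stamp": 0x10043E511, "jiffies": 0x10043E701},
}

RING_SIZE = 256  # assumption: the ring size is not shown in the dump


def pending_descriptors(dump, ring_size=RING_SIZE):
    """Descriptors queued by the driver but not yet cleaned (handles wrap)."""
    return (dump["next_to_use"] - dump["next_to_clean"]) % ring_size


def stalled_jiffies(dump):
    """Jiffies elapsed since the oldest uncleaned buffer was time-stamped."""
    return dump["jiffies"] - dump["time_stamp"]


def stalled_seconds(dump, hz):
    """Same interval in seconds; hz is the kernel's CONFIG_HZ (assumed)."""
    return stalled_jiffies(dump) / hz


if __name__ == "__main__":
    for name, dump in DUMPS.items():
        print(name,
              "pending:", pending_descriptors(dump),
              "stalled jiffies:", stalled_jiffies(dump))
```

For eth0 this gives 38 pending descriptors stalled for 548 jiffies, and for
eth2 34 descriptors stalled for 496 jiffies. How long that is in wall-clock
time depends on HZ, which the report does not state: 548 jiffies would be
roughly 2.2 seconds at HZ=250 but only about 0.55 seconds at HZ=1000.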

I think you might have hit another kernel bug here, perhaps even one related to
unionfs/fuse; that certainly looks plausible from your problem description.

Looking at the changelog for 2.6.22.16->2.6.22.18 I can't see anything relevant
(see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog),
and there are definitely no e1000 driver changes in that range anyway.

I don't suppose you can do a git-bisect? That would certainly help. I don't think
we can rule anything out just yet.

At least try reverting some of your systems to the previous kernel version and see
if the problem goes away.

Auke
