From: Tejun Heo <tj@kernel.org>
To: "David S. Miller" <davem@davemloft.net>,
	lkml <linux-kernel@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Cc: "Fehrmann, Henning" <henning.fehrmann@aei.mpg.de>,
	Carsten Aulbert <carsten.aulbert@aei.mpg.de>
Subject: oops in tcp_xmit_retransmit_queue() w/ v2.6.32.15
Date: Thu, 08 Jul 2010 10:22:02 +0200
Message-ID: <4C358AAA.9080400@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1970 bytes --]

Hello,

We've been seeing an oops in tcp_xmit_retransmit_queue() w/ 2.6.32.15.
Please see the attached photo.  This is happening on an HPC cluster
and, interestingly, is triggered by one particular job.  How long it
takes isn't clear yet (at least more than a day), but when it happens,
it hits a lot of machines within a relatively short time.

With a bit of disassembly, I've found that the oops happens during
tcp_for_write_queue_from() because skb->next is NULL.

 void tcp_xmit_retransmit_queue(struct sock *sk)
 {
 ...
	if (tp->retransmit_skb_hint) {
		skb = tp->retransmit_skb_hint;
		last_lost = TCP_SKB_CB(skb)->end_seq;
		if (after(last_lost, tp->retransmit_high))
			last_lost = tp->retransmit_high;
	} else {
		skb = tcp_write_queue_head(sk);
		last_lost = tp->snd_una;
	}

 =>	tcp_for_write_queue_from(skb, sk) {
		 __u8 sacked = TCP_SKB_CB(skb)->sacked;

		 if (skb == tcp_send_head(sk))
			 break;
		 /* we could do better than to assign each time */
		 if (hole == NULL)

This can happen for one of the following reasons:

1. tp->retransmit_skb_hint is NULL and tcp_write_queue_head() returns
   NULL too, i.e. tcp_xmit_retransmit_queue() is called on an empty
   write queue for some reason.

2. tp->retransmit_skb_hint is pointing to an skb which is not on the
   write queue, i.e. somebody forgot to update the hint while removing
   the skb from the write queue.

3. The hint is pointing to a skb on the list but the list itself is
   corrupt.

I added some debug code and the crash is happening when
tp->retransmit_skb_hint is not NULL but tp->retransmit_skb_hint->next
is NULL.  So, #1 is out; unfortunately, I didn't have debug code in
place to discern between #2 and #3.
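
To make the distinction between #2 and #3 concrete, here is a minimal
user-space sketch of the kind of check that could tell them apart.  It
is not kernel code: struct node, queue_*() and classify_hint() are
hypothetical stand-ins for struct sk_buff and the write queue, and it
assumes that unlinking clears the links of the removed entry (as
__skb_unlink() does), which is what would leave a stale hint with a
NULL ->next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the list links. */
struct node {
	struct node *next;
	struct node *prev;
};

enum hint_state {
	HINT_OK,		/* queue consistent, hint NULL or on the queue */
	HINT_EMPTY_QUEUE,	/* case 1: no hint and nothing queued */
	HINT_NOT_ON_QUEUE,	/* case 2: queue consistent, hint not on it */
	QUEUE_CORRUPT,		/* case 3: broken link found while walking */
};

/* Circular list with a sentinel head, like struct sk_buff_head. */
static void queue_init(struct node *head)
{
	head->next = head->prev = head;
}

static void queue_add_tail(struct node *head, struct node *n)
{
	assert(head != NULL && n != NULL);
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Assumption: unlinking clears the removed entry's links. */
static void queue_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/*
 * Walk the queue from the head, verifying the back link of every
 * entry, and report whether the hint was seen.  max_len bounds the
 * walk so that a corrupt list cannot loop forever.
 */
static enum hint_state classify_hint(struct node *head, struct node *hint,
				     size_t max_len)
{
	struct node *prev = head;
	struct node *n;
	size_t walked = 0;
	int found = 0;

	if (hint == NULL && head->next == head)
		return HINT_EMPTY_QUEUE;

	for (n = head->next; n != head; n = n->next) {
		if (n == NULL || n->prev != prev || ++walked > max_len)
			return QUEUE_CORRUPT;
		if (n == hint)
			found = 1;
		prev = n;
	}
	if (head->prev != prev)
		return QUEUE_CORRUPT;

	if (hint != NULL && !found)
		return HINT_NOT_ON_QUEUE;
	return HINT_OK;
}
```

In this model, unlinking an entry without updating the hint yields
HINT_NOT_ON_QUEUE with hint->next == NULL, which matches the symptom
seen so far, while a NULL link on an entry that is still queued yields
QUEUE_CORRUPT.  A full walk like this is too expensive to leave in a
hot path, so it would only make sense as temporary debug code.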

Does anything ring a bell?  This is a production system and debugging
affects quite a number of people.  I can put debug code in to discern
between #2 and #3 but I'm basically shooting in the dark and it would
be great if someone has a better idea.

Thanks.

-- 
tejun

[-- Attachment #2: oops.jpg --]
[-- Type: image/jpeg, Size: 92585 bytes --]
