netdev.vger.kernel.org archive mirror
From: Nivedita Singhvi <niv@us.ibm.com>
To: David Miller <davem@redhat.com>
Cc: netdev <netdev@oss.sgi.com>, Elizabeth Kon <bkon@us.ibm.com>,
	jgrimm@us.ibm.com, jgarvey@us.ibm.com
Subject: TCP hang in timewait processing
Date: Sat, 27 Mar 2004 15:27:51 -0800
Message-ID: <40660DF7.9090806@us.ibm.com>

Dave,

We're investigating a hang in TCP that a clustered node
is running into, and I'd appreciate any help whatsoever
on this...

The system is running SLES8 plus patches (including the
latest timewait fixes), but as far as I can tell it is
essentially equivalent to a mainline 2.4 kernel.
The problem is reproducible; it takes anywhere from
several hours to a day to trigger.

The hang is caused by the while loop in tcp_twkill()
spinning forever:

while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
	tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
	if (tw->next_death)
		tw->next_death->pprev_death = tw->pprev_death;
	tw->pprev_death = NULL;
	spin_unlock(&tw_death_lock);

	tcp_timewait_kill(tw);
	tcp_tw_put(tw);

	killed++;

	spin_lock(&tw_death_lock);
}
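To see why a single self-referencing entry is enough to hang this
loop, here is a small userspace model of the unlink steps above. It
is a sketch, not kernel code: struct tw and drain_slot() are
hypothetical names, locking and the tcp_timewait_kill()/tcp_tw_put()
calls are left out, and the max_iter cap exists only so the model
itself cannot hang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct tcp_tw_bucket:
 * only the death-row linkage fields are modelled. */
struct tw {
	struct tw *next_death;
	struct tw **pprev_death;
};

/*
 * Mirrors the unlink steps of the tcp_twkill() loop for one slot.
 * Locking, tcp_timewait_kill() and tcp_tw_put() are left out.
 * max_iter is an artificial cap (not in the kernel): returns the
 * number of entries unlinked, or -1 if the slot was still not empty
 * after max_iter entries.
 */
int drain_slot(struct tw **slot, int max_iter)
{
	struct tw *tw;
	int killed = 0;

	assert(slot != NULL);
	while ((tw = *slot) != NULL) {
		if (killed == max_iter)
			return -1;	/* slot never became empty */
		*slot = tw->next_death;
		if (tw->next_death)
			tw->next_death->pprev_death = tw->pprev_death;
		tw->pprev_death = NULL;
		killed++;
	}
	return killed;
}
```

With a healthy three-entry slot, drain_slot(&slot, 100) unlinks all
three entries and leaves the slot NULL. If the head's next_death
points back at the head, `*slot = tw->next_death` just stores the
same node back into the slot, so the real loop (which has no cap)
never sees NULL.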

Thanks to some neat detective work by Beth Kon and Joe
Garvey, the culprit appears to be a tw node pointing to
itself. See the attached note from Beth at the end.

This can happen if a tcp_tw_bucket is freed prematurely,
before being taken off the death list. If the node is at
the head of the list, is freed, and is later reallocated
in tcp_time_wait() and reinserted into the list (now
linked to a new sk), it will end up pointing at itself.
[There might be other ways to end up in this state, but
I'm not seeing them.]

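We enter tcp_tw_schedule() (which puts the bucket onto the
death list) with pprev_death already cleared by tcp_time_wait(),
so the if (tw->pprev_death) branch is skipped and the bucket is
linked in as if it were not on the list at all: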

tcp_tw_schedule() {

	if (tw->pprev_death) {
		...
	} else
		atomic_inc(&tw->refcnt);

	...

	if((tw->next_death = *tpp) != NULL)
		(*tpp)->pprev_death = &tw->next_death;
	*tpp = tw;
	tw->pprev_death = tpp;
	...
}

If tw is still at the head of the list (*tpp == tw), then
we have just created a loop: tw->next_death points at tw.
If tw sits elsewhere on the death list, we could potentially
end up with Y-shaped chains and other garbage...
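The head-of-list case can be reproduced with the same kind of
userspace model. This sketch (hypothetical struct tw and
schedule_unqueued()) copies the four linking lines quoted above for
the pprev_death == NULL path; the elided parts of tcp_tw_schedule(),
the refcount increment and the slot selection are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tcp_tw_bucket with only the
 * death-row linkage fields. */
struct tw {
	struct tw *next_death;
	struct tw **pprev_death;
};

/*
 * Sketch of the linking done by tcp_tw_schedule() as quoted above,
 * for the case where pprev_death is NULL on entry. tpp is the
 * already-chosen death-row slot; the refcount increment is omitted.
 */
void schedule_unqueued(struct tw **tpp, struct tw *tw)
{
	assert(tw->pprev_death == NULL);

	if ((tw->next_death = *tpp) != NULL)
		(*tpp)->pprev_death = &tw->next_death;
	*tpp = tw;
	tw->pprev_death = tpp;
}
```

Linking x and then y into an empty slot gives the expected y -> x
chain. Clearing y->pprev_death while y is still the head (standing
in for the freed-and-reallocated bucket) and scheduling it again
leaves y->next_death == y, and x is no longer reachable from the slot.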

Does that seem correct, or am I barking up the wrong
tree here?

Just checking for a node pointing to itself at this point
is rather late: the damage has already been done, because
we have lost the original linkages from the tcp_tw_bucket
to the other structures, which we also need to unlink to
avoid making a further mess of the hash table and death
list pointers.

So the question is, is there any path that leads to
us erroneously freeing a tcp_tw_bucket without taking it
off the death list?

I've been looking at the tw refcount manipulation
and am trying to identify any gratuitous tcp_tw_put()
calls, but haven't managed to isolate one yet.
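One possible way to narrow down a gratuitous put, offered only as a
debugging idea: make the final put complain when the bucket is still
linked on a death-row slot. Below is a userspace model of that check,
assuming that a bucket being released legitimately has already had
pprev_death cleared (as tcp_twkill() does in the loop quoted above).
struct tw, tw_put() and the put_result values are hypothetical, and
the check is an addition for illustration, not existing kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a time-wait bucket's lifetime: refcnt is a
 * plain int in place of atomic_t, and "freed" stands in for returning
 * the memory to the slab cache. */
struct tw {
	struct tw **pprev_death;	/* non-NULL while on a death-row slot */
	int refcnt;
	int freed;
};

enum put_result { PUT_HELD, PUT_FREED, PUT_FREED_WHILE_QUEUED };

/* Sketch of a tcp_tw_put()-style release with a debugging check:
 * reports when the last reference goes away while the bucket is
 * still linked on the death row. */
enum put_result tw_put(struct tw *tw)
{
	assert(!tw->freed && tw->refcnt > 0);
	if (--tw->refcnt > 0)
		return PUT_HELD;
	tw->freed = 1;
	return tw->pprev_death != NULL ? PUT_FREED_WHILE_QUEUED : PUT_FREED;
}
```

With two references held, the first put returns PUT_HELD. If the
bucket is still linked when the count reaches zero, tw_put() reports
PUT_FREED_WHILE_QUEUED, which is the state that would let the memory
be reallocated while the death row still points at it.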

Any ideas or pointers would be very much appreciated!

thanks,
Nivedita

---
From Beth Kon:
I see what is going on here... not sure how it got to this state.

Joe Garvey did excellent work gathering kdb info (and
graciously taught me a lot as he went along) and confirming
that the while loop in tcp_twkill never terminates.

Here is the code in tcp_twkill that is in an infinite loop:

while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
	tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
	if (tw->next_death)
		tw->next_death->pprev_death = tw->pprev_death;
	tw->pprev_death = NULL;
	spin_unlock(&tw_death_lock);

	tcp_timewait_kill(tw);
	tcp_tw_put(tw);

	killed++;

	spin_lock(&tw_death_lock);
}

Using the data Joe gathered, here is what I see...

[0]kdb> rd
eax = 0x00000001 ebx = 0xc50a7840 ecx = 0xdf615478 edx = 0x00000001
esi = 0x061c3332 edi = 0x00000000 esp = 0xc03e7f10 eip = 0xc02be950
ebp = 0x00000000 xss = 0xc02e0018 xcs = 0x00000010 eflags = 0x00000282
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc03e7edc

In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx.

The contents of the tw struct (annotated by me) are:

[0]kdb> mds %ebx tw
0xc50a7840 260f3c09   .<.&    daddr
0xc50a7844 6d0f3c09   .<.m    rcv_saddr
0xc50a7848 8200a3e5   ....    dport, num
0xc50a784c 00000000   ....    bound_dev_if
0xc50a7850 00000000   ....    next
0xc50a7854 00000000   ....    pprev
0xc50a7858 00000000   ....    bindnext
0xc50a785c c26dcbc8   ..m.    bind_pprev
[0]kdb>
0xc50a7860 00820506   ....    state, substate, sport
0xc50a7864 00000002   ....    family
0xc50a7868 f9e3ccd0   ....    refcnt
0xc50a786c 00002a8f   .*..    hashent
0xc50a7870 00001770   p...    timeout
0xc50a7874 d4ad3cee   .<..    rcv_next
0xc50a7878 878fe09e   ....    send_next
0xc50a787c 000016d0   ....    rcv_wnd
[0]kdb>
0xc50a7880 00000000   ....    ts_recent
0xc50a7884 00000000   ....    ts_recent_stamp
0xc50a7888 000353c1   .S..    ttd
0xc50a788c 00000000   ....    tb
0xc50a7890 c50a7840   @x..    next_death
0xc50a7894 00000000   ....    pprev_death
0xc50a7898 00000000   ....
0xc50a789c 00000000   ....

The above shows that next_death in the structure == ebx, which means
this element of the linked list points to itself, so the loop never
terminates. Assuming this is the last element on the linked list,
next_death should be NULL.
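The same reading can be checked mechanically. The sketch below
transcribes the 24 words of the dump above into an array based at
ebx (0xc50a7840) and looks up next_death at offset 0x50 (0xc50a7890
minus 0xc50a7840), following the field annotations in the dump. The
array, macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The 24 words shown by "mds %ebx tw" above; tw_words[0] is at ebx. */
static const uint32_t tw_base = 0xc50a7840u;
static const uint32_t tw_words[24] = {
	0x260f3c09u, 0x6d0f3c09u, 0x8200a3e5u, 0x00000000u,
	0x00000000u, 0x00000000u, 0x00000000u, 0xc26dcbc8u,
	0x00820506u, 0x00000002u, 0xf9e3ccd0u, 0x00002a8fu,
	0x00001770u, 0xd4ad3ceeu, 0x878fe09eu, 0x000016d0u,
	0x00000000u, 0x00000000u, 0x000353c1u, 0x00000000u,
	0xc50a7840u, 0x00000000u, 0x00000000u, 0x00000000u,
};

/* Byte offsets taken from the annotations in the dump. */
#define NEXT_DEATH_OFF  0x50u	/* 0xc50a7890 - 0xc50a7840 */
#define PPREV_DEATH_OFF 0x54u	/* 0xc50a7894 - 0xc50a7840 */

/* Returns the dumped 32-bit word at byte offset off from ebx. */
uint32_t word_at(uint32_t off)
{
	assert(off % 4 == 0 && off / 4 < sizeof tw_words / sizeof tw_words[0]);
	return tw_words[off / 4];
}

/* Non-zero if next_death holds the bucket's own address. */
int next_death_is_self(void)
{
	return word_at(NEXT_DEATH_OFF) == tw_base;
}
```

next_death_is_self() returns non-zero and pprev_death reads as 0,
which is consistent with the loop having just run its unlink steps on
this entry (they clear pprev_death) while the entry still names itself
as the next one.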

Thread overview: 2+ messages
2004-03-27 23:27 Nivedita Singhvi [this message]
2004-03-28  9:35 ` TCP hang in timewait processing David S. Miller