From: Nivedita Singhvi <niv@us.ibm.com>
To: David Miller <davem@redhat.com>
Cc: netdev <netdev@oss.sgi.com>, Elizabeth Kon <bkon@us.ibm.com>,
jgrimm@us.ibm.com, jgarvey@us.ibm.com
Subject: TCP hang in timewait processing
Date: Sat, 27 Mar 2004 15:27:51 -0800
Message-ID: <40660DF7.9090806@us.ibm.com>
Dave,
We're investigating a hang in TCP that a clustered node
is running into, and I'd appreciate any help whatsoever
on this...
The system is running SLES8 + patches (including the latest
timewait fixes), but the code involved is essentially equivalent
to the mainline 2.4 kernel from what I can tell.
The problem is reproducible; it takes anywhere from several hours
to a day to trigger.
The hang occurs because the while loop in tcp_twkill() spins
forever:
while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
        tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
        if (tw->next_death)
                tw->next_death->pprev_death = tw->pprev_death;
        tw->pprev_death = NULL;
        spin_unlock(&tw_death_lock);
        tcp_timewait_kill(tw);
        tcp_tw_put(tw);
        killed++;
        spin_lock(&tw_death_lock);
}
Thanks to some neat detective work by Beth Kon and Joe
Garvey, the culprit seems to be a tw node pointing to
itself. See attached note from Beth at end.
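Just to make the failure mode concrete before diving into the
details, here is a tiny userland model of that loop (simplified,
made-up types; this is obviously not the kernel code). The loop can
only terminate once the slot pointer reaches NULL, so a bucket whose
next_death points back at itself keeps reinstalling itself as the
head. The sketch caps the iterations so it exits and reports the
cycle instead of spinning:

/* Userland model of the tcp_twkill drain loop; names and types
 * here are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

struct tw_bucket {
        struct tw_bucket *next_death;
        struct tw_bucket **pprev_death;
};

static struct tw_bucket *death_row_slot;        /* stands in for one slot */

int main(void)
{
        struct tw_bucket *tw = malloc(sizeof(*tw));
        int killed = 0;

        /* The corrupted state from the report: the head links to itself. */
        tw->next_death = tw;
        tw->pprev_death = &death_row_slot;
        death_row_slot = tw;

        /* Same shape as the kernel loop (kill/put and locking omitted). */
        while ((tw = death_row_slot) != NULL) {
                death_row_slot = tw->next_death;        /* reinstalls tw */
                if (tw->next_death)
                        tw->next_death->pprev_death = tw->pprev_death;
                tw->pprev_death = NULL;
                killed++;
                if (killed > 5) {       /* cap so the sketch halts */
                        printf("head never drains: still %p after %d passes\n",
                               (void *)death_row_slot, killed);
                        break;
                }
        }
        free(tw);
        return 0;
}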
This is possible if a tcp_tw_bucket is freed prematurely, before
being taken off the death list. If the node is at the head of the
list, and is freed and then later reallocated in tcp_time_wait()
and reinserted into the list (now linked to a new sk), it will end
up pointing at itself. [There might be other ways to end up like
this, but I'm not seeing them.]
We come into tcp_tw_schedule() (which puts it into the
death list) with pprev_death cleared by tcp_time_wait().
tcp_tw_schedule() {
        if (tw->pprev_death) {
                ...
        } else
                atomic_inc(&tw->refcnt);
        ...
        if((tw->next_death = *tpp) != NULL)
                (*tpp)->pprev_death = &tw->next_death;
        *tpp = tw;
        tw->pprev_death = tpp;
If tw is at the head of the list (*tpp == tw), then we have just
created a loop: tw->next_death now points at tw.
If tw is in other places on the death list, we could
potentially have Y-shaped chains and other garbage...
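To see that concretely, here is a toy version of just that linking
step (again with made-up minimal types; the helper name is mine,
this is not the real tcp_tw_schedule()). Re-inserting a bucket that
is already sitting at the head of the slot leaves its next_death
pointing at the bucket itself:

/* Toy model of the head insertion above; types and names are made up. */
#include <assert.h>
#include <stddef.h>

struct tw_bucket {
        struct tw_bucket *next_death;
        struct tw_bucket **pprev_death;
};

static void insert_head(struct tw_bucket *tw, struct tw_bucket **tpp)
{
        /* Same linking sequence as the tcp_tw_schedule() excerpt. */
        if ((tw->next_death = *tpp) != NULL)
                (*tpp)->pprev_death = &tw->next_death;
        *tpp = tw;
        tw->pprev_death = tpp;
}

int main(void)
{
        static struct tw_bucket bucket;         /* zero-initialized */
        struct tw_bucket *slot = NULL;          /* one death-row slot */

        insert_head(&bucket, &slot);            /* normal first insertion */
        assert(slot == &bucket && bucket.next_death == NULL);

        /* Now pretend the bucket was freed and reallocated without ever
         * being unlinked, so it is still at the head (*tpp == tw): */
        insert_head(&bucket, &slot);
        assert(bucket.next_death == &bucket);   /* self-loop created */
        return 0;
}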
Does that seem correct, or am I barking up the wrong
tree here?
Just checking at this point for a node pointing to itself is rather
late: the damage has already been done, since we've lost the
original linkage from the tcp_tw_bucket to the other structures
that also need to be unlinked, so as not to make a further mess of
the hash table and death-list pointers.
So the question is, is there any path that leads to
us erroneously freeing tcp_tw_bucket without taking it
off the death list?
I've been looking at the tw refcount manipulation and trying to
identify any possible gratuitous tcp_tw_put() calls, but haven't
isolated one yet.
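In the meantime, one thing I'm considering purely as a diagnostic
(my own sketch, not a fix, and the exact placement and format
string are guesses) is a trap at the top of the tcp_twkill() loop
body, right after tw is loaded from the slot, so we log the
corruption close to where it first becomes visible instead of
hanging later:

        if (tw->next_death == tw) {
                /* Hypothetical debug trap: a bucket should never link to
                 * itself. Log it and clear the slot so the timer can make
                 * progress; this leaks whatever was queued there and is
                 * only meant to surface the problem earlier. */
                printk(KERN_ERR "tcp_twkill: tw %p next_death points to "
                       "itself (slot %d, refcnt %d)\n", tw,
                       tcp_tw_death_row_slot, atomic_read(&tw->refcnt));
                tcp_tw_death_row[tcp_tw_death_row_slot] = NULL;
                break;
        }

That obviously doesn't explain how we got here, but it should give
us log context much closer to the offending path.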
Any ideas, pointers would be very much appreciated!
thanks,
Nivedita
---
From Beth Kon:
I see what is going on here... not sure how it got to this state.
Joe Garvey did excellent work gathering kdb info (and
graciously taught me a lot as he went along) and confirming that the
while loop in tcp_twkill is in an infinite loop.
Using the data Joe gathered, here is what I see...
[0]kdb> rd
eax = 0x00000001 ebx = 0xc50a7840 ecx = 0xdf615478 edx = 0x00000001
esi = 0x061c3332 edi = 0x00000000 esp = 0xc03e7f10 eip = 0xc02be950
ebp = 0x00000000 xss = 0xc02e0018 xcs = 0x00000010 eflags = 0x00000282
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc03e7edc
In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx.
The contents of the tw struct (annotated by me) are:
[0]kdb> mds %ebx tw
0xc50a7840 260f3c09 .<.& daddr
0xc50a7844 6d0f3c09 .<.m rcv_saddr
0xc50a7848 8200a3e5 å£.. dport, num
0xc50a784c 00000000 .... bound_dev_if
0xc50a7850 00000000 .... next
0xc50a7854 00000000 .... pprev
0xc50a7858 00000000 .... bindnext
0xc50a785c c26dcbc8 ÈËm bind_pprev
[0]kdb>
0xc50a7860 00820506 .... state, substate, sport
0xc50a7864 00000002 .... family
0xc50a7868 f9e3ccd0 ÐÌãù refcnt
0xc50a786c 00002a8f .*.. hashent
0xc50a7870 00001770 p... timeout
0xc50a7874 d4ad3cee î<Ô rcv_next
0xc50a7878 878fe09e .à.. send_next
0xc50a787c 000016d0 Ð... rcv_wnd
[0]kdb>
0xc50a7880 00000000 .... ts_recent
0xc50a7884 00000000 .... ts_recent_stamp
0xc50a7888 000353c1 ÁS.. ttd
0xc50a788c 00000000 .... tb
0xc50a7890 c50a7840 @x.Å next_death
0xc50a7894 00000000 .... pprev_death
0xc50a7898 00000000 ....
0xc50a789c 00000000 ....
The above shows that next_death in the structure == ebx, which means
this element of the linked list points to itself, so the loop never
terminates. Assuming this is the last element on the linked list,
next_death should be NULL.