* TCP hang in timewait processing
@ 2004-03-27 23:27 Nivedita Singhvi
From: Nivedita Singhvi @ 2004-03-27 23:27 UTC
To: David Miller; +Cc: netdev, Elizabeth Kon, jgrimm, jgarvey
Dave,
We're investigating a hang in TCP that a clustered node
is running into, and I'd appreciate any help whatsoever
on this...
The system is running SLES8 plus patches (including the latest
timewait fixes), but as far as I can tell it is essentially
equivalent to the mainline 2.4 kernel.
The problem is reproducible; it takes anywhere from several
hours to a day to hit.
The hang occurs because the while loop in tcp_twkill spins
forever:
while ((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
    tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
    if (tw->next_death)
        tw->next_death->pprev_death = tw->pprev_death;
    tw->pprev_death = NULL;
    spin_unlock(&tw_death_lock);
    tcp_timewait_kill(tw);
    tcp_tw_put(tw);
    killed++;
    spin_lock(&tw_death_lock);
}
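To make the failure mode concrete, here is a minimal user-space sketch
of that loop shape; the struct and the names below are simplified
stand-ins, not the real kernel definitions. Once a bucket's next_death
points back at the bucket itself, the slot is re-installed on every
pass and the loop condition can never become NULL:

#include <stdio.h>

struct tw_bucket {
    struct tw_bucket *next_death;
    struct tw_bucket **pprev_death;
};

int main(void)
{
    struct tw_bucket node;
    struct tw_bucket *slot;            /* one death-row slot */
    int passes = 0;

    /* Corrupted state seen in the dump: next_death == the node itself. */
    node.next_death = &node;
    node.pprev_death = &slot;
    slot = &node;

    /* Same shape as the tcp_twkill loop, with a safety cap added. */
    while (slot != NULL) {
        struct tw_bucket *tw = slot;

        slot = tw->next_death;         /* re-installs tw itself */
        if (tw->next_death)
            tw->next_death->pprev_death = tw->pprev_death;
        tw->pprev_death = NULL;

        if (++passes > 5) {
            printf("slot never drains: still holds %p after %d passes\n",
                   (void *)slot, passes);
            return 1;
        }
    }
    return 0;
}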
Thanks to some neat detective work by Beth Kon and Joe
Garvey, the culprit seems to be a tw node pointing to
itself. See attached note from Beth at end.
This is possible if a tcp_tw_bucket is freed prematurely, before
being taken off the death list. If the node is at the head of the
list, is freed, and is then reallocated in tcp_time_wait() (now
linked to a new sk) and reinserted into the list, it will end up
pointing at itself. [There might be other ways to end up like
this, but I'm not seeing them.]
We come into tcp_tw_schedule() (which puts the bucket on the
death list) with pprev_death already cleared by tcp_time_wait():
tcp_tw_schedule() {
    if (tw->pprev_death) {
        ...
    } else
        atomic_inc(&tw->refcnt);
    ...
    if ((tw->next_death = *tpp) != NULL)
        (*tpp)->pprev_death = &tw->next_death;
    *tpp = tw;
    tw->pprev_death = tpp;
If tw is already at the head of the list (*tpp == tw), then we
have just created a loop: tw->next_death points at tw. If tw is
elsewhere on the death list, we could potentially end up with
Y-shaped chains and other garbage...
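A minimal user-space sketch of that insert, using simplified stand-in
types rather than the real kernel definitions, shows the self-loop
directly: scheduling a bucket that is still sitting at the head of the
slot makes its next_death point at itself:

#include <assert.h>
#include <stddef.h>

struct tw_bucket {
    struct tw_bucket *next_death;
    struct tw_bucket **pprev_death;
};

/* Same pointer manipulation as the insert in tcp_tw_schedule() above. */
static void death_row_insert(struct tw_bucket **tpp, struct tw_bucket *tw)
{
    if ((tw->next_death = *tpp) != NULL)
        (*tpp)->pprev_death = &tw->next_death;
    *tpp = tw;
    tw->pprev_death = tpp;
}

int main(void)
{
    struct tw_bucket tw = { NULL, NULL };
    struct tw_bucket *head = NULL;     /* one death-row slot */

    death_row_insert(&head, &tw);      /* normal first insert */
    assert(tw.next_death == NULL);

    /*
     * Now pretend the bucket was freed without being unlinked,
     * reallocated, and scheduled again while still at the head:
     * *tpp == tw, so next_death ends up pointing at tw itself.
     */
    tw.pprev_death = NULL;             /* as cleared by tcp_time_wait() */
    death_row_insert(&head, &tw);
    assert(tw.next_death == &tw);      /* the self-loop */
    return 0;
}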
Does that seem correct, or am I barking up the wrong
tree here?
Just checking at this point for a node pointing to itself is
rather late: the damage has already been done, because we have
lost the original linkages from the tcp_tw_bucket to the other
structures that also need to be removed, and without them we make
a further mess of the hash table and death-list pointers.
So the question is: is there any path that leads to us
erroneously freeing a tcp_tw_bucket without taking it off the
death list?
I've been looking at the tw refcount manipulation and trying to
identify any possible gratuitous tcp_tw_put() calls, but I
haven't isolated one yet.
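For what it's worth, the invariant I'm checking against is roughly
this: the hash linkage and the death-row linkage each hold one
reference, so the count must never reach zero while pprev_death is
still set. Below is a hedged user-space sketch of that invariant; the
plain int refcnt and the helper are simplified stand-ins for the
kernel's atomic_t machinery, and the extra put is the suspected bug,
not a confirmed code path:

#include <stdio.h>

struct tw_bucket {
    int refcnt;                 /* stand-in for atomic_t refcnt */
    void **pprev_death;         /* non-NULL while on the death row */
};

/* Stand-in for tcp_tw_put(): the bucket is freed when refcnt hits zero. */
static void tw_put(struct tw_bucket *tw)
{
    if (--tw->refcnt == 0 && tw->pprev_death != NULL)
        printf("BUG: freeing a bucket still linked on the death row\n");
}

int main(void)
{
    void *slot = NULL;
    struct tw_bucket tw = { 0, NULL };

    tw.refcnt++;                /* reference held by the hash table */
    tw.refcnt++;                /* reference held by the death row */
    tw.pprev_death = &slot;

    tw_put(&tw);                /* balanced put: hash-side unlink (not modeled) */
    tw_put(&tw);                /* one put too many: bucket still on death row */
    return 0;
}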
Any ideas, pointers would be very much appreciated!
thanks,
Nivedita
---
From Beth Kon:
I see what is going on here... not sure how it got to this state.
Joe Garvey did excellent work gathering kdb info (and graciously
taught me a lot as he went along) and confirming that the while
loop in tcp_twkill never terminates.
Here is the code in tcp_twkill that is looping:
while ((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
    tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
    if (tw->next_death)
        tw->next_death->pprev_death = tw->pprev_death;
    tw->pprev_death = NULL;
    spin_unlock(&tw_death_lock);
    tcp_timewait_kill(tw);
    tcp_tw_put(tw);
    killed++;
    spin_lock(&tw_death_lock);
}
Using the data Joe gathered, here is what I see...
[0]kdb> rd
eax = 0x00000001 ebx = 0xc50a7840 ecx = 0xdf615478 edx = 0x00000001
esi = 0x061c3332 edi = 0x00000000 esp = 0xc03e7f10 eip = 0xc02be950
ebp = 0x00000000 xss = 0xc02e0018 xcs = 0x00000010 eflags = 0x00000282
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc03e7edc
In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx.
The contents of the tw struct (annotated by me) are:
[0]kdb> mds %ebx tw
0xc50a7840 260f3c09 .<.& daddr
0xc50a7844 6d0f3c09 .<.m rcv_saddr
0xc50a7848 8200a3e5 å£.. dport, num
0xc50a784c 00000000 .... bound_dev_if
0xc50a7850 00000000 .... next
0xc50a7854 00000000 .... pprev
0xc50a7858 00000000 .... bindnext
0xc50a785c c26dcbc8 ÈËm bind_pprev
[0]kdb>
0xc50a7860 00820506 .... state, substate, sport
0xc50a7864 00000002 .... family
0xc50a7868 f9e3ccd0 ÐÌãù refcnt
0xc50a786c 00002a8f .*.. hashent
0xc50a7870 00001770 p... timeout
0xc50a7874 d4ad3cee î<Ô rcv_next
0xc50a7878 878fe09e .à.. send_next
0xc50a787c 000016d0 Ð... rcv_wnd
[0]kdb>
0xc50a7880 00000000 .... ts_recent
0xc50a7884 00000000 .... ts_recent_stamp
0xc50a7888 000353c1 ÁS.. ttd
0xc50a788c 00000000 .... tb
0xc50a7890 c50a7840 @x.Å next_death
0xc50a7894 00000000 .... pprev_death
0xc50a7898 00000000 ....
0xc50a789c 00000000 ....
The above shows that next_death in the structure == ebx, which means
this element of the linked list points to itself; that is why the loop
never terminates. If this were the last element on the list,
next_death should be NULL.
* Re: TCP hang in timewait processing
@ 2004-03-28 9:35 David S. Miller
From: David S. Miller @ 2004-03-28 9:35 UTC
To: Nivedita Singhvi; +Cc: netdev, bkon, jgrimm, jgarvey
On Sat, 27 Mar 2004 15:27:51 -0800
Nivedita Singhvi <niv@us.ibm.com> wrote:
> Any ideas, pointers would be very much appreciated!
One thing that makes timewait bucket garbage collection interesting
is that the node can be reached from two spots, the death row list
and the TCP hash chains via packet input processing.
So you have to see if scenarios like the following are possible:
1) the TW death worker thread chooses tw X to be killed and drops
   tw_death_lock
2) packet input hits tw X; the packet causes a reset, which kills
   the tw
3) packet input thus tries to remove tw X from the death row list
   too, and puts it
You get the idea.
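As a hedged sketch of why both paths must coordinate, here is a
simplified user-space model; the mutex, the struct, and the helper are
stand-ins for tw_death_lock and the kernel list handling, not the
actual 2.4 code, and the two sequential calls stand in for the two
racing paths. The defense shown, re-checking pprev_death while holding
the lock so the loser of the race sees the bucket already unlinked and
backs off, is the standard idiom, not necessarily what 2.4 does:

#include <pthread.h>
#include <stddef.h>

struct tw_bucket {
    struct tw_bucket *next_death;
    struct tw_bucket **pprev_death;    /* NULL once off the death row */
};

static pthread_mutex_t death_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if this caller performed the unlink, 0 if someone beat it. */
static int death_row_unlink(struct tw_bucket *tw)
{
    int unlinked = 0;

    pthread_mutex_lock(&death_lock);
    if (tw->pprev_death != NULL) {     /* still linked? re-check under lock */
        *tw->pprev_death = tw->next_death;
        if (tw->next_death)
            tw->next_death->pprev_death = tw->pprev_death;
        tw->pprev_death = NULL;
        unlinked = 1;                  /* this caller owns the removal */
    }
    pthread_mutex_unlock(&death_lock);
    return unlinked;
}

int main(void)
{
    struct tw_bucket *head = NULL;     /* one death-row slot */
    struct tw_bucket tw = { NULL, NULL };

    /* Link tw as the only entry on the slot. */
    tw.pprev_death = &head;
    head = &tw;

    /* Both paths try to unlink; exactly one may win. */
    int worker = death_row_unlink(&tw);    /* step 1: kill worker */
    int input  = death_row_unlink(&tw);    /* steps 2-3: packet input */
    return (worker + input == 1) ? 0 : 1;
}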
We had a similar bug recently in the 2.6.x tree but that was due to
a bug in the tw-kill-via-worker-thread code which is not in 2.4.x unless
someone patched it into the 2.4.x tree you are using :-)