Subject: TCP hang in timewait processing
From: Nivedita Singhvi @ 2004-03-27 23:27 UTC
  To: David Miller; +Cc: netdev, Elizabeth Kon, jgrimm, jgarvey

Dave,

We're investigating a hang in TCP that a clustered node
is running into, and I'd appreciate any help whatsoever
on this...

The system is running SLES8 plus patches (including the
latest timewait fixes), but the code is essentially
equivalent to the mainline 2.4 kernel as far as I can
tell. The problem is reproducible, though it takes
anywhere from several hours to a day to hit.

The hang occurs because the while loop in tcp_twkill()
spins forever:

while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
	/* Unlink tw from the head of this death-row slot. */
	tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
	if (tw->next_death)
		tw->next_death->pprev_death = tw->pprev_death;
	tw->pprev_death = NULL;
	spin_unlock(&tw_death_lock);

	/* Unhash the bucket and drop the death-row reference. */
	tcp_timewait_kill(tw);
	tcp_tw_put(tw);

	killed++;

	spin_lock(&tw_death_lock);
}

Thanks to some neat detective work by Beth Kon and Joe
Garvey, the culprit seems to be a tw node pointing to
itself: with tw->next_death == tw, the first statement in
the loop writes the same pointer back into the slot, so
the head never becomes NULL. See the attached note from
Beth at the end.

This is possible if a tcp_tw_bucket is freed prematurely,
before being taken off the death list. If the node is at
the head of the list when it is freed, and is later
reallocated in tcp_time_wait() (now linked to a new sk)
and reinserted into the list, it will end up pointing at
itself. [There might be other ways to end up in this
state, but I'm not seeing them.]

We come into tcp_tw_schedule() (which puts the bucket on
the death list) with pprev_death already cleared by
tcp_time_wait().

tcp_tw_schedule() {

	if (tw->pprev_death) {
		...
	} else
		atomic_inc(&tw->refcnt);

	...

	if((tw->next_death = *tpp) != NULL)
		(*tpp)->pprev_death = &tw->next_death;
	*tpp = tw;
	tw->pprev_death = tpp;
	...
}
If tw is already at the head of the list (*tpp == tw),
then we have just created a loop: tw->next_death points
at tw. If tw is elsewhere on the death list, we could
potentially end up with Y-shaped chains and other
garbage...
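
To check that reasoning, here is a minimal user-space
sketch (my own simplification, not kernel code) of just
the linking steps above. struct tw_bucket and
schedule_tw() are stand-ins I made up for tcp_tw_bucket
and tcp_tw_schedule(); only the two-pointer death-row
linkage is modeled:

#include <assert.h>
#include <stdio.h>
#include <stddef.h>

struct tw_bucket {
	struct tw_bucket *next_death;
	struct tw_bucket **pprev_death;
};

static struct tw_bucket *death_row;	/* stands in for tcp_tw_death_row[slot] */

static void schedule_tw(struct tw_bucket *tw)
{
	struct tw_bucket **tpp = &death_row;

	/* The same linking steps as in tcp_tw_schedule() above. */
	if ((tw->next_death = *tpp) != NULL)
		(*tpp)->pprev_death = &tw->next_death;
	*tpp = tw;
	tw->pprev_death = tpp;
}

int main(void)
{
	struct tw_bucket a = { NULL, NULL }, b = { NULL, NULL };

	schedule_tw(&a);		/* death_row -> a */
	schedule_tw(&b);		/* death_row -> b -> a */

	/*
	 * Model the suspected bug: b is freed without being
	 * unlinked, the allocator hands the same memory back to
	 * tcp_time_wait(), which clears pprev_death and schedules
	 * the "new" bucket. death_row still points at b, so
	 * *tpp == tw on entry.
	 */
	b.pprev_death = NULL;
	schedule_tw(&b);

	assert(b.next_death == &b);	/* self-loop: tcp_twkill() spins */
	printf("self-loop created: b.next_death == &b\n");
	return 0;
}

The assert holds: after the second insert b points at
itself, which is exactly the state Beth found in the
memory dump below. Note that a has also been orphaned
off the list - the "other garbage" mentioned above.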

Does that seem correct, or am I barking up the wrong
tree here?

Just checking at this point for a node pointing to
itself is rather late - by then the damage has been done:
we have lost the original linkages from the tcp_tw_bucket
to the other structures that also need to be removed, so
the hash table and death-list pointers are left in a
mess.
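
For illustration only, the kind of late check meant here
would look something like this in tcp_twkill() (a
hypothetical band-aid I'm sketching, not a proposed fix);
it would stop the spin but cannot restore the linkages
that were already lost:

while ((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
	/* Hypothetical band-aid: give up on a self-linked bucket. */
	if (tw->next_death == tw) {
		printk(KERN_ERR "tcp_twkill: self-linked tw %p\n", tw);
		tcp_tw_death_row[tcp_tw_death_row_slot] = NULL;
		break;
	}
	/* ... unlink and kill as in the original loop ... */
}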

So the question is: is there any path that leads to us
erroneously freeing a tcp_tw_bucket without taking it
off the death list?

I've been looking at the tw refcount manipulation and am
trying to identify any possible gratuitous tcp_tw_put()
calls, but I haven't successfully isolated one yet.
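
One idea is to instrument the release path so the guilty
caller traps itself. A sketch of the assertion I have in
mind, assuming the stock 2.4 tcp_tw_put() that frees into
tcp_timewait_cachep (reading pprev_death without
tw_death_lock should be safe here, since the refcount has
already hit zero):

static __inline__ void tcp_tw_put(struct tcp_tw_bucket *tw)
{
	if (atomic_dec_and_test(&tw->refcnt)) {
		/* Nobody should drop the last reference while the
		 * bucket is still linked on the death row. */
		if (tw->pprev_death != NULL)
			BUG();
		kmem_cache_free(tcp_timewait_cachep, tw);
	}
}

A BUG() firing here would point straight at the
gratuitous tcp_tw_put() call, if one exists.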

Any ideas, pointers would be very much appreciated!

thanks,
Nivedita

---
From Beth Kon:
I see what is going on here... though I'm not sure how it
got into this state.

Joe Garvey did excellent work gathering kdb info (and
graciously taught me a lot as he went along) and confirmed
that the while loop in tcp_twkill is spinning forever.

The spinning loop is the while loop in tcp_twkill quoted
in full above.
Using the data Joe gathered, here is what I see...

[0]kdb> rd
eax = 0x00000001 ebx = 0xc50a7840 ecx = 0xdf615478 edx = 0x00000001
esi = 0x061c3332 edi = 0x00000000 esp = 0xc03e7f10 eip = 0xc02be950
ebp = 0x00000000 xss = 0xc02e0018 xcs = 0x00000010 eflags = 0x00000282
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc03e7edc

In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx.

The contents of the tw struct (annotated by me) are:

[0]kdb> mds %ebx tw
0xc50a7840 260f3c09   .<.&    daddr
0xc50a7844 6d0f3c09   .<.m    rcv_saddr
0xc50a7848 8200a3e5   å£..    dport, num
0xc50a784c 00000000   ....    bound_dev_if
0xc50a7850 00000000   ....    next
0xc50a7854 00000000   ....    pprev
0xc50a7858 00000000   ....    bindnext
0xc50a785c c26dcbc8   ÈËmÂ    bind_pprev
0xc50a7860 00820506   ....    state, substate, sport
0xc50a7864 00000002   ....    family
0xc50a7868 f9e3ccd0   ÐÌãù    refcnt
0xc50a786c 00002a8f   .*..    hashent
0xc50a7870 00001770   p...    timeout
0xc50a7874 d4ad3cee   î<­Ô    rcv_next
0xc50a7878 878fe09e   .à..    send_next
0xc50a787c 000016d0   Ð...    rcv_wnd
0xc50a7880 00000000   ....    ts_recent
0xc50a7884 00000000   ....    ts_recent_stamp
0xc50a7888 000353c1   ÁS..    ttd
0xc50a788c 00000000   ....    tb
0xc50a7890 c50a7840   @x.Å    next_death
0xc50a7894 00000000   ....    pprev_death
0xc50a7898 00000000   ....
0xc50a789c 00000000   ....

The above shows that next_death in the structure == ebx,
which means this element of the linked list is pointing
to itself. Each pass through the loop stores
tw->next_death (== tw) back into the death-row slot, so
the loop condition never sees NULL and it spins forever.
Assuming this is the last element on the list, next_death
should instead be NULL.
