From: Nivedita Singhvi
To: David Miller
Cc: netdev, Elizabeth Kon, jgrimm@us.ibm.com, jgarvey@us.ibm.com
Subject: TCP hang in timewait processing
Date: Sat, 27 Mar 2004 15:27:51 -0800
Message-ID: <40660DF7.9090806@us.ibm.com>

Dave,

We're investigating a hang in TCP that a clustered node is running
into, and I'd appreciate any help whatsoever on this...

The system is running SLES8 + patches (including the latest timewait
fixes), but is pretty much equivalent to the mainline 2.4 kernel from
what I can tell. The problem is reproducible, and takes anywhere from
several hours to a day to hit.

The hang occurs because the while loop in tcp_twkill() goes into an
infinite loop:

        while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
                tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
                if (tw->next_death)
                        tw->next_death->pprev_death = tw->pprev_death;
                tw->pprev_death = NULL;
                spin_unlock(&tw_death_lock);

                tcp_timewait_kill(tw);
                tcp_tw_put(tw);

                killed++;

                spin_lock(&tw_death_lock);
        }

Thanks to some neat detective work by Beth Kon and Joe Garvey, the
culprit seems to be a tw node pointing to itself. See the attached
note from Beth at the end.

This is possible if a tcp_tw_bucket is freed prematurely, before being
taken off the death list. If the node is at the head of the list, and
is freed and then later reallocated in tcp_time_wait() and reinserted
into the list (now linked to a new sk), it will end up pointing at
itself. [There might be other ways to end up like this, but I'm not
seeing them.]

We come into tcp_tw_schedule() (which puts it onto the death list) with
pprev_death cleared by tcp_time_wait():

        tcp_tw_schedule() {
                if (tw->pprev_death) {
                        ...
                } else
                        atomic_inc(&tw->refcnt);
                ...
                if((tw->next_death = *tpp) != NULL)
                        (*tpp)->pprev_death = &tw->next_death;
                *tpp = tw;
                tw->pprev_death = tpp;

If tw is at the head of the list (*tpp == tw), then we have just
created a loop, with tw->next_death pointing at tw. If tw is elsewhere
on the death list, we could potentially end up with Y-shaped chains and
other garbage...

Does that seem correct, or am I barking up the wrong tree here?

Just checking at this point for a node pointing to itself is rather
late - the damage has already been done: we have lost the original
linkages from the tcp_tw_bucket to the other structures that also need
to be removed, so as not to cause a further mess in the hash table and
death-list pointers.

So the question is, is there any path that leads to us erroneously
freeing the tcp_tw_bucket without taking it off the death list? I've
been looking at the tw refcount manipulation and trying to identify any
gratuitous tcp_tw_put() calls, but haven't successfully isolated one
yet.

Any ideas or pointers would be very much appreciated!

thanks,
Nivedita

---

From Beth Kon:

I see what is going on here... not sure how it got to this state. Joe
Garvey did excellent work gathering kdb info (and graciously taught me
a lot as he went along) and confirming that the while loop in
tcp_twkill is in an infinite loop.
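[Aside: the head-reinsertion scenario described above is easy to mimic
outside the kernel. The sketch below is only a rough stand-in - made-up
names, no locking or refcounting, not the real tcp_tw_bucket /
tcp_tw_schedule() code - but it copies the same pprev/next pointer
handling, and scheduling a bucket that is still sitting at the head of
its slot does leave next_death pointing back at the bucket itself,
which is what the kdb data below shows.]

    /*
     * Standalone sketch (simplified, made-up names) of the death-row
     * head insertion.  schedule_on_slot() copies the pointer handling
     * from the tcp_tw_schedule() excerpt quoted above.
     */
    #include <stdio.h>
    #include <stddef.h>

    struct tw_node {
            struct tw_node  *next_death;
            struct tw_node **pprev_death;
    };

    static struct tw_node *slot;    /* one death-row slot, simplified */

    static void schedule_on_slot(struct tw_node *tw)
    {
            if ((tw->next_death = slot) != NULL)
                    slot->pprev_death = &tw->next_death;
            slot = tw;
            tw->pprev_death = &slot;
    }

    int main(void)
    {
            struct tw_node tw = { NULL, NULL };

            schedule_on_slot(&tw);  /* first, legitimate insert */

            /* Bucket gets "freed" and reused while it is still queued at
             * the head of the slot; the reused bucket starts out with
             * pprev_death cleared, as tcp_time_wait() would leave it. */
            tw.pprev_death = NULL;

            schedule_on_slot(&tw);  /* second insert: slot == &tw, so
                                     * tw.next_death now points at tw */

            printf("tw.next_death == &tw? %s\n",
                   tw.next_death == &tw ? "yes" : "no");
            return 0;
    }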
Here is the code in tcp_twkill that is in an infinite loop:

        while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
                tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
                if (tw->next_death)
                        tw->next_death->pprev_death = tw->pprev_death;
                tw->pprev_death = NULL;
                spin_unlock(&tw_death_lock);

                tcp_timewait_kill(tw);
                tcp_tw_put(tw);

                killed++;

                spin_lock(&tw_death_lock);
        }

Using the data Joe gathered, here is what I see...

[0]kdb> rd
eax = 0x00000001  ebx = 0xc50a7840  ecx = 0xdf615478  edx = 0x00000001
esi = 0x061c3332  edi = 0x00000000  esp = 0xc03e7f10  eip = 0xc02be950
ebp = 0x00000000  xss = 0xc02e0018  xcs = 0x00000010  eflags = 0x00000282
xds = 0x00000018  xes = 0x00000018  origeax = 0xffffffff  &regs = 0xc03e7edc

In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx. The contents of the tw struct (annotated by
me) are:

[0]kdb> mds %ebx                         tw
0xc50a7840 260f3c09   .<.&               daddr
0xc50a7844 6d0f3c09   .<.m               rcv_saddr
0xc50a7848 8200a3e5   å£..               dport, num
0xc50a784c 00000000   ....               bound_dev_if
0xc50a7850 00000000   ....               next
0xc50a7854 00000000   ....               pprev
0xc50a7858 00000000   ....               bindnext
0xc50a785c c26dcbc8   ÈËmÂ               bind_pprev
[0]kdb>
0xc50a7860 00820506   ....               state, substate, sport
0xc50a7864 00000002   ....               family
0xc50a7868 f9e3ccd0   ÐÌãù               refcnt
0xc50a786c 00002a8f   .*..               hashent
0xc50a7870 00001770   p...               timeout
0xc50a7874 d4ad3cee   î<­Ô               rcv_next
0xc50a7878 878fe09e   .à..               send_next
0xc50a787c 000016d0   Ð...               rcv_wnd
[0]kdb>
0xc50a7880 00000000   ....               ts_recent
0xc50a7884 00000000   ....               ts_recent_stamp
0xc50a7888 000353c1   ÁS..               ttd
0xc50a788c 00000000   ....               tb
0xc50a7890 c50a7840   @x.Å               next_death
0xc50a7894 00000000   ....               pprev_death
0xc50a7898 00000000   ....
0xc50a789c 00000000   ....

The above shows that next_death in the structure == ebx, which means
this element of the linked list is pointing to itself. So it is in an
infinite loop. Assuming this is the last element on the linked list,
next_death should be null.
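To spell out why that self-link hangs the walk: each pass of the
tcp_twkill loop reloads the head of the slot, and since next_death
points back at the very same bucket, the head never changes. Below is
a simplified standalone mock-up of just that list walk (made-up names,
with an artificial cap so the demo terminates - not the kernel source):

    #include <stdio.h>

    struct node {
            struct node *next_death;
    };

    int main(void)
    {
            struct node tw = { .next_death = &tw }; /* self-linked, as in the dump */
            struct node *slot = &tw;                /* death-row slot head */
            int passes = 0;

            /* Same shape as the tcp_twkill walk: take the head, advance
             * the slot to head->next_death.  With next_death == head, the
             * slot never advances, so without the cap this never exits. */
            while (slot != NULL && passes < 5) {
                    struct node *cur = slot;
                    slot = cur->next_death;
                    passes++;
            }

            printf("slot still points at the same bucket after %d passes\n",
                   passes);
            return 0;
    }

In the real code there is no such cap, so once a self-linked bucket
reaches the head of its slot the loop never exits, which matches the
hang described above.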