From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Osterried
Subject: Re: AX.25 unaccepted socket makes problems
Date: Wed, 28 May 2003 02:04:11 +0200
Sender: linux-hams-owner@vger.kernel.org
Message-ID: <20030528000411.GA1391@osterried.de>
References: <2197@9A0TCP>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <2197@9A0TCP>
List-Id:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Tihomir Heidelberg <9a4gl@9a0tcp.ampr.org>
Cc: linux-hams@vger.kernel.org

> are NOT destroyed here ! A condition:
> (ax25->sk->state == TCP_LISTEN && ax25->sk->dead)
> is not true because state of those connections are not

as far as i got into the code when i looked for the reason for the
segfault problem, formerly connected sessions go to the state
TCP_LISTEN when they get disconnected.

> + skb->sk->state = TCP_LISTEN;
> + if (!skb->sk->dead) {
> + skb->sk->state_change(skb->sk);
> + }

ensuring this with your patch should at least do no harm.

perhaps that's the cause of this phenomenon:

| DB0TUD-0 DB0TUD-1 ax0 LISTENING 007/005 0 320
| DL6MPG-1 DB0TUD-15 ax0 LISTENING 005/007 0 432
| DB0TUD-0 DB0TUD-4 ax0 LISTENING 002/005 0 80
| DB0DSD-0 DB0TUD-1 ax0 LISTENING 001/000 0 80
| DL6MPG-4 DB0TUD-15 ax0 LISTENING 002/001 0 816

> Cannot get the reason why
> segmentation fault is happening, but hope someone else can find
> a reason when problem is reproducable.

as i noted about a month ago in a mail to this list, i tracked the
problem down to the following point:
there's a list of active ax25 control blocks (ax25_cb *ax25_list).
this list gets corrupted. while walking through
ax25_list->next->....->next, somewhere one of these references points
to somewhere it should not.
ax25_get_info(), which is called when doing "netstat -a" or
"cat /proc/net/ax25" from userspace, walks through this list.
i had a chance to dump the data of all ax25_cb elements while walking
through the list (after i spiked my kernel with debug routines).
somewhere in the middle of the list, the "next" reference was
inconsistent compared to what it had been before, and the kernel
oopsed while following the next list element pointer.

i looked over the code several times (the routines adding and removing
list elements, checking for cli() and restore_flags()). everything
looks ok to me. currently, i assume this error is so difficult to find
because it may not be caused by the routines which change the list,
but because _maybe_ another routine in ax25.o allocates a buffer /
struct, writes more data to it than it should, and overwrites parts of
the struct where the ax25 control block is stored, causing corrupted
data in the linked list.
[as i also mentioned, ax25rtd, which adds ax25 route lists to the
kernel, causes trouble for the kernel. perhaps it's one of those
routines which messes up the ax25 cb list as a side effect]

btw, a few days ago i announced a bugfix (the [invalid] to [invalid]
SABM problem). it does fix that problem. but my hope that the kernel
oopses would go away did not come true: they still occurred.

> Also, some months ago I mention here that regulary I get this
> AX.25 kernel behavior after few days of running 9A0TCP gateway
> machine. I noticed that very often ax25d died or had to restart
> ax25d because it was not handling connections. Think this
> bind/non-accept kernel problem is very probably the reason.

afaik, it's caused by the inconsistent list: ax25_find_listener()
walks over ax25_list and may not find the ax25 control block which
actually listens.
ax25_find_listener() is called by ax25_rcv() in ax25_in.c when an SABM
is received. a linux box with a flexnet digipeater as neighbour (which
connects quite often for link test probes) gets most of its oopses
this way.

on the other hand, an ax25 socket which is listening for incoming
connections, e.g.
by ax25d, sometimes gets

[pid 4924] accept(7, 0xbffff714, [72]) = -1 ECONNABORTED (Software caused connection abort)

that should never happen.

73,
 - thomas dl9sau