From mboxrd@z Thu Jan 1 00:00:00 1970
From: Patrick McHardy
Subject: Re: bnx2_poll panicking kernel
Date: Fri, 11 Jul 2008 14:19:59 +0200
Message-ID: <48774FEF.8050700@trash.net>
References: <20080710223141.GA2519@orion.carnet.hr> <20080710.162007.04790623.davem@davemloft.net> <20080711092416.GA26515@orion.carnet.hr> <20080711.025656.261409874.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Cc: joy@entuzijast.net, mchan@broadcom.com, billfink@mindspring.com, bhutchings@solarflare.com, netdev@vger.kernel.org, mirrors@debian.org, devik@cdi.cz
To: David Miller
Return-path:
Received: from stinky.trash.net ([213.144.137.162]:57639 "EHLO stinky.trash.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752673AbYGKMUS (ORCPT ); Fri, 11 Jul 2008 08:20:18 -0400
In-Reply-To: <20080711.025656.261409874.davem@davemloft.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

David Miller wrote:
> From: Josip Rodin
> Date: Fri, 11 Jul 2008 11:24:16 +0200
>
> [ Patrick/Martin, you can simply skip to the final paragraph. ]
>
>> Here we go, it triggered, here are the first few, tell me if you need more:
>
> Thanks for the trace:
>
>> Jul 11 02:15:10 arrakis kernel: Splitting cloned skb
>> Jul 11 02:15:10 arrakis kernel: Pid: 0, comm: swapper Not tainted 2.6.25.6 #2
>> Jul 11 02:15:10 arrakis kernel:
>> Jul 11 02:15:10 arrakis kernel: Call Trace:
>> Jul 11 02:15:10 arrakis kernel: [] __alloc_skb+0x85/0x150
>> Jul 11 02:15:10 arrakis kernel: [] skb_split+0x4a/0x300
>> Jul 11 02:15:10 arrakis kernel: [] tso_fragment+0xfb/0x180
>> Jul 11 02:15:10 arrakis kernel: [] __tcp_push_pending_frames+0xde/0x860
>> Jul 11 02:15:10 arrakis kernel: [] tcp_rcv_established+0x596/0x9d0
>
> So it's splitting a frame which should be new data, but which for
> some reason already made it to the device.
>
> The comment above tso_fragment() reads:
>
> /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
>  * which is put after SKB on the list. It is very much like
>  * tcp_fragment() except that it may make several kinds of assumptions
>  * in order to speed up the splitting operation. In particular, we
>  * know that all the data is in scatter-gather pages, and that the
>  * packet has never been sent out before (and thus is not cloned).
>  */
>
> Note in particular the final phrase inside the parens. :-)))
>
> There is only one way the situation seen in the trace can develop.
> That is if the queueing discipline gave the packet to the device, yet
> returned a value that made TCP believe the packet was not sent.
>
> When TCP sees such a return value, it does not advance the head of the
> write queue. It will retry sending that head packet later. And
> that's what we seem to be seeing here.
>
> TCP treats any non-zero return value other than NET_XMIT_CN
> in this way (see tcp_transmit_skb and how it uses net_xmit_eval).
>
> I notice that HTB does a lot of very queer things wrt. return
> values.
>
> For example, it seems that if the class's leaf queue ->enqueue()
> returns any non-success value, it gives NET_XMIT_DROP back down to the
> call chain.
>
> But what if that leaf ->enqueue() is something that passes back
> NET_XMIT_CN? NET_XMIT_CN can be signalled for things like RED, in
> cases where some "other" packet in the same class got dropped but not
> necessarily the one you enqueued.
>
> NET_XMIT_CN means backoff, but it does not indicate that the specific
> packet being enqueued was dropped. It just means "some" packet from
> the same flow was dropped, and therefore there is congestion on this
> flow.
>
> Even simpler qdiscs such as SFQ use the NET_XMIT_CN return value
> when they do a drop.
>
> So this return-value munging done by HTB creates the illegal
> situation.
>
> I'm not sure how to fix this, because I'm not sure how these
> NET_XMIT_CN situations should be handled wrt. maintaining a proper
> parent queue length value.
>
> Patrick/Martin, in HTB's ->enqueue() and ->requeue() we need to
> propagate NET_XMIT_CN to the caller if that's what the leaf qdisc
> signals to us. But the question is, should sch->q.qlen be
> incremented in that case? NET_XMIT_CN means that some packet got
> dropped, but not necessarily this one. If, for example, RED
> drops another packet already in the queue does it somehow adjust
> the parent sch->q.qlen back down? If not, it's pretty clear how
> this bug got created in the first place :)

Usually we only increment q.qlen on NET_XMIT_SUCCESS; in all other
cases it stays untouched.

> Below is my idiotic attempt to cure this, but this whole situation
> needs an audit:

Yes, this also reminded me of another related bug: when actions steal
a packet, qdiscs return NET_XMIT_SUCCESS, which causes upper qdiscs to
perform incorrect qlen adjustments. I'll see if I can audit all these
paths sometime this weekend.

> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 3fb58f4..aa20b47 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> +		ret = cl->un.leaf.q->enqueue(skb, cl->un.leaf.q);
> +		if (ret == NET_XMIT_DROP) {
> +			sch->qstats.drops++;
> +			cl->qstats.drops++;
> +		} else {
> +			cl->bstats.packets +=
> +				skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
> +			cl->bstats.bytes += skb->len;
> +			htb_activate(q, cl);
> +		}
> }

The propagation of the leaf qdisc's return value is definitely
correct. The patch looks fine to me.