From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jarek Poplawski Subject: Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic) Date: Fri, 22 Oct 2010 06:52:31 +0000 Message-ID: <20101022065231.GA7036@ff.dom.local> References: <20101010103236.GA1919@del.dom.local> <20101015085853.GA8091@ff.dom.local> <20101016194815.GA1894@del.dom.local> <20101019100636.GA9246@ff.dom.local> <1287727917.9059.117.camel@edumazet-laptop> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: David Miller , emin ak , Andrew Morton , netdev@vger.kernel.org, bugzilla-daemon@bugzilla.kernel.org, bugme-daemon@bugzilla.kernel.org, Anton Vorontsov , Andy Fleming To: Eric Dumazet Return-path: Received: from mail-bw0-f46.google.com ([209.85.214.46]:50716 "EHLO mail-bw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753717Ab0JVGwn (ORCPT ); Fri, 22 Oct 2010 02:52:43 -0400 Received: by bwz11 with SMTP id 11so258514bwz.19 for ; Thu, 21 Oct 2010 23:52:41 -0700 (PDT) Content-Disposition: inline In-Reply-To: <1287727917.9059.117.camel@edumazet-laptop> Sender: netdev-owner@vger.kernel.org List-ID: On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote: > Le mardi 19 octobre 2010 ?? 10:06 +0000, Jarek Poplawski a =E9crit : > > On Tue, Oct 19, 2010 at 09:44:33AM +0300, emin ak wrote: > > > Hi Jarek; > > > After 5 days and more then 20 billion packets passed without cras= h, it > > > seems that this patch is working for me, at least for crash type = 2. > > > (For type 1, it only occured once and I can never reproduce this > > > again, but still trying. I think with this patch is also lowers t= he > > > risk for type 1. > >=20 > > It would be interesting to have a look if it's exactly type 1, beca= use > > skb_over_panic can happen for different reasons, e.g. like here: > > http://git.kernel.org/?p=3Dlinux/kernel/git/torvalds/linux-2.6.git;= a=3Dcommitdiff;h=3D63b88b9041ceef8217f34de71a2e96f0c3f0fd3b > >=20 > > > For adding a new bug entry for skb_over_panic, before that I thin= k I > > > must find a reliable way to make this type of crash reproducable, > > > otherwise I don't know how to test it if it solved or not. > >=20 > > Maybe for now let's try to get and see this type 1 again? Since the > > recycle path is suspicious a bit to me, probably limiting memory or > > slowing tx (maybe different mtus on eth0 and 1) under heavy multi c= pu > > load might help. > >=20 > > > Lastly, thanks a lot for your valuable help to overcome this prob= lem > > > and also is there anything that I can do for testing / commiting= this > > > patch to mainline? > >=20 > > Here it is for David to handle the rest. > >=20 > > Thanks a lot for such an intense testing, > > Jarek P. > > ---------------------------> > >=20 > > The rx_recycle queue is global per device but can be accesed by man= y > > napi handlers at the same time, so it needs full skb_queue primitiv= es > > (with locking). Otherwise, various crashes caused by broken skbs ar= e > > possible. > >=20 > > This patch resolves, at least partly, bugzilla bug 19692. (Because = of > > some doubts that there could be still something around which is har= d > > to reproduce my proposal is to leave this bug opened for a month.) > >=20 > > Fixes commit: 0fd56bb5be6455d0d42241e65aed057244665e5e > >=20 > > Reported-by: emin ak > > Tested-by: emin ak > > Signed-off-by: Jarek Poplawski > > CC: Andy Fleming > > --- > > diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c > > index 4f7c3f3..db47b55 100644 > > --- a/drivers/net/gianfar.c > > +++ b/drivers/net/gianfar.c > > @@ -2515,7 +2515,7 @@ static int gfar_clean_tx_ring(struct gfar_pri= v_tx_q *tx_queue) > > skb_recycle_check(skb, priv->rx_buffer_size + > > RXBUF_ALIGNMENT)) { > > gfar_align_skb(skb); > > - __skb_queue_head(&priv->rx_recycle, skb); > > + skb_queue_head(&priv->rx_recycle, skb); > > } else > > dev_kfree_skb_any(skb); > > =20 > > @@ -2598,7 +2598,7 @@ struct sk_buff * gfar_new_skb(struct net_devi= ce *dev) > > struct gfar_private *priv =3D netdev_priv(dev); > > struct sk_buff *skb =3D NULL; > > =20 > > - skb =3D __skb_dequeue(&priv->rx_recycle); > > + skb =3D skb_dequeue(&priv->rx_recycle); > > if (!skb) > > skb =3D gfar_alloc_skb(dev); > > =20 > > @@ -2754,7 +2754,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q = *rx_queue, int rx_work_limit) > > if (unlikely(!newskb)) > > newskb =3D skb; > > else if (skb) > > - __skb_queue_head(&priv->rx_recycle, skb); > > + skb_queue_head(&priv->rx_recycle, skb); > > } else { > > /* Increment the number of packets */ > > rx_queue->stats.rx_packets++; >=20 > Are you sure its needed at all ? Yes, after Emin's testing I'm quite sure this fix is needed. >=20 > Gianfar claims to be multiqueue, but only one cpu can run gfar_poll() > and call gfar_clean_tx_ring() / gfar_clean_rx_ring() >=20 > If not, there would be more bugs than only rx_recycle thing I didn't find what prevents running gfar_poll on many cpus and don't claim there is no more bugs around. Jarek P. >=20 > vi +2822 drivers/net/gianfar.c >=20 > for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_r= x_queues) { > if (test_bit(i, &serviced_queues)) > continue; > rx_queue =3D priv->rx_queue[i]; > tx_queue =3D priv->tx_queue[rx_queue->qindex]= ; >=20 > tx_cleaned +=3D gfar_clean_tx_ring(tx_queue); > rx_cleaned_per_queue =3D gfar_clean_rx_ring(r= x_queue, > budget_per_qu= eue); > rx_cleaned +=3D rx_cleaned_per_queue; > if(rx_cleaned_per_queue < budget_per_queue) { > left_over_budget =3D left_over_budget= + > (budget_per_queue - rx_cleane= d_per_queue); > set_bit(i, &serviced_queues); > num_queues--; > } > } >=20 >=20