Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [XFRM] Fix aevent timer
From: David S. Miller @ 2006-04-14 22:03 UTC (permalink / raw)
  To: hadi; +Cc: netdev, herbert, hidden
In-Reply-To: <1144932959.5262.5.camel@jzny2>

From: jamal <hadi@cyberus.ca>
Date: Thu, 13 Apr 2006 08:55:58 -0400

> Finally got an opportunity to test this. I have incorporated all the
> feedback and it passes all the scenarios i know.
> 
> Dave, please apply to Linus tree since this is a bug fix.

Applied, thanks.

^ permalink raw reply

* Re: [XFRM Doc]: aevent description
From: David S. Miller @ 2006-04-14 22:05 UTC (permalink / raw)
  To: hadi; +Cc: netdev, hidden, herbert
In-Reply-To: <1144933208.5262.9.camel@jzny2>

From: jamal <hadi@cyberus.ca>
Date: Thu, 13 Apr 2006 09:00:08 -0400

> There is dependency on the previous patch i sent since the issue that
> patch fixes is assumed in this text description. It would be a good
> idea to apply at the same time as the other.

Applied, after fixing 28 lines containing trailing whitespace :-)

^ permalink raw reply

* Re: [PATCH] atm: clip causes unregister hang
From: David S. Miller @ 2006-04-14 22:07 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, chas, linux-atm-general, netdev, stable
In-Reply-To: <20060412145254.4dd21be6@localhost.localdomain>

From: Stephen Hemminger <shemminger@osdl.org>
Date: Wed, 12 Apr 2006 14:52:54 -0700

> If Classical IP over ATM module is loaded, its neighbor table gets
> populated when permanent neighbor entries are created; but these entries
> are not flushed when the device is removed. Since the entry never gets
> flushed the unregister of the network device never completes.
> 
> This version of the patch also adds locking around the reference to
> the atm arp daemon to avoid races with events and daemon state changes.
> (Note: barrier() was never really safe)
> 
> Bug-reference: http://bugzilla.kernel.org/show_bug.cgi?id=6295
> Signed-off-by: Stephen Hemminger <shemminger@osdl.org>

Applied, thanks Stephen.

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: Jesse Brandeburg @ 2006-04-14 22:13 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Kok, Auke, netdev, Miller, David, Ronciak, John,
	Brandeburg, Jesse, Kirsher, Jeff, Kok, Auke
In-Reply-To: <4440112D.9040509@pobox.com>

On Fri, 14 Apr 2006, Jeff Garzik wrote:

> Kok, Auke wrote:
> > diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> > index 2cc9955..7ba1b16 100644
> > --- a/drivers/net/e1000/e1000_main.c
> > +++ b/drivers/net/e1000/e1000_main.c
> > @@ -3770,6 +3770,7 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
> >  			ps_page->ps_page[j] = NULL;
> >  			skb->len += length;
> >  			skb->data_len += length;
> > +			skb->truesize += length;
> 
> NAK, apparently this is adding a very serious bug.

I'm surprised to hear you say that, here's why.  Elaboration would be 
helpful.

our packet split receive code uses two buffers for any packet the hardware 
was able to split.  The skb when we're receiving say a 1000 byte IPv4 
packet looks like this:

skb -
    |- len = total packet length (1000)
    |- data_len = length in skb_shinfo(skb)->frags[] (1000 - headerlen)
    |- truesize = size of originally allocated skb (128 + NET_IP_ALIGN)

in this case, the stack is using truesize to figure out how much to charge 
the receive socket for our receive packet.

in the old driver we indicate a packet with truesize = 128, which has 1000 
bytes of data in it.  normally (for everyone using only an skb) skb->len, 
and skb->truesize are updated by skb_put();  In this case we don't have a 
kernel support function for adding length to skb->len *and* to 
skb->truesize when chaining on something to ->frags[]

see the code in do_tcp_sendpages for an example of what the stack does in 
setting up a packet for transmit using frags[]

it updates skb->len and skb->truesize, right after calling 
skb_fill_page_desc, just like we do.

interesingly, that is right where sk_forward_alloc is modified (but thats 
another thread)

Jesse

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: David S. Miller @ 2006-04-14 22:32 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: jgarzik, auke-jan.h.kok, netdev, john.ronciak, Jeffrey.t.kirsher,
	auke
In-Reply-To: <Pine.WNT.4.63.0604141448520.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 15:13:31 -0700 (Pacific Daylight Time)

> in the old driver we indicate a packet with truesize = 128, which has 1000 
> bytes of data in it.  normally (for everyone using only an skb) skb->len, 
> and skb->truesize are updated by skb_put();  In this case we don't have a 
> kernel support function for adding length to skb->len *and* to 
> skb->truesize when chaining on something to ->frags[]

Where does skb_put() update truesize?

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: Jesse Brandeburg @ 2006-04-14 22:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, bb, herbert, kernel, nipsy, jrlundgren, cat,
	djani22, yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <20060414.140225.113553109.davem@davemloft.net>



On Fri, 14 Apr 2006, David S. Miller wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Fri, 14 Apr 2006 13:28:10 -0700
> 
> > We also have some new data from the last couple of days.  First, I think 
> > that this problem is likely not just E1000's fault.  We have multiple 
> > reports both in bugzilla.kernel.org and from a distro that show this 
> > problem has occurred on (at least) tg3 driven adapters as well as e1000.
> 
> That's interesting since tg3 does not enable TSO by default for any
> chip until very recent versions of the tg3 driver.  And even with
> those recent tg3 drivers which will enable TSO by default, it is only
> done for a very specific selection of chip revisions.
> 
> Where are these reports that precisely implicate tg3?

well there was one of them here, but the tg3 bit may actually be due to 
the 2.6.14 problems.

http://bugzilla.kernel.org/show_bug.cgi?id=6279

the other one is buried at a distro's bugzilla, so I can't post it.

> Note that there were legitimate TSO retransmit bugs in the 2.6.14
> timeframe that triggered those messages and did get fixed.  And yes
> some of those reports were with tg3.

I've removed the call to pskb_expand_head in e1000 (yes i know its not 
quite right, but it shouldn't be fatal either) and the bug still occurs.

so its something else.  I'm not giving up on this, I just want to post the 
state of the investigation.

Jesse

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: David S. Miller @ 2006-04-14 22:42 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: bb, herbert, kernel, nipsy, jrlundgren, cat, djani22,
	yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <Pine.WNT.4.63.0604141527420.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 15:32:55 -0700 (Pacific Daylight Time)

> well there was one of them here, but the tg3 bit may actually be due to 
> the 2.6.14 problems.
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=6279

There are 2 e1000 gigabit devices in that person's system, and
not one tg3 device.

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: Jesse Brandeburg @ 2006-04-14 22:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, jgarzik, auke-jan.h.kok, netdev, john.ronciak,
	Jeffrey.t.kirsher, auke
In-Reply-To: <20060414.153219.62111494.davem@davemloft.net>

On Fri, 14 Apr 2006, David S. Miller wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Fri, 14 Apr 2006 15:13:31 -0700 (Pacific Daylight Time)
> 
> > in the old driver we indicate a packet with truesize = 128, which has 1000 
> > bytes of data in it.  normally (for everyone using only an skb) skb->len, 
> > and skb->truesize are updated by skb_put();  In this case we don't have a 
> > kernel support function for adding length to skb->len *and* to 
> > skb->truesize when chaining on something to ->frags[]
> 
> Where does skb_put() update truesize?

my mistake, it does not.  but truesize in the "normal" case is already set 
to the full length of the allocated skb (which would hold the whole 
frame).  In the packet split case the received skb looks dramatically 
different, and since we're one of the only drivers using this mechanism.

I could easily be incorrect here, but my interpretation based on the 
stack's use of ->truesize in do_tcp_sendpages is what I based our code 
change on.

Please help me understand how you think it should work when we have a 
device that wants to receive a packet using header in the skb->data, and 
application data in the ->frags[]? 

Jesse

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: Jesse Brandeburg @ 2006-04-14 22:46 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, bb, herbert, kernel, nipsy, jrlundgren, cat,
	djani22, yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <20060414.154240.30125561.davem@davemloft.net>

On Fri, 14 Apr 2006, David S. Miller wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Fri, 14 Apr 2006 15:32:55 -0700 (Pacific Daylight Time)
> 
> > well there was one of them here, but the tg3 bit may actually be due to 
> > the 2.6.14 problems.
> > 
> > http://bugzilla.kernel.org/show_bug.cgi?id=6279
> 
> There are 2 e1000 gigabit devices in that person's system, and
> not one tg3 device.

sure, thats fine, but we just reproduced it in two seperate systems 
without the e1000 driver loaded, using the instructions as mentioned in a 
previous email.  We used a 5704 with TSO enabled.

I'm willing to try any debug patches you might come up with.

Jesse

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: David S. Miller @ 2006-04-14 22:51 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: jgarzik, auke-jan.h.kok, netdev, john.ronciak, Jeffrey.t.kirsher,
	auke
In-Reply-To: <Pine.WNT.4.63.0604141533250.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 15:43:02 -0700 (Pacific Daylight Time)

> Please help me understand how you think it should work when we have a 
> device that wants to receive a packet using header in the skb->data, and 
> application data in the ->frags[]? 

If you're removing the pages from ->frags[] and you already accounted
for them in skb->truesize, then yes I guess it could be correct to
subtract it back out.

Looking at S2IO it's doing something very wrong with it's modification
of skb->truesize.  It's not changing the amount of memory allocated
to an SKB in any way yet it's bumping skb->truesize for some reason.
That will definitely lead to performance problems.

That's why generally when we see an skb->truesize modification, it's
just assumed to be a bug, it's generally not necessary but may be
so in your e1000 situation here.

Another problem to look out for is that once a socket has a reference
to an SKB you must never ever modify skb->truesize because doing so
will break socket memory accounting.  There is exactly one spot in
the TCP stack where we are able to do this by adjusting the socket
memory accounting variables carefully in a locked context at the same
time, but otherwise it's not safe to modify skb->truesize when there
is an attached socket.

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: David S. Miller @ 2006-04-14 22:52 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: bb, herbert, kernel, nipsy, jrlundgren, cat, djani22,
	yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <Pine.WNT.4.63.0604141543170.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 15:46:31 -0700 (Pacific Daylight Time)

> sure, thats fine, but we just reproduced it in two seperate systems 
> without the e1000 driver loaded, using the instructions as mentioned in a 
> previous email.  We used a 5704 with TSO enabled.

That I didn't notice, thanks for the datapoint.

^ permalink raw reply

* Re: [PATCH] atm: clip timer race
From: David S. Miller @ 2006-04-14 22:56 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, chas, linux-atm-general, netdev
In-Reply-To: <20060412154214.1dd4117d@localhost.localdomain>

From: Stephen Hemminger <shemminger@osdl.org>
Date: Wed, 12 Apr 2006 15:42:14 -0700

> By inspection, the clip idle timer code is racy on SMP.
> Here is a safe version of timer management.
> Untested, I don't have ATM hardware.
> 
> Signed-off-by: Stephen Hemminger <shemminger@osdl.org>

Applied, thanks Stephen.

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: Jesse Brandeburg @ 2006-04-14 22:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, bb, herbert, kernel, nipsy, jrlundgren, cat,
	djani22, yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <20060414.155202.82987818.davem@davemloft.net>

On Fri, 14 Apr 2006, David S. Miller wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Fri, 14 Apr 2006 15:46:31 -0700 (Pacific Daylight Time)
> 
> > sure, thats fine, but we just reproduced it in two seperate systems 
> > without the e1000 driver loaded, using the instructions as mentioned in a 
> > previous email.  We used a 5704 with TSO enabled.
> 
> That I didn't notice, thanks for the datapoint.

you're welcome, its "hot off the presses" (we just reproduced it 5 minutes 
ago) 

I'm trying to isolate more of a reproduction case, I'll be sure to post if 
I can find anything with more detail.

^ permalink raw reply

* Re: [PATCH 1/4] clip: run through Lindent
From: David S. Miller @ 2006-04-14 22:58 UTC (permalink / raw)
  To: shemminger; +Cc: chas, linux-atm-general, netdev, stable
In-Reply-To: <20060413152224.2d46df99@localhost.localdomain>

From: Stephen Hemminger <shemminger@osdl.org>
Date: Thu, 13 Apr 2006 15:22:24 -0700

> Run CLIP driver through Lindent script to fix formatting.
> 
> Signed-off-by: Stephen Hemminger <shemminger@osdl.org>

These no longer apply after putting in the two CLIP
bug fixes.

Please respin, thanks.

^ permalink raw reply

* Re: [PATCH 1/4] clip: run through Lindent
From: David S. Miller @ 2006-04-14 22:59 UTC (permalink / raw)
  To: shemminger; +Cc: chas, linux-atm-general, netdev, stable
In-Reply-To: <20060413152224.2d46df99@localhost.localdomain>


Ignore my previous email, I tried to apply it to the wrong
tree :-)

^ permalink raw reply

* Re: [PATCH] IrDA: irda-usb, unregister netdev when patch upload fails
From: David S. Miller @ 2006-04-14 23:02 UTC (permalink / raw)
  To: samuel.ortiz; +Cc: irda-users, jt, netdev
In-Reply-To: <20060414101607.GA32724@irie>

From: Samuel Ortiz <samuel.ortiz@nokia.com>
Date: Fri, 14 Apr 2006 13:16:07 +0300

> Hi David,
> 
> In the STIR421x case, when the firmware upload fails, we need to unregister_netdev. Otherwise we hit a BUG on free_netdev(), if sysfs is enabled.
>  
> Signed-off-by: Samuel Ortiz <samuel.ortiz@nokia.com>

Applied, but please become familiar with the carriage return key on
your keyboard and format your changelog descirptions to ~80 columns.

Thanks.

^ permalink raw reply

* Re: [PATCH] IrDA: smsc-ircc2, smcinit support for ALi ISA bridges
From: David S. Miller @ 2006-04-14 23:03 UTC (permalink / raw)
  To: samuel.ortiz; +Cc: netdev, jt, irda-users
In-Reply-To: <20060414101617.GB32724@irie>

From: Samuel Ortiz <samuel.ortiz@nokia.com>
Date: Fri, 14 Apr 2006 13:16:18 +0300

> Hi David,
> 
> This patch enables support for ALi ISA bridges when we run the smcinit code.
> It is needed to properly configure some Toshiba laptops.
> 
> Patch from Linus Walleij <triad@df.lth.se>
> Signed-off-by: Samuel Ortiz <samuel.ortiz@nokia.com>

Also applied, thanks.

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: Jesse Brandeburg @ 2006-04-14 23:04 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, jgarzik, auke-jan.h.kok, netdev, john.ronciak,
	Jeffrey.t.kirsher, auke
In-Reply-To: <20060414.155129.121789313.davem@davemloft.net>

On Fri, 14 Apr 2006, David S. Miller wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Fri, 14 Apr 2006 15:43:02 -0700 (Pacific Daylight Time)
> 
> > Please help me understand how you think it should work when we have a 
> > device that wants to receive a packet using header in the skb->data, and 
> > application data in the ->frags[]? 
> 
> If you're removing the pages from ->frags[] and you already accounted
> for them in skb->truesize, then yes I guess it could be correct to
> subtract it back out.

they are not accounted for in the skb yet.  the way e1000 works is it 
pre-allocates all its 128 byte receive buffers using skb_alloc, and then 
chains in any pages that are used for receive packets.

> Looking at S2IO it's doing something very wrong with it's modification
> of skb->truesize.  It's not changing the amount of memory allocated
> to an SKB in any way yet it's bumping skb->truesize for some reason.
> That will definitely lead to performance problems.

I can only speak to what e1000 is trying to do. we're technically 
increasing the size of the skb after we've allocated it, so

does nr_frags and frags[] need truesize to be correct?

> That's why generally when we see an skb->truesize modification, it's
> just assumed to be a bug, it's generally not necessary but may be
> so in your e1000 situation here.

cool, I would really prefer that we extend the kernel api's to "do the 
right thing" for receives using frags[]

> Another problem to look out for is that once a socket has a reference
> to an SKB you must never ever modify skb->truesize because doing so
> will break socket memory accounting.  There is exactly one spot in
> the TCP stack where we are able to do this by adjusting the socket
> memory accounting variables carefully in a locked context at the same
> time, but otherwise it's not safe to modify skb->truesize when there
> is an attached socket.

Absolutely!  we're indicating the packet to the stack using netif_rx* (and 
we allocated it), so no socket has the reference until after we change 
truesize.


^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: David S. Miller @ 2006-04-14 23:45 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: jgarzik, auke-jan.h.kok, netdev, john.ronciak, Jeffrey.t.kirsher,
	auke
In-Reply-To: <Pine.WNT.4.63.0604141555200.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 16:04:21 -0700 (Pacific Daylight Time)

> On Fri, 14 Apr 2006, David S. Miller wrote:
> 
> > From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> > Date: Fri, 14 Apr 2006 15:43:02 -0700 (Pacific Daylight Time)
> > 
> > > Please help me understand how you think it should work when we have a 
> > > device that wants to receive a packet using header in the skb->data, and 
> > > application data in the ->frags[]? 
> > 
> > If you're removing the pages from ->frags[] and you already accounted
> > for them in skb->truesize, then yes I guess it could be correct to
> > subtract it back out.
> 
> they are not accounted for in the skb yet.  the way e1000 works is it 
> pre-allocates all its 128 byte receive buffers using skb_alloc, and then 
> chains in any pages that are used for receive packets.

Ok so you allocate the base skb->data memory using alloc_skb().  Then
you add on pages and increment skb->truesize to account for the pages.
And you do all of this for receive packets before they are passed up
to the stack.

Right?

If so, then that should be OK.

Jeff, I seem to have misunderstood what e1000 was doing, the patches
changing skb->truesize should be fine.

^ permalink raw reply

* Re: [PATCH 05/10] e1000: Update truesize with the length of the packet for packet split
From: Jesse Brandeburg @ 2006-04-14 23:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: jesse.brandeburg, jgarzik, auke-jan.h.kok, netdev, john.ronciak,
	Jeffrey.t.kirsher, auke
In-Reply-To: <20060414.164512.17280154.davem@davemloft.net>



On Fri, 14 Apr 2006, David S. Miller wrote:
> > they are not accounted for in the skb yet.  the way e1000 works is it 
> > pre-allocates all its 128 byte receive buffers using skb_alloc, and then 
> > chains in any pages that are used for receive packets.
> 
> Ok so you allocate the base skb->data memory using alloc_skb().  Then
> you add on pages and increment skb->truesize to account for the pages.
> And you do all of this for receive packets before they are passed up
> to the stack.
> 
> Right?

yes, exactly.

> If so, then that should be OK.
> 
> Jeff, I seem to have misunderstood what e1000 was doing, the patches
> changing skb->truesize should be fine.

Thanks

^ permalink raw reply

* Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
From: David S. Miller @ 2006-04-14 23:53 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: bb, herbert, kernel, nipsy, jrlundgren, cat, djani22,
	yoseph.basri, mykleb, olel, michal, chris, netdev,
	jesse.brandeburg, E1000-devel, ak, jgarzik
In-Reply-To: <Pine.WNT.4.63.0604141553210.2116@jbrandeb-desk.amr.corp.intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Fri, 14 Apr 2006 15:55:10 -0700 (Pacific Daylight Time)

> I'm trying to isolate more of a reproduction case, I'll be sure to
> post if I can find anything with more detail.

I think I see the bug.

If tbench with large numbers of clients is part of what helps
reproduce it, the key might be hitting the memory limits in tcp_mem[]
and friends, or something to do with concurrent access to
sk->sk_forward_alloc.

I bet there is some race in there.

A lot of the action is in net/core/stream.c  We modify
sk->sk_forward_alloc non-atomically but that should be ok
since we ought to be holding all of the correct locks when
we hit these accesses.  But it is the first thing to audit.

Let's look at sk_stream_rfree() as that is invoked from SKB
freeing callbacks and is the most likely suspect for these
kinds of problems.

It is hooked up to the skb->destructor by sk_stream_set_owner_r() and
then invoked via __kfree_skb().

Nothing here takes any locks, and as stated above we modify
sk->sk_forward_alloc non-atomically, and this is therefore the bug.

Shit.

I'll think of how to fix this in the least invasive manner.  I also
want to search the changelog history to see if this race was always
present or if it was "introduced".

Making sk->sk_forward_alloc an atomic_t would be incredibly expensive
so I'll try to find a way to avoid that.  We may be able to just do
a bh_lock_sock()/bh_unlock_sock() around the body of sk_stream_rfree()
to fix this.

^ permalink raw reply

* fixing sk_stream_rfree()
From: David S. Miller @ 2006-04-15  3:59 UTC (permalink / raw)
  To: netdev; +Cc: herbert

Herbert, as you may have noticed we found some missing
locking in sk_stream_rfree().  It could explain the
"!sk_forward_alloc" BUG() we thought was caused by e1000's
TSO implementation and the Intel folks have provided enough
datapoints to prove that it is indeed not an e1000 specific
problem.

sk_stream_rfree() modifies sk->sk_forward_alloc without holding any
locks from the __kfree_skb() callback.

Now, _most_ of the time it is OK because we're being invoked with the
socket lock held as ACKs come back to liberate the transmit queue or
during socket shutdown which also holds the socket locked.  (I did
audit this, but I may have missed something, please double check this
assertion)

However, if skb->users is incremented or similar, that entity could
end up being the context in which the skb is freed up and that will
not necessarily have the socket locked correctly when the
skb->destructor callback invokes sk_stream_rfree().  Consider tcpdump
or similar.

But we're stuck if we try to cure this:

1) We can't take bh_lock_sock().  Well, we could if we disabled
   BH explicitly first but even then if the socket is locked
   by the user we can't tell if that's the current cpu and there
   is no easy way to handle deferring this work.

   We really can't consider using a workqueue or something like
   that to handle this, that's way too heavy handed.

2) We can't turn sk_forward_alloc easily into an atomic_t.  The
   masking operation in __sk_stream_mem_reclaim() does not translate
   readily into an atomic_t op:

		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;

   That line has always driven me nuts, but I know that it is there
   to handle partial page allocations existing when that function
   is called.

I guess for #2 we could change those two lines into:

		int n = atomic_read(&sk->sk_forward_alloc) /
				SK_STREAM_MEM_QUANTUM;

		atomic_sub(n, sk->sk_prot->memory_allocated);
		sk->sk_forward_alloc -= n * SK_STREAM_MEM_QUANTUM;

in order to make it "atomic_t op" translatable.

But even if we did this with sk->sk_forward_alloc as an atomic_t
it would have the same kinds of races we're trying to cure here,
namely that we test the value and then act upon it while an unlocked
asynchronous context can modify the value in another thread of
execution in between the test and the action.

Any ideas?  Come to think of it, __sk_stream_mem_reclaim() looks
really racey even as-is.

^ permalink raw reply

* Re: [patch] ipv4: initialize arp_tbl rw lock
From: Heiko Carstens @ 2006-04-15  7:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: shemminger, jgarzik, akpm, netdev, linux-kernel, fpavlic, davem
In-Reply-To: <20060408.031404.111884281.davem@davemloft.net>

> > Ok, so the problem seems to be that inet_init gets called after qeth_init.
> > Looking at the top level Makefile this seems to be true for all network
> > drivers in drivers/net/ and drivers/s390/net/ since we have
> > 
> > vmlinux-main := $(core-y) $(libs-y) $(drivers-y) $(net-y)
> > 
> > The patch below works for me... I guess there must be a better way to make
> > sure that any networking driver's initcall is made before inet_init?
> > 
> > Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> We could make inet_init() a subsystem init but I vaguely recall
> that we were doing that at one point and it broke things for
> some reason.
> 
> Perhaps fs_initcall() would work better.  Or if that causes
> problems we could create a net_initcall() that sits between
> fs_initcall() and device_initcall().
> 
> Or any other ideas?

Tried to figure out what is causing the delays I experienced when I replaced
module_init() in af_inet.c with fs_initcall(). After all it turned out that
synchronize_net() which is basicically nothing else than synchronize_rcu()
sometimes takes several seconds to complete?! No idea why that is...

callchain: inet_init() -> inet_register_protosw() -> synchronize_net()

^ permalink raw reply

* Re: [patch] ipv4: initialize arp_tbl rw lock
From: David S. Miller @ 2006-04-15  7:34 UTC (permalink / raw)
  To: heiko.carstens
  Cc: shemminger, jgarzik, akpm, netdev, linux-kernel, fpavlic, davem
In-Reply-To: <20060415072745.GA17011@osiris.boeblingen.de.ibm.com>

From: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Sat, 15 Apr 2006 09:27:45 +0200

> Tried to figure out what is causing the delays I experienced when I replaced
> module_init() in af_inet.c with fs_initcall(). After all it turned out that
> synchronize_net() which is basicically nothing else than synchronize_rcu()
> sometimes takes several seconds to complete?! No idea why that is...
> 
> callchain: inet_init() -> inet_register_protosw() -> synchronize_net()

The problem can't be rcu_init(), that gets done very early
in init/main.c

Maybe it's some timer or something else specific to s390?

It could also be that there's perhaps nothing to context
switch to, thus the RCU takes forever to "happen".

^ permalink raw reply

* Re: fixing sk_stream_rfree()
From: David S. Miller @ 2006-04-15  8:23 UTC (permalink / raw)
  To: netdev; +Cc: herbert
In-Reply-To: <20060414.205927.13626300.davem@davemloft.net>

From: "David S. Miller" <davem@davemloft.net>
Date: Fri, 14 Apr 2006 20:59:27 -0700 (PDT)

> Any ideas?  Come to think of it, __sk_stream_mem_reclaim() looks
> really racey even as-is.

I came up with one more possible approach, it goes something like
this:

1) Get rid of the skb->destructor callback for TCP receive
   data.

2) Adjust sk_rmem_alloc and sk_forward_alloc explicitly when we add
   packets to sk_receive_queue/out_of_order_queue, and when we unlink
   them from sk_receive_queue and __kfree_skb() them up.

I think this would work and get rid of the unlocked accesses to
sk_forward_alloc.  And it would do so without adding new overheads
(the other two ideas did, by adding locks or turning sk_forward_alloc
into an atomic_t).

The implementation needs a bit of thinking in order to not mess up all
the nice abstractions Arnaldo has created for stream sockets and the
interfaces we share between TCP and DCCP.

But actually DCCP needs this fix too, since it also uses
sk_forward_alloc, so building the solution into the shared interfaces
is desirable.

Too bad this will take a little bit of time to implement, I really
want to be able to cook up a much simpler test patch that the people
who can reproduce the BUG() can try out.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox