Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
From: Vladislav Zolotarov @ 2010-10-12  9:12 UTC (permalink / raw)
  To: Eric Dumazet, Tom Herbert
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein
In-Reply-To: <1286860560.30423.186.camel@edumazet-laptop>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Tuesday, October 12, 2010 7:16 AM
> To: Tom Herbert
> Cc: David Miller; netdev; Michael Chan; Eilon Greenstein
> Subject: Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
> 
> Le lundi 11 octobre 2010 à 22:03 -0700, Tom Herbert a écrit :
> > On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet
> <eric.dumazet@gmail.com> wrote:
> > > Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> > >> netdev_alloc_skb() is a very wrong interface, really.
> > >>
> > >> We should remove/deprecate it.
> > >>
> > >> For multi queue devices, it makes more sense to allocate skb on
> local
> > >> node of the cpu handling RX interrupts. This allow each cpu to
> > >> manipulate its own slub/slab queues/structures without doing
> expensive
> > >> cross-node business.
> > >>
> > >> For non multi queue devices, IRQ affinity should be set so that a
> cpu
> > >> close to the device services interrupts. Even if not set, using
> > >> dev_alloc_skb() is faster.
> > >>
> > >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > >
> > > Or maybe revert :
> > >
> > > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > > Author: Christoph Hellwig <hch@lst.de>
> > > Date:   Wed Dec 6 20:32:36 2006 -0800
> > >
> >
> > I second this revert.  Node aware allocation by device's node makes
> > little sense on a multi-queue device and leads to mediocre
> > performance.
> 
> Yes, I said this several time in the past, I believe time has come to
> get rid of it.
> 
> I posted a patch some minutes ago, so you can review it and ack it ;)
> 
> Thanks !

Eric, very nice patch, thanks. However Eilon is on vacation till tomorrow
and he is a responsible maintainer of the bnx2x, thus the final ACK should
be his.

Dave, if it can wait till tomorrow, I'd like to ask u to wait for Eilon's
final ACK.

Thanks,
vlad

> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH] b44: fix resume, request_irq after hw reset
From: James Hogan @ 2010-10-12  8:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gary Zambrano, Jiri Pirko, FUJITA Tomonori, Hauke Mehrtens,
	Larry Finger, netdev, linux-kernel, David S. Miller
In-Reply-To: <20101012002715.fd0e3371.akpm@linux-foundation.org>

On 12 October 2010 08:27, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 12 Oct 2010 08:08:05 +0100 James Hogan <james@albanarts.com> wrote:
>
>> > > @@ -2309,6 +2303,12 @@ static int b44_resume(struct ssb_device *sdev)
>> > >
>> > >   netif_device_attach(bp->dev);
>> > >   spin_unlock_irq(&bp->lock);
>> > >
>> > > + rc = request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name,
>> dev);
>> > > + if (rc) {
>> > > +         netdev_err(dev, "request_irq failed\n");
>> > > +         return rc;
>> > > + }
>> > > +
>> > >
>> > >   b44_enable_ints(bp);
>> > >   netif_wake_queue(dev);
>> >
>> > OK, running the interrupt handler before b44_init_hw() is presumably
>> > the problem here.
>> >
>> > The hardware surely won't be generating interrupts until we've run
>> > b44_init_hw() and b44_enable_ints(), so this patch really is only to
>> > keep CONFIG_DEBUG_SHIRQ happy.
>>
>> For me it's mainly to keep CONFIG_DEBUG_SHIRQ happy (Fedora has this switched
>> on), but since it's a shared IRQ, there is still a chance it could be
>> called before enabling it's own interrupts by a different device on the same
>> IRQ.
>
> ooh, yes, you're right, I forgot about that.  It's indeed a bug.
>
>> It makes sense to me why it's disabling the IRQ now, in case another device
>> triggers it when it cannot handle it safely.
>
> What code are you referring to here?  There's no disable_irq() in that area?

Sorry, I meant freeing the irq (free_irq() in b44_suspend).

Thinking about it this should also go in stable too.

Cheers
James

>
>> I also tried calling the
>> interrupt directly before the free_irq in the suspend function to check that
>> it wasn't being done too late, and it didn't fail, so possibly it is the core
>> suspension that makes it start failing until it is brought back up properly.
>

^ permalink raw reply

* Re: [PATCH net-next] net:  allocate skbs on local node
From: Andrew Morton @ 2010-10-12  7:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg
In-Reply-To: <1286869793.2732.24.camel@edumazet-laptop>

On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit :
> 
> > I'd love to forget it, but it's faster for some things (I forget
> > which).  Which is why it's still around.
> 
> Yes, two years ago it was true on pathological/obscure cases.
> Every time I did the comparison, SLUB won.
> You asked me, I did yet another test this morning, and 40% is pretty
> serious, I believe.
> 
> > 
> > And the ghastly thing about this is that you're forced to care about it
> > too because some people are, apparently, still using it.
> > 
> 
> Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
> machines, I know...
> 
> I am not saying we should not care, but for any serious network workload
> on NUMA arches, SLUB is the best, and seeing Christoph recent work, it
> might even get better.
> 
> BTW, I believe all modern distros ship SLUB, dont they ?
> 

Dunno.

Pekka, why haven't we deleted slab yet??

^ permalink raw reply

* Re: [PATCH net-next] net:  allocate skbs on local node
From: Eric Dumazet @ 2010-10-12  7:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter
In-Reply-To: <20101012002435.f51f2c0e.akpm@linux-foundation.org>

Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit :

> I'd love to forget it, but it's faster for some things (I forget
> which).  Which is why it's still around.

Yes, two years ago it was true on pathological/obscure cases.
Every time I did the comparison, SLUB won.
You asked me, I did yet another test this morning, and 40% is pretty
serious, I believe.

> 
> And the ghastly thing about this is that you're forced to care about it
> too because some people are, apparently, still using it.
> 

Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
machines, I know...

I am not saying we should not care, but for any serious network workload
on NUMA arches, SLUB is the best, and seeing Christoph recent work, it
might even get better.

BTW, I believe all modern distros ship SLUB, dont they ?



^ permalink raw reply

* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: Marc Kleine-Budde @ 2010-10-12  7:56 UTC (permalink / raw)
  To: Masayuki Ohtake
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	sameo-VuQAYsv1563Yd54FQh9/CA,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, chripell-VaTbYqLCNhc,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	morinaga526-ECg8zkTtlr0C6LszWs/t0g, David Miller,
	Wolfgang Grandegger, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <000c01cb69dc$5d2aaab0$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2267 bytes --]

Hello Masayuki,

On 10/12/2010 09:09 AM, Masayuki Ohtake wrote:
> We have implemented our CAN driver with FIFO mode, and
> We are testing our CAN driver with FIFO mode.
> However, we have found Our CAN hardware spec is different from our anticipated.
> Our CAN HW FIFO is not common FIFO.
> Using  FIFO mode, there is possibility received packets are out-of-order.

It's called a FIFO in the Documentation, but it's not a real FIFO as we
all hoped.

> e.g.
> Recv packet-A from NW and set to FIFO.
>   |A|
> 
> Recv packet-B from NW and set to FIFO.
>   |A|B|
> 
> Recv packet-C is about to set to FIFO
>   |A|B|(C)|
> 
> Userspace Copies A from Driver
> Userspace Copies B from Driver
>   |   |   |(C)|
> 
> packet-C set to FIFO (C is not head.)
> Recv packet-D from NW(Next packet is set to head)
>   |D|   |C|
> 
> Userspace Copies D from Driver
> Userspace Copies C from Driver
> Userspace raceived packet order is like below
> A-B-D-C

I just had a small peak at the datasheet. It seems you can implement the
same scheme for rx as on the at91, if you enough rx buffers. In the at91
driver I use 8+4 = 12 buffers for rx. Have a look at the driver, the rx
path is documented.

The driver uses 8 + 4 rx buffers, the trick is that a buffer in read but
_not_ marked as free, if I understand the intel datasheet correctly this
is by not resetting the NEWDAT bit. Becasue the driver doesn't free the
buffers it knows that the next CAN frame is in the next buffer. After
reading the 8th buffer, the first 8 buffers (0-7) are marked as free.
Then the driver continues to read buffers 8-11, they might have been
filled before marking 0-7 as free. Then the driver continues from the
beginning.

> So, I think normal-mode is better than FIFO-mode.

> I will revert like the following spec.
>   Rx 1 Message Object
>   Tx 1 Message Object
> 
> Could you agree the above ?

ahh common...the hardware doesn't seem that bad :)

cheers, Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: Wolfgang Grandegger @ 2010-10-12  7:42 UTC (permalink / raw)
  To: Masayuki Ohtake
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	sameo-VuQAYsv1563Yd54FQh9/CA,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, mkl-bIcnvbaLZ9MEGnE8C9+IrQ,
	chripell-VaTbYqLCNhc, morinaga526-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, David Miller,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <000c01cb69dc$5d2aaab0$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>

Hi Ohtake,

On 10/12/2010 09:09 AM, Masayuki Ohtake wrote:
> Hi Wolfgang,
> 
> We have implemented our CAN driver with FIFO mode, and
> We are testing our CAN driver with FIFO mode.
> However, we have found Our CAN hardware spec is different from our anticipated.
> Our CAN HW FIFO is not common FIFO.
> Using  FIFO mode, there is possibility received packets are out-of-order.
> 
> e.g.
> Recv packet-A from NW and set to FIFO.
>   |A|
> 
> Recv packet-B from NW and set to FIFO.
>   |A|B|
> 
> Recv packet-C is about to set to FIFO
>   |A|B|(C)|
> 
> Userspace Copies A from Driver

OK, let's say the CPU or software starts processing the message FIFO.

> Userspace Copies B from Driver
>   |   |   |(C)|
> 
> packet-C set to FIFO (C is not head.)
> Recv packet-D from NW(Next packet is set to head)
>   |D|   |C|
> 
> Userspace Copies D from Driver

The software could continues searching the FIFO for valid messages. Then
it would find C first.

> Userspace Copies C from Driver
> Userspace raceived packet order is like below
> A-B-D-C

I'm still optimistic that it could be handled properly by software, it
might be tricky though.

> So, I think normal-mode is better than FIFO-mode.

To be clear. Out-of-order reception is not allowed!

> I will revert like the following spec.
>   Rx 1 Message Object
>   Tx 1 Message Object
> 
> Could you agree the above ?

See above. I agree if the software cannot assure in-order reception. But
it's not yet obvious to me that it cannot be achieved. I will have a
closer look to the manual. Just one RX message object without any
further buffering is really bad as message losses are likely to happen.

Wolfgang.

^ permalink raw reply

* Re: [PATCH] b44: fix resume, request_irq after hw reset
From: Andrew Morton @ 2010-10-12  7:27 UTC (permalink / raw)
  To: James Hogan
  Cc: Gary Zambrano, Jiri Pirko, FUJITA Tomonori, Hauke Mehrtens,
	Larry Finger, netdev, linux-kernel, David S. Miller
In-Reply-To: <201010120808.05713.james@albanarts.com>

On Tue, 12 Oct 2010 08:08:05 +0100 James Hogan <james@albanarts.com> wrote:

> > > @@ -2309,6 +2303,12 @@ static int b44_resume(struct ssb_device *sdev)
> > > 
> > >  	netif_device_attach(bp->dev);
> > >  	spin_unlock_irq(&bp->lock);
> > > 
> > > +	rc = request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, 
> dev);
> > > +	if (rc) {
> > > +		netdev_err(dev, "request_irq failed\n");
> > > +		return rc;
> > > +	}
> > > +
> > > 
> > >  	b44_enable_ints(bp);
> > >  	netif_wake_queue(dev);
> > 
> > OK, running the interrupt handler before b44_init_hw() is presumably
> > the problem here.
> > 
> > The hardware surely won't be generating interrupts until we've run
> > b44_init_hw() and b44_enable_ints(), so this patch really is only to
> > keep CONFIG_DEBUG_SHIRQ happy.
> 
> For me it's mainly to keep CONFIG_DEBUG_SHIRQ happy (Fedora has this switched 
> on), but since it's a shared IRQ, there is still a chance it could be 
> called before enabling it's own interrupts by a different device on the same 
> IRQ.

ooh, yes, you're right, I forgot about that.  It's indeed a bug.

> It makes sense to me why it's disabling the IRQ now, in case another device 
> triggers it when it cannot handle it safely.

What code are you referring to here?  There's no disable_irq() in that area?

> I also tried calling the 
> interrupt directly before the free_irq in the suspend function to check that 
> it wasn't being done too late, and it didn't fail, so possibly it is the core 
> suspension that makes it start failing until it is brought back up properly.

^ permalink raw reply

* Re: [PATCH net-next] net:  allocate skbs on local node
From: Andrew Morton @ 2010-10-12  7:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter
In-Reply-To: <1286866699.30423.234.camel@edumazet-laptop>

On Tue, 12 Oct 2010 08:58:19 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le lundi 11 octobre 2010 __ 23:03 -0700, Andrew Morton a __crit :
> > On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > [PATCH net-next] net: allocate skbs on local node
> > > 
> > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> > > allocating net drivers skbs on a given memory node : The one closest to
> > > the NIC hardware. This is wrong because as soon as we try to scale
> > > network stack, we need to use many cpus to handle traffic and hit
> > > slub/slab management on cross-node allocations/frees when these cpus
> > > have to alloc/free skbs bound to a central node.
> > > 
> > > skb allocated in RX path are ephemeral, they have a very short
> > > lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> > > appeared as a nice idea four years ago is in fact a bad one.
> > > 
> > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> > > and two 10Gb NIC might deliver more than 28 million packets per second,
> > > needing all the available cpus.
> > > 
> > > Cost of cross-node handling in network and vm stacks outperforms the
> > > small benefit hardware had when doing its DMA transfert in its 'local'
> > > memory node at RX time. Even trying to differentiate the two allocations
> > > done for one skb (the sk_buff on local node, the data part on NIC
> > > hardware node) is not enough to bring good performance.
> > > 
> > 
> > This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> > 
> 
> I would say, _you_ should prove that original patch was good. It seems
> no network guy was really in the discussion ?

Two wrongs and all that.  The 2006 patch has nothing to do with it,
apart from demonstrating the importance of including performance
measurements in a performance patch.

> Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the
> difference. Thats about a 40% slowdown on high packet rates, on a dual
> socket machine (dual X5570  @2.93GHz). You can expect higher values on
> four nodes (I dont have such hardware to do the test)

Like that.  Please flesh it out and stick it in the changelog.

> 
> > The mooted effects should be tested for on both slab and slub, I
> > suggest.  They're pretty different beasts.
> 
> SLAB is so slow on NUMA these days, you can forget it for good.

I'd love to forget it, but it's faster for some things (I forget
which).  Which is why it's still around.

And the ghastly thing about this is that you're forced to care about it
too because some people are, apparently, still using it.


^ permalink raw reply

* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: Masayuki Ohtake @ 2010-10-12  7:09 UTC (permalink / raw)
  To: Wolfgang Grandegger, David Miller
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	sameo-VuQAYsv1563Yd54FQh9/CA,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, mkl-bIcnvbaLZ9MEGnE8C9+IrQ,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	morinaga526-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, chripell-VaTbYqLCNhc,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CAC3D94.9010408@grandegger.com>

Hi Wolfgang,

We have implemented our CAN driver with FIFO mode, and
We are testing our CAN driver with FIFO mode.
However, we have found Our CAN hardware spec is different from our anticipated.
Our CAN HW FIFO is not common FIFO.
Using  FIFO mode, there is possibility received packets are out-of-order.

e.g.
Recv packet-A from NW and set to FIFO.
  |A|

Recv packet-B from NW and set to FIFO.
  |A|B|

Recv packet-C is about to set to FIFO
  |A|B|(C)|

Userspace Copies A from Driver
Userspace Copies B from Driver
  |   |   |(C)|

packet-C set to FIFO (C is not head.)
Recv packet-D from NW(Next packet is set to head)
  |D|   |C|

Userspace Copies D from Driver
Userspace Copies C from Driver
Userspace raceived packet order is like below
A-B-D-C

So, I think normal-mode is better than FIFO-mode.

I will revert like the following spec.
  Rx 1 Message Object
  Tx 1 Message Object

Could you agree the above ?

Thanks, Ohtake(OKISemi)
----- Original Message ----- 
From: "Wolfgang Grandegger" <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
To: "David Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: <masa-korg-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>; <andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; <sameo-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>; <margie.foster-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>;
<netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; <yong.y.wang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; <socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org>;
<mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>; <chripell-VaTbYqLCNhc@public.gmane.org>; <morinaga526-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>; <meego-dev-WXzIur8shnEAvxtiuMwx3w@public.gmane.org>;
<kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; <joel.clark-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; <qi.wang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Sent: Wednesday, October 06, 2010 6:12 PM
Subject: Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35


> On 10/06/2010 05:09 AM, David Miller wrote:
> > From: "Masayuki Ohtake" <masa-korg-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
> > Date: Wed, 6 Oct 2010 12:07:15 +0900
> >
> >> Does your mail mean, for accepting upstream, NAPI is essential for
> >> CAN driver ?
> >
> > It is up to the CAN maintainers :-)
>
> Well, our SJA1000 reference driver does still not use NAPI. But NAPI is
> for CAN especially useful to avoid the infamous *bus error irq
> flooding*, which may hang low end systems if the interrupts are handled
> in the IRQ context. Ohtake, if your system can handle well such CAN bus
> error irq storms at 1MB/s, then NAPI is *not* a must to have. Anyway, as
> you are at it, I also suggest to use NAPI right from the beginning.
>
> Wolfgang.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH] b44: fix resume, request_irq after hw reset
From: James Hogan @ 2010-10-12  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gary Zambrano, Jiri Pirko, FUJITA Tomonori, Hauke Mehrtens,
	Larry Finger, netdev, linux-kernel, David S. Miller
In-Reply-To: <20101011163440.072e0227.akpm@linux-foundation.org>

On Tuesday 12 October 2010 00:34:40 Andrew Morton wrote:
> On Tue, 12 Oct 2010 00:22:12 +0100
> 
> James Hogan <james@albanarts.com> wrote:
> > This driver was hanging on resume because it was requesting a shared irq
> > that it wasn't ready to immediately handle, which was tested in
> > request_irq because of the CONFIG_DEBUG_SHIRQ config option. The
> > interrupt handler tried to read the interrupt status register but for
> > some reason it hung the system.
> > 
> > The request_irq is now moved a bit later after resetting the hardware
> > which seems to fix it.
> > 
> > Signed-off-by: James Hogan <james@albanarts.com>
> > ---
> > 
> >  drivers/net/b44.c |   12 ++++++------
> >  1 files changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/b44.c b/drivers/net/b44.c
> > index 1e620e2..dbba981 100644
> > --- a/drivers/net/b44.c
> > +++ b/drivers/net/b44.c
> > @@ -2296,12 +2296,6 @@ static int b44_resume(struct ssb_device *sdev)
> > 
> >  	if (!netif_running(dev))
> >  	
> >  		return 0;
> > 
> > -	rc = request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, 
dev);
> > -	if (rc) {
> > -		netdev_err(dev, "request_irq failed\n");
> > -		return rc;
> > -	}
> > -
> > 
> >  	spin_lock_irq(&bp->lock);
> >  	
> >  	b44_init_rings(bp);
> > 
> > @@ -2309,6 +2303,12 @@ static int b44_resume(struct ssb_device *sdev)
> > 
> >  	netif_device_attach(bp->dev);
> >  	spin_unlock_irq(&bp->lock);
> > 
> > +	rc = request_irq(dev->irq, b44_interrupt, IRQF_SHARED, dev->name, 
dev);
> > +	if (rc) {
> > +		netdev_err(dev, "request_irq failed\n");
> > +		return rc;
> > +	}
> > +
> > 
> >  	b44_enable_ints(bp);
> >  	netif_wake_queue(dev);
> 
> OK, running the interrupt handler before b44_init_hw() is presumably
> the problem here.
> 
> The hardware surely won't be generating interrupts until we've run
> b44_init_hw() and b44_enable_ints(), so this patch really is only to
> keep CONFIG_DEBUG_SHIRQ happy.

For me it's mainly to keep CONFIG_DEBUG_SHIRQ happy (Fedora has this switched 
on), but since it's a shared IRQ, there is still a chance it could be 
called before enabling it's own interrupts by a different device on the same 
IRQ.

It makes sense to me why it's disabling the IRQ now, in case another device 
triggers it when it cannot handle it safely. I also tried calling the 
interrupt directly before the free_irq in the suspend function to check that 
it wasn't being done too late, and it didn't fail, so possibly it is the core 
suspension that makes it start failing until it is brought back up properly.

Cheers
James

^ permalink raw reply

* [PATCH] net: allow FEC driver to use fixed PHY support
From: Greg Ungerer @ 2010-10-12  7:03 UTC (permalink / raw)
  To: netdev; +Cc: gerg

At least one board using the FEC driver does not have a conventional
PHY attached to it, it is directly connected to a somewhat simple
ethernet switch (the board is the SnapGear/LITE, and the attached
4-port ethernet switch is a RealTek RTL8305). This switch does not
present the usual register interface of a PHY, it presents nothing.
So a PHY scan will find nothing - it finds ID's of 0 for each PHY
on the attached MII bus.

After the FEC driver was changed to use phylib for supporting PHYs
it no longer works on this particular board/switch setup.

Add code support to use a fixed phy if no PHY is found on the MII bus.
This is based on the way the cpmac.c driver solved this same problem.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
---
 drivers/net/fec.c |   41 +++++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index 768b840..f45dc63 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -678,24 +678,37 @@ static int fec_enet_mii_probe(struct net_device *dev)
 {
 	struct fec_enet_private *fep = netdev_priv(dev);
 	struct phy_device *phy_dev = NULL;
-	int ret;
+	char mdio_bus_id[MII_BUS_ID_SIZE];
+	char phy_name[MII_BUS_ID_SIZE + 3];
+	int phy_id;
 
 	fep->phy_dev = NULL;
 
-	/* find the first phy */
-	phy_dev = phy_find_first(fep->mii_bus);
-	if (!phy_dev) {
-		printk(KERN_ERR "%s: no PHY found\n", dev->name);
-		return -ENODEV;
+	/* check for attached phy */
+	for (phy_id = 0; (phy_id < PHY_MAX_ADDR); phy_id++) {
+		if ((fep->mii_bus->phy_mask & (1 << phy_id)))
+			continue;
+		if (fep->mii_bus->phy_map[phy_id] == NULL)
+			continue;
+		if (fep->mii_bus->phy_map[phy_id]->phy_id == 0)
+			continue;
+		strncpy(mdio_bus_id, fep->mii_bus->id, MII_BUS_ID_SIZE);
+		break;
 	}
 
-	/* attach the mac to the phy */
-	ret = phy_connect_direct(dev, phy_dev,
-			     &fec_enet_adjust_link, 0,
-			     PHY_INTERFACE_MODE_MII);
-	if (ret) {
-		printk(KERN_ERR "%s: Could not attach to PHY\n", dev->name);
-		return ret;
+	if (phy_id >= PHY_MAX_ADDR) {
+		printk(KERN_INFO "%s: no PHY, assuming direct connection "
+			"to switch\n", dev->name);
+		strncpy(mdio_bus_id, "0", MII_BUS_ID_SIZE);
+		phy_id = 0;
+	}
+
+	snprintf(phy_name, MII_BUS_ID_SIZE, PHY_ID_FMT, mdio_bus_id, phy_id);
+	phy_dev = phy_connect(dev, phy_name, &fec_enet_adjust_link, 0,
+		PHY_INTERFACE_MODE_MII);
+	if (IS_ERR(phy_dev)) {
+		printk(KERN_ERR "%s: could not attach to PHY\n", dev->name);
+		return PTR_ERR(phy_dev);
 	}
 
 	/* mask with MAC supported features */
@@ -738,7 +751,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	fep->mii_bus->read = fec_enet_mdio_read;
 	fep->mii_bus->write = fec_enet_mdio_write;
 	fep->mii_bus->reset = fec_enet_mdio_reset;
-	snprintf(fep->mii_bus->id, MII_BUS_ID_SIZE, "%x", pdev->id);
+	snprintf(fep->mii_bus->id, MII_BUS_ID_SIZE, "%x", pdev->id + 1);
 	fep->mii_bus->priv = fep;
 	fep->mii_bus->parent = &pdev->dev;
 

^ permalink raw reply related

* Re: [PATCH net-next] net:  allocate skbs on local node
From: Eric Dumazet @ 2010-10-12  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter
In-Reply-To: <20101011230322.f0f6dd47.akpm@linux-foundation.org>

Le lundi 11 octobre 2010 à 23:03 -0700, Andrew Morton a écrit :
> On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > [PATCH net-next] net: allocate skbs on local node
> > 
> > commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> > allocating net drivers skbs on a given memory node : The one closest to
> > the NIC hardware. This is wrong because as soon as we try to scale
> > network stack, we need to use many cpus to handle traffic and hit
> > slub/slab management on cross-node allocations/frees when these cpus
> > have to alloc/free skbs bound to a central node.
> > 
> > skb allocated in RX path are ephemeral, they have a very short
> > lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> > appeared as a nice idea four years ago is in fact a bad one.
> > 
> > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> > and two 10Gb NIC might deliver more than 28 million packets per second,
> > needing all the available cpus.
> > 
> > Cost of cross-node handling in network and vm stacks outperforms the
> > small benefit hardware had when doing its DMA transfert in its 'local'
> > memory node at RX time. Even trying to differentiate the two allocations
> > done for one skb (the sk_buff on local node, the data part on NIC
> > hardware node) is not enough to bring good performance.
> > 
> 
> This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> 

I would say, _you_ should prove that original patch was good. It seems
no network guy was really in the discussion ?

Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the
difference. Thats about a 40% slowdown on high packet rates, on a dual
socket machine (dual X5570  @2.93GHz). You can expect higher values on
four nodes (I dont have such hardware to do the test)


> The mooted effects should be tested for on both slab and slub, I
> suggest.  They're pretty different beasts.

SLAB is so slow on NUMA these days, you can forget it for good.

Its about 40% slower on some tests I did this week on net-next, to
speedup output (and routing) performance, so it was with normal (local)
allocations, not even cross-nodes ones.

Once you remove network bottlenecks, you badly hit contention on SLAB
and are forced to switch to SLUB ;)

Sending 160.000.000 udp frames on same neighbour/destination,
IP route cache disabled (to mimic DDOS on a router)
16 threads, 16 logical cpus. 32bit kernel (dual E5540  @ 2.53GHz)


(It takes more than 2 minutes with linux-2.6, so use net-next-2.6 if you
really want to get these numbers)

SLUB :

real	0m50.661s
user	0m15.973s
sys	11m42.548s


18348.00 21.4% dst_destroy             vmlinux
 5674.00  6.6% fib_table_lookup        vmlinux
 5563.00  6.5% dst_alloc               vmlinux
 5226.00  6.1% neigh_lookup            vmlinux
 3590.00  4.2% __ip_route_output_key   vmlinux
 2712.00  3.2% neigh_resolve_output    vmlinux
 2511.00  2.9% fib_semantic_match      vmlinux
 2488.00  2.9% ipv4_dst_destroy        vmlinux
 2206.00  2.6% __xfrm_lookup           vmlinux
 2119.00  2.5% memset                  vmlinux
 2015.00  2.4% __copy_from_user_ll     vmlinux
 1722.00  2.0% udp_sendmsg             vmlinux
 1679.00  2.0% __slab_free             vmlinux
 1152.00  1.3% ip_append_data          vmlinux
 1044.00  1.2% __alloc_skb             vmlinux
  952.00  1.1% kmem_cache_free         vmlinux
  942.00  1.1% udp_push_pending_frames vmlinux
  877.00  1.0% kfree                   vmlinux
  870.00  1.0% __call_rcu              vmlinux
  829.00  1.0% ip_push_pending_frames  vmlinux
  799.00  0.9% _raw_spin_lock_bh       vmlinux

SLAB:

real	1m10.771s
user	0m13.941s
sys	12m42.188s


22734.00 26.0% _raw_spin_lock          vmlinux
 8238.00  9.4% dst_destroy             vmlinux
 4393.00  5.0% fib_table_lookup        vmlinux
 3652.00  4.2% dst_alloc               vmlinux
 3335.00  3.8% neigh_lookup            vmlinux
 2444.00  2.8% memset                  vmlinux
 2443.00  2.8% __ip_route_output_key   vmlinux
 1916.00  2.2% fib_semantic_match      vmlinux
 1708.00  2.0% __copy_from_user_ll     vmlinux
 1669.00  1.9% __xfrm_lookup           vmlinux
 1642.00  1.9% free_block              vmlinux
 1554.00  1.8% neigh_resolve_output    vmlinux
 1388.00  1.6% ipv4_dst_destroy        vmlinux
 1335.00  1.5% udp_sendmsg             vmlinux
 1109.00  1.3% kmem_cache_free         vmlinux
 1007.00  1.2% __alloc_skb             vmlinux
 1004.00  1.1% kfree                   vmlinux
 1002.00  1.1% ip_append_data          vmlinux
  975.00  1.1% cache_grow              vmlinux
  936.00  1.1% ____cache_alloc_node    vmlinux
  925.00  1.1% udp_push_pending_frames vmlinux


All this raw_spin_lock overhead comes from SLAB.



^ permalink raw reply

* Re: [PATCH net-next] net:  allocate skbs on local node
From: Andrew Morton @ 2010-10-12  6:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Christoph Hellwig, Christoph Lameter
In-Reply-To: <1286859925.30423.184.camel@edumazet-laptop>

On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 12 octobre 2010 __ 01:22 +0200, Eric Dumazet a __crit :
> > Le mardi 12 octobre 2010 __ 01:03 +0200, Eric Dumazet a __crit :
> > > 
> > > For multi queue devices, it makes more sense to allocate skb on local
> > > node of the cpu handling RX interrupts. This allow each cpu to
> > > manipulate its own slub/slab queues/structures without doing expensive
> > > cross-node business.
> > > 
> > > For non multi queue devices, IRQ affinity should be set so that a cpu
> > > close to the device services interrupts. Even if not set, using
> > > dev_alloc_skb() is faster.
> > > 
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > 
> > Or maybe revert :
> > 
> > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Wed Dec 6 20:32:36 2006 -0800
> > 
> >     [PATCH] node-aware skb allocation
> >     
> >     Node-aware allocation of skbs for the receive path.
> >     
> >     Details:
> >     
> >       - __alloc_skb gets a new node argument and cals the node-aware
> >         slab functions with it.
> >       - netdev_alloc_skb passed the node number it gets from dev_to_node
> >         to it, everyone else passes -1 (any node)
> >     
> >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >     Cc: Christoph Lameter <clameter@engr.sgi.com>
> >     Cc: "David S. Miller" <davem@davemloft.net>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> > 
> > 
> > Apparently, only Christoph and Andrew signed it.
> > 
> > 
> 
> [PATCH net-next] net: allocate skbs on local node
> 
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
> 
> skb allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
> 
> In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> and two 10Gb NIC might deliver more than 28 million packets per second,
> needing all the available cpus.
> 
> Cost of cross-node handling in network and vm stacks outperforms the
> small benefit hardware had when doing its DMA transfert in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
> 

This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)

The mooted effects should be tested for on both slab and slub, I
suggest.  They're pretty different beasts.

^ permalink raw reply

* Re: linux-next: manual merge of the net tree with the  tree
From: Stefan Richter @ 2010-10-12  5:56 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: netdev, linux-kernel, linux-next, Ben Hutchings, linux1394-devel,
	David Miller
In-Reply-To: <20101012145135.9697fe21.sfr@canb.auug.org.au>

Stephen Rothwell wrote:
> Hi all,
> 
> On Tue, 12 Oct 2010 11:17:16 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>> Today's linux-next merge of the net tree got a conflict in
>> drivers/ieee1394/eth1394.c between commit
>> 66fa12c571d35e3cd62574c65f1785a460105397 ("ieee1394: remove the old IEEE
>> 1394 driver stack") from the  tree and commit
>                                ^
> I meant to say "the ieee1394 tree".

(I should have put the ieee1394 removal into -next much earlier.  OTOH the
conflicting commit is from August 17 already...)

Would it be useful if I rebase this small ieee1394 removal branch of mine onto
net?
-- 
Stefan Richter
-=====-==-=- =-=- -==--
http://arcgraph.de/sr/

------------------------------------------------------------------------------
Beautiful is writing same markup. Internet Explorer 9 supports
standards for HTML5, CSS3, SVG 1.1,  ECMAScript5, and DOM L2 & L3.
Spend less time writing and  rewriting code and more time creating great
experiences on the web. Be a part of the beta today.
http://p.sf.net/sfu/beautyoftheweb

^ permalink raw reply

* Re: [PATCH net-next] net: allocate skbs on local node
From: Tom Herbert @ 2010-10-12  5:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Michael Chan, Eilon Greenstein,
	Andrew Morton, Christoph Hellwig, Christoph Lameter
In-Reply-To: <1286859925.30423.184.camel@edumazet-laptop>

Acked-by: Tom Herbert <therbert@google.com>

On Mon, Oct 11, 2010 at 10:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit :
>> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
>> >
>> > For multi queue devices, it makes more sense to allocate skb on local
>> > node of the cpu handling RX interrupts. This allow each cpu to
>> > manipulate its own slub/slab queues/structures without doing expensive
>> > cross-node business.
>> >
>> > For non multi queue devices, IRQ affinity should be set so that a cpu
>> > close to the device services interrupts. Even if not set, using
>> > dev_alloc_skb() is faster.
>> >
>> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>
>> Or maybe revert :
>>
>> commit b30973f877fea1a3fb84e05599890fcc082a88e5
>> Author: Christoph Hellwig <hch@lst.de>
>> Date:   Wed Dec 6 20:32:36 2006 -0800
>>
>>     [PATCH] node-aware skb allocation
>>
>>     Node-aware allocation of skbs for the receive path.
>>
>>     Details:
>>
>>       - __alloc_skb gets a new node argument and cals the node-aware
>>         slab functions with it.
>>       - netdev_alloc_skb passed the node number it gets from dev_to_node
>>         to it, everyone else passes -1 (any node)
>>
>>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>>     Cc: Christoph Lameter <clameter@engr.sgi.com>
>>     Cc: "David S. Miller" <davem@davemloft.net>
>>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>>
>>
>> Apparently, only Christoph and Andrew signed it.
>>
>>
>
> [PATCH net-next] net: allocate skbs on local node
>
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net drivers skbs on a given memory node : The one closest to
> the NIC hardware. This is wrong because as soon as we try to scale
> network stack, we need to use many cpus to handle traffic and hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
>
> skb allocated in RX path are ephemeral, they have a very short
> lifetime : Extra cost to maintain NUMA affinity is too expensive. What
> appeared as a nice idea four years ago is in fact a bad one.
>
> In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
> and two 10Gb NIC might deliver more than 28 million packets per second,
> needing all the available cpus.
>
> Cost of cross-node handling in network and vm stacks outperforms the
> small benefit hardware had when doing its DMA transfert in its 'local'
> memory node at RX time. Even trying to differentiate the two allocations
> done for one skb (the sk_buff on local node, the data part on NIC
> hardware node) is not enough to bring good performance.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/skbuff.h |   20 ++++++++++++++++----
>  net/core/skbuff.c      |   13 +------------
>  2 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0b53c43..05a358f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
>  static inline struct sk_buff *alloc_skb(unsigned int size,
>                                        gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 0, -1);
> +       return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
>  }
>
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>                                               gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 1, -1);
> +       return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
>  }
>
>  extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
> @@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
>        return skb;
>  }
>
> -extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
> +/**
> + *     __netdev_alloc_page - allocate a page for ps-rx on a specific device
> + *     @dev: network device to receive on
> + *     @gfp_mask: alloc_pages_node mask
> + *
> + *     Allocate a new page. dev currently unused.
> + *
> + *     %NULL is returned if there is no free memory.
> + */
> +static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> +{
> +       return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
> +}
>
>  /**
>  *     netdev_alloc_page - allocate a page for ps-rx on a specific device
>  *     @dev: network device to receive on
>  *
> - *     Allocate a new page node local to the specified device.
> + *     Allocate a new page. dev currently unused.
>  *
>  *     %NULL is returned if there is no free memory.
>  */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 752c197..4e8b82e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
>  struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>                unsigned int length, gfp_t gfp_mask)
>  {
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>        struct sk_buff *skb;
>
> -       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> +       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
>        if (likely(skb)) {
>                skb_reserve(skb, NET_SKB_PAD);
>                skb->dev = dev;
> @@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  }
>  EXPORT_SYMBOL(__netdev_alloc_skb);
>
> -struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> -{
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
> -       struct page *page;
> -
> -       page = alloc_pages_node(node, gfp_mask, 0);
> -       return page;
> -}
> -EXPORT_SYMBOL(__netdev_alloc_page);
> -
>  void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
>                int size)
>  {
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH 4/6] dccp: generalise data-loss condition
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Ivo Calado
In-Reply-To: <1286860551-6809-4-git-send-email-gerrit@erg.abdn.ac.uk>

From: Ivo Calado <ivocalado@embedded.ufcg.edu.br>

This patch generalises the task of determining data loss from RFC 4340, 7.7.1.

Let S_A, S_B be sequence numbers such that S_B is "after" S_A, and let
N_B be the NDP count of packet S_B. Then, using modulo-2^48 arithmetic,
 D = S_B - S_A - 1  is an upper bound of the number of lost data packets,
 D - N_B            is an approximation of the number of lost data packets
                    (there are cases where this is not exact).

The patch implements this as
 dccp_loss_count(S_A, S_B, N_B) := max(S_B - S_A - 1 - N_B, 0)

Signed-off-by: Ivo Calado <ivocalado@embedded.ufcg.edu.br>
Signed-off-by: Erivaldo Xavier <desadoc@gmail.com>
Signed-off-by: Leandro Sales <leandroal@gmail.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/dccp.h |   21 +++++++++++++++------
 1 files changed, 15 insertions(+), 6 deletions(-)

--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -153,18 +153,27 @@ static inline u64 max48(const u64 seq1, const u64 seq2)
 }
 
 /**
- * dccp_loss_free  -  Evaluates condition for data loss from RFC 4340, 7.7.1
- * @s1:	 start sequence number
- * @s2:  end sequence number
+ * dccp_loss_count - Approximate the number of lost data packets in a burst loss
+ * @s1:  last known sequence number before the loss ('hole')
+ * @s2:  first sequence number seen after the 'hole'
  * @ndp: NDP count on packet with sequence number @s2
- * Returns true if the sequence range s1...s2 has no data loss.
  */
-static inline bool dccp_loss_free(const u64 s1, const u64 s2, const u64 ndp)
+static inline u64 dccp_loss_count(const u64 s1, const u64 s2, const u64 ndp)
 {
 	s64 delta = dccp_delta_seqno(s1, s2);
 
 	WARN_ON(delta < 0);
-	return (u64)delta <= ndp + 1;
+	delta -= ndp + 1;
+
+	return delta > 0 ? delta : 0;
+}
+
+/**
+ * dccp_loss_free - Evaluate condition for data loss from RFC 4340, 7.7.1
+ */
+static inline bool dccp_loss_free(const u64 s1, const u64 s2, const u64 ndp)
+{
+	return dccp_loss_count(s1, s2, ndp) == 0;
 }
 
 enum {

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
From: Eric Dumazet @ 2010-10-12  5:16 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev, Michael Chan, Eilon Greenstein
In-Reply-To: <AANLkTi=NvnPov_=KQJz2uXt1R-ep4oahDF0J=3HwhGzG@mail.gmail.com>

Le lundi 11 octobre 2010 à 22:03 -0700, Tom Herbert a écrit :
> On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> >> netdev_alloc_skb() is a very wrong interface, really.
> >>
> >> We should remove/deprecate it.
> >>
> >> For multi queue devices, it makes more sense to allocate skb on local
> >> node of the cpu handling RX interrupts. This allow each cpu to
> >> manipulate its own slub/slab queues/structures without doing expensive
> >> cross-node business.
> >>
> >> For non multi queue devices, IRQ affinity should be set so that a cpu
> >> close to the device services interrupts. Even if not set, using
> >> dev_alloc_skb() is faster.
> >>
> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Or maybe revert :
> >
> > commit b30973f877fea1a3fb84e05599890fcc082a88e5
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Wed Dec 6 20:32:36 2006 -0800
> >
> 
> I second this revert.  Node aware allocation by device's node makes
> little sense on a multi-queue device and leads to mediocre
> performance.

Yes, I said this several time in the past, I believe time has come to
get rid of it.

I posted a patch some minutes ago, so you can review it and ack it ;)

Thanks !



^ permalink raw reply

* [PATCH 1/6] dccp: fix the adjustments to AWL and SWL
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1286860551-6809-1-git-send-email-gerrit@erg.abdn.ac.uk>

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: currently the initial adjustments to AWL/SWL are only performed
once at the begin of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore it is better to perform that safety check each time SWL/AWL are
updated, as implemented by the patch.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection;
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/dccp.h      |   20 ++++++++++++++++++++
 net/dccp/input.c     |   18 ++++++------------
 net/dccp/minisocks.c |   30 +++++++++---------------------
 3 files changed, 35 insertions(+), 33 deletions(-)

--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -414,6 +414,23 @@ static inline void dccp_update_gsr(struct sock *sk, u64 seq)
 	dp->dccps_gsr = seq;
 	/* Sequence validity window depends on remote Sequence Window (7.5.1) */
 	dp->dccps_swl = SUB48(ADD48(dp->dccps_gsr, 1), dp->dccps_r_seq_win / 4);
+	/*
+	 * Adjust SWL so that it is not below ISR. In contrast to RFC 4340,
+	 * 7.5.1 we perform this check beyond the initial handshake: W/W' are
+	 * always > 32, so for the first W/W' packets in the lifetime of a
+	 * connection we always have to adjust SWL.
+	 * A second reason why we are doing this is that the window depends on
+	 * the feature-remote value of Sequence Window: nothing stops the peer
+	 * from updating this value while we are busy adjusting SWL for the
+	 * first W packets (we would have to count from scratch again then).
+	 * Therefore it is safer to always make sure that the Sequence Window
+	 * is not artificially extended by a peer who grows SWL downwards by
+	 * continually updating the feature-remote Sequence-Window.
+	 * If sequence numbers wrap it is bad luck. But that will take a while
+	 * (48 bit), and this measure prevents Sequence-number attacks.
+	 */
+	if (before48(dp->dccps_swl, dp->dccps_isr))
+		dp->dccps_swl = dp->dccps_isr;
 	dp->dccps_swh = ADD48(dp->dccps_gsr, (3 * dp->dccps_r_seq_win) / 4);
 }
 
@@ -424,6 +441,9 @@ static inline void dccp_update_gss(struct sock *sk, u64 seq)
 	dp->dccps_gss = seq;
 	/* Ack validity window depends on local Sequence Window value (7.5.1) */
 	dp->dccps_awl = SUB48(ADD48(dp->dccps_gss, 1), dp->dccps_l_seq_win);
+	/* Adjust AWL so that it is not below ISS - see comment above for SWL */
+	if (before48(dp->dccps_awl, dp->dccps_iss))
+		dp->dccps_awl = dp->dccps_iss;
 	dp->dccps_awh = dp->dccps_gss;
 }
 
--- a/net/dccp/input.c
+++ b/net/dccp/input.c
@@ -441,20 +441,14 @@ static int dccp_rcv_request_sent_state_process(struct sock *sk,
 		kfree_skb(sk->sk_send_head);
 		sk->sk_send_head = NULL;
 
-		dp->dccps_isr = DCCP_SKB_CB(skb)->dccpd_seq;
-		dccp_update_gsr(sk, dp->dccps_isr);
 		/*
-		 * SWL and AWL are initially adjusted so that they are not less than
-		 * the initial Sequence Numbers received and sent, respectively:
-		 *	SWL := max(GSR + 1 - floor(W/4), ISR),
-		 *	AWL := max(GSS - W' + 1, ISS).
-		 * These adjustments MUST be applied only at the beginning of the
-		 * connection.
-		 *
-		 * AWL was adjusted in dccp_v4_connect -acme
+		 * Set ISR, GSR from packet. ISS was set in dccp_v{4,6}_connect
+		 * and GSS in dccp_transmit_skb(). Setting AWL/AWH and SWL/SWH
+		 * is done as part of activating the feature values below, since
+		 * these settings depend on the local/remote Sequence Window
+		 * features, which were undefined or not confirmed until now.
 		 */
-		dccp_set_seqno(&dp->dccps_swl,
-			       max48(dp->dccps_swl, dp->dccps_isr));
+		dp->dccps_gsr = dp->dccps_isr = DCCP_SKB_CB(skb)->dccpd_seq;
 
 		dccp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -121,30 +121,18 @@ struct sock *dccp_create_openreq_child(struct sock *sk,
 		 *
 		 *    Choose S.ISS (initial seqno) or set from Init Cookies
 		 *    Initialize S.GAR := S.ISS
-		 *    Set S.ISR, S.GSR, S.SWL, S.SWH from packet or Init Cookies
-		 */
-		newdp->dccps_gar = newdp->dccps_iss = dreq->dreq_iss;
-		dccp_update_gss(newsk, dreq->dreq_iss);
-
-		newdp->dccps_isr = dreq->dreq_isr;
-		dccp_update_gsr(newsk, dreq->dreq_isr);
-
-		/*
-		 * SWL and AWL are initially adjusted so that they are not less than
-		 * the initial Sequence Numbers received and sent, respectively:
-		 *	SWL := max(GSR + 1 - floor(W/4), ISR),
-		 *	AWL := max(GSS - W' + 1, ISS).
-		 * These adjustments MUST be applied only at the beginning of the
-		 * connection.
+		 *    Set S.ISR, S.GSR from packet (or Init Cookies)
+		 *
+		 *    Setting AWL/AWH and SWL/SWH happens as part of the feature
+		 *    activation below, as these windows all depend on the local
+		 *    and remote Sequence Window feature values (7.5.2).
 		 */
-		dccp_set_seqno(&newdp->dccps_swl,
-			       max48(newdp->dccps_swl, newdp->dccps_isr));
-		dccp_set_seqno(&newdp->dccps_awl,
-			       max48(newdp->dccps_awl, newdp->dccps_iss));
+		newdp->dccps_gss = newdp->dccps_iss = dreq->dreq_iss;
+		newdp->dccps_gar = newdp->dccps_iss;
+		newdp->dccps_gsr = newdp->dccps_isr = dreq->dreq_isr;
 
 		/*
-		 * Activate features after initialising the sequence numbers,
-		 * since CCID initialisation may depend on GSS, ISR, ISS etc.
+		 * Activate features: initialise CCIDs, sequence windows etc.
 		 */
 		if (dccp_feat_activate_values(newsk, &dreq->dreq_featneg)) {
 			/* It is still raw copy of parent, so invalidate

^ permalink raw reply

* [PATCH 3/6] dccp: remove unused argument in CCID tx function
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1286860551-6809-3-git-send-email-gerrit@erg.abdn.ac.uk>

This removes the argument `more' from ccid_hc_tx_packet_sent, since it was
nowhere used in the entire code.

(Btw, this argument was not even used in the original KAME code where the
 function initially came from; compare the variable moreToSend in the
 freebsd61-dccp-kame-28.08.2006.patch kept by Emmanuel Lochin.)

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/ccid.h        |    6 +++---
 net/dccp/ccids/ccid2.c |    2 +-
 net/dccp/ccids/ccid3.c |    3 +--
 net/dccp/output.c      |    2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

--- a/net/dccp/ccid.h
+++ b/net/dccp/ccid.h
@@ -73,7 +73,7 @@ struct ccid_operations {
 	int		(*ccid_hc_tx_send_packet)(struct sock *sk,
 						  struct sk_buff *skb);
 	void		(*ccid_hc_tx_packet_sent)(struct sock *sk,
-						  int more, unsigned int len);
+						  unsigned int len);
 	void		(*ccid_hc_rx_get_info)(struct sock *sk,
 					       struct tcp_info *info);
 	void		(*ccid_hc_tx_get_info)(struct sock *sk,
@@ -144,10 +144,10 @@ static inline int ccid_hc_tx_send_packet(struct ccid *ccid, struct sock *sk,
 }
 
 static inline void ccid_hc_tx_packet_sent(struct ccid *ccid, struct sock *sk,
-					  int more, unsigned int len)
+					  unsigned int len)
 {
 	if (ccid->ccid_ops->ccid_hc_tx_packet_sent != NULL)
-		ccid->ccid_ops->ccid_hc_tx_packet_sent(sk, more, len);
+		ccid->ccid_ops->ccid_hc_tx_packet_sent(sk, len);
 }
 
 static inline void ccid_hc_rx_packet_recv(struct ccid *ccid, struct sock *sk,
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -151,7 +151,7 @@ out:
 	sock_put(sk);
 }
 
-static void ccid2_hc_tx_packet_sent(struct sock *sk, int more, unsigned int len)
+static void ccid2_hc_tx_packet_sent(struct sock *sk, unsigned int len)
 {
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -351,8 +351,7 @@ static int ccid3_hc_tx_send_packet(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static void ccid3_hc_tx_packet_sent(struct sock *sk, int more,
-				    unsigned int len)
+static void ccid3_hc_tx_packet_sent(struct sock *sk, unsigned int len)
 {
 	struct ccid3_hc_tx_sock *hc = ccid3_hc_tx_sk(sk);
 
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -304,7 +304,7 @@ void dccp_write_xmit(struct sock *sk, int block)
 				dcb->dccpd_type = DCCP_PKT_DATA;
 
 			err = dccp_transmit_skb(sk, skb);
-			ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, 0, len);
+			ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, len);
 			if (err)
 				DCCP_BUG("err=%d after ccid_hc_tx_packet_sent",
 					 err);

^ permalink raw reply

* [PATCH 2/6] dccp: merge now-reduced connect_init() function
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1286860551-6809-2-git-send-email-gerrit@erg.abdn.ac.uk>

After moving the assignment of GAR/ISS from dccp_connect_init() to
dccp_transmit_skb(), the former function becomes very small, so that
a merger with dccp_connect() suggests itself.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/output.c |   18 +++++-------------
 1 files changed, 5 insertions(+), 13 deletions(-)

--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -474,8 +474,9 @@ int dccp_send_reset(struct sock *sk, enum dccp_reset_codes code)
 /*
  * Do all connect socket setups that can be done AF independent.
  */
-static inline void dccp_connect_init(struct sock *sk)
+int dccp_connect(struct sock *sk)
 {
+	struct sk_buff *skb;
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct dst_entry *dst = __sk_dst_get(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -485,22 +486,12 @@ static inline void dccp_connect_init(struct sock *sk)
 
 	dccp_sync_mss(sk, dst_mtu(dst));
 
-	/* Initialise GAR as per 8.5; AWL/AWH are set in dccp_transmit_skb() */
-	dp->dccps_gar = dp->dccps_iss;
-
-	icsk->icsk_retransmits = 0;
-}
-
-int dccp_connect(struct sock *sk)
-{
-	struct sk_buff *skb;
-	struct inet_connection_sock *icsk = inet_csk(sk);
-
 	/* do not connect if feature negotiation setup fails */
 	if (dccp_feat_finalise_settings(dccp_sk(sk)))
 		return -EPROTO;
 
-	dccp_connect_init(sk);
+	/* Initialise GAR as per 8.5; AWL/AWH are set in dccp_transmit_skb() */
+	dp->dccps_gar = dp->dccps_iss;
 
 	skb = alloc_skb(sk->sk_prot->max_header, sk->sk_allocation);
 	if (unlikely(skb == NULL))
@@ -516,6 +507,7 @@ int dccp_connect(struct sock *sk)
 	DCCP_INC_STATS(DCCP_MIB_ACTIVEOPENS);
 
 	/* Timer for repeating the REQUEST until an answer. */
+	icsk->icsk_retransmits = 0;
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 				  icsk->icsk_rto, DCCP_RTO_MAX);
 	return 0;

^ permalink raw reply

* [PATCH 6/6] dccp: cosmetics - warning format
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1286860551-6809-6-git-send-email-gerrit@erg.abdn.ac.uk>

This  omits the redundant "DCCP:" in warning messages, since DCCP_WARN() already
echoes the function name, avoiding messages like

   kernel: [10988.766503] dccp_close: DCCP: ABORT -- 209 bytes unread

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/input.c |    2 +-
 net/dccp/proto.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/net/dccp/input.c
+++ b/net/dccp/input.c
@@ -259,7 +259,7 @@ static int dccp_check_seqno(struct sock *sk, struct sk_buff *skb)
 				      sysctl_dccp_sync_ratelimit)))
 			return 0;
 
-		DCCP_WARN("DCCP: Step 6 failed for %s packet, "
+		DCCP_WARN("Step 6 failed for %s packet, "
 			  "(LSWL(%llu) <= P.seqno(%llu) <= S.SWH(%llu)) and "
 			  "(P.ackno %s or LAWL(%llu) <= P.ackno(%llu) <= S.AWH(%llu), "
 			  "sending SYNC...\n",  dccp_packet_name(dh->dccph_type),
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -944,7 +944,7 @@ void dccp_close(struct sock *sk, long timeout)
 
 	if (data_was_unread) {
 		/* Unread data was tossed, send an appropriate Reset Code */
-		DCCP_WARN("DCCP: ABORT -- %u bytes unread\n", data_was_unread);
+		DCCP_WARN("ABORT with %u bytes unread\n", data_was_unread);
 		dccp_send_reset(sk, DCCP_RESET_CODE_ABORTED);
 		dccp_set_state(sk, DCCP_CLOSED);
 	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {

^ permalink raw reply

* [PATCH 5/6] dccp: schedule an Ack when receiving timestamps
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1286860551-6809-5-git-send-email-gerrit@erg.abdn.ac.uk>

This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/dccp.h    |    3 +--
 net/dccp/options.c |    2 ++
 2 files changed, 3 insertions(+), 2 deletions(-)

--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -459,8 +459,7 @@ static inline void dccp_update_gss(struct sock *sk, u64 seq)
 static inline int dccp_ack_pending(const struct sock *sk)
 {
 	const struct dccp_sock *dp = dccp_sk(sk);
-	return dp->dccps_timestamp_echo != 0 ||
-	       (dp->dccps_hc_rx_ackvec != NULL &&
+	return (dp->dccps_hc_rx_ackvec != NULL &&
 		dccp_ackvec_pending(dp->dccps_hc_rx_ackvec)) ||
 	       inet_csk_ack_scheduled(sk);
 }
--- a/net/dccp/options.c
+++ b/net/dccp/options.c
@@ -163,6 +163,8 @@ int dccp_parse_options(struct sock *sk, struct dccp_request_sock *dreq,
 				      dccp_role(sk), ntohl(opt_val),
 				      (unsigned long long)
 				      DCCP_SKB_CB(skb)->dccpd_ack_seq);
+			/* schedule an Ack in case this sender is quiescent */
+			inet_csk_schedule_ack(sk);
 			break;
 		case DCCPO_TIMESTAMP_ECHO:
 			if (len != 4 && len != 6 && len != 8)

^ permalink raw reply

* net-next-2.6 [PATCH 0/6] dccp: miscellaneous fixes and helper functions
From: Gerrit Renker @ 2010-10-12  5:15 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev
In-Reply-To: <dccp_misc_fixes_and_helper_functions___a_set_of_six_patches_sent_today>

Dave,

please can you consider the following set of DCCP patches: the first is a bug
fix, the rest perform refactoring and provide helper functions.

 Patch #1: fixes two problems in the DCCP  (AWL/SWL) window-size adjustment.
 Patch #2: refactors the connect_init() function by combining related code.
 Patch #3: removes an never-user argument from the CCID tx function.
 Patch #4: provides a function to generalize the data-loss condition 
           (patch originally came from the ccid-4 subtree).
 Patch #5: schedules an Ack as carrier for a pending timestamp echo.
 Patch #6: tidies up the format of DCCP per-connection warnings.

The set has also been placed into a fresh (today's) copy of net-next-2.6, on

    git://eden-feed.erg.abdn.ac.uk/net-next-2.6        [subtree 'dccp']

All patches have been in the test tree for a long while and are bisectable.

^ permalink raw reply

* [PATCH net-next] net:  allocate skbs on local node
From: Eric Dumazet @ 2010-10-12  5:05 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Michael Chan, Eilon Greenstein, Andrew Morton,
	Christoph Hellwig, Christoph Lameter
In-Reply-To: <1286839363.30423.130.camel@edumazet-laptop>

Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit :
> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
> > 
> > For multi queue devices, it makes more sense to allocate skb on local
> > node of the cpu handling RX interrupts. This allow each cpu to
> > manipulate its own slub/slab queues/structures without doing expensive
> > cross-node business.
> > 
> > For non multi queue devices, IRQ affinity should be set so that a cpu
> > close to the device services interrupts. Even if not set, using
> > dev_alloc_skb() is faster.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Or maybe revert :
> 
> commit b30973f877fea1a3fb84e05599890fcc082a88e5
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Wed Dec 6 20:32:36 2006 -0800
> 
>     [PATCH] node-aware skb allocation
>     
>     Node-aware allocation of skbs for the receive path.
>     
>     Details:
>     
>       - __alloc_skb gets a new node argument and cals the node-aware
>         slab functions with it.
>       - netdev_alloc_skb passed the node number it gets from dev_to_node
>         to it, everyone else passes -1 (any node)
>     
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Cc: Christoph Lameter <clameter@engr.sgi.com>
>     Cc: "David S. Miller" <davem@davemloft.net>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
> 
> 
> Apparently, only Christoph and Andrew signed it.
> 
> 

[PATCH net-next] net: allocate skbs on local node

commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   20 ++++++++++++++++----
 net/core/skbuff.c      |   13 +------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0b53c43..05a358f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+/**
+ *	__netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *	@gfp_mask: alloc_pages_node mask
+ *
+ * 	Allocate a new page. dev currently unused.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+}
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
  *	@dev: network device to receive on
  *
- * 	Allocate a new page node local to the specified device.
+ * 	Allocate a new page. dev currently unused.
  *
  * 	%NULL is returned if there is no free memory.
  */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..4e8b82e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
-struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
-	struct page *page;
-
-	page = alloc_pages_node(node, gfp_mask, 0);
-	return page;
-}
-EXPORT_SYMBOL(__netdev_alloc_page);
-
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {



^ permalink raw reply related

* Re: [PATCH net-next] bnx2x: dont use netdev_alloc_skb()
From: Tom Herbert @ 2010-10-12  5:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Michael Chan, Eilon Greenstein
In-Reply-To: <1286839363.30423.130.camel@edumazet-laptop>

On Mon, Oct 11, 2010 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit :
>> netdev_alloc_skb() is a very wrong interface, really.
>>
>> We should remove/deprecate it.
>>
>> For multi queue devices, it makes more sense to allocate skb on local
>> node of the cpu handling RX interrupts. This allow each cpu to
>> manipulate its own slub/slab queues/structures without doing expensive
>> cross-node business.
>>
>> For non multi queue devices, IRQ affinity should be set so that a cpu
>> close to the device services interrupts. Even if not set, using
>> dev_alloc_skb() is faster.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Or maybe revert :
>
> commit b30973f877fea1a3fb84e05599890fcc082a88e5
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Wed Dec 6 20:32:36 2006 -0800
>

I second this revert.  Node aware allocation by device's node makes
little sense on a multi-queue device and leads to mediocre
performance.

>    [PATCH] node-aware skb allocation
>
>    Node-aware allocation of skbs for the receive path.
>
>    Details:
>
>      - __alloc_skb gets a new node argument and cals the node-aware
>        slab functions with it.
>      - netdev_alloc_skb passed the node number it gets from dev_to_node
>        to it, everyone else passes -1 (any node)
>
>    Signed-off-by: Christoph Hellwig <hch@lst.de>
>    Cc: Christoph Lameter <clameter@engr.sgi.com>
>    Cc: "David S. Miller" <davem@davemloft.net>
>    Signed-off-by: Andrew Morton <akpm@osdl.org>
>
>
> Apparently, only Christoph and Andrew signed it.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox