Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
From: Maxime Bizon @ 2012-10-05 14:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins
In-Reply-To: <1349442173.21172.66.camel@edumazet-glaptop>


On Fri, 2012-10-05 at 15:02 +0200, Eric Dumazet wrote:

> On Fri, 2012-10-05 at 14:51 +0200, Maxime Bizon wrote:
> 
> > > New convention would be : pass number of needed bytes after current
> > > tail, not after current end.
> > 
> > Fully agree on this
> > 
> 
> Here is the proposal :

Looks fine

What is your plan for the actual pskb_expand_head() code now that you
will have absolute values for headroom & tailroom ?

Because there will still be callers that have no clue of required
tailroom (nor further headroom requirement), like skb_cow() in
vlan_reorder_header().

-- 
Maxime

^ permalink raw reply

* [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-05 14:52 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <CAEP_g=_nSb-ite51PM-E8SY53yOPiZs8N3gDrYNc0L4OU2Ht=A@mail.gmail.com>

Current GRO cell is somewhat limited :

- It uses a single list (napi->gro_list) of pending skbs

- This list has a limit of 8 skbs (MAX_GRO_SKBS)

- Workloads with lot of concurrent flows have small GRO hit rate but
  pay high overhead (in inet_gro_receive())

- Increasing MAX_GRO_SKBS is not an option, because GRO
  overhead becomes too high.

- Packets can stay a long time held in GRO cell (there is
  no flush if napi never completes on a stressed cpu)

  Some elephant flows can stall interactive ones (if we receive
  flood of non TCP frames, we dont flush tcp packets waiting in
gro_list)

What we could do :

1) Use a hash to avoid expensive gro_list management and allow
   much more concurrent flows.

Use skb_get_rxhash(skb) to compute rxhash

If l4_rxhash not set -> not a GRO candidate.

If l4_rxhash set, use a hash lookup to immediately finds a 'same flow'
candidates.

(tcp stack could eventually use rxhash instead of its custom hash
computation ...)

2) Use a LRU list to eventually be able to 'flush' too old packets,
   even if the napi never completes. Each time we process a new packet,
   being a GRO candidate or not, we increment a napi->sequence, and we
   flush the oldest packet in gro_lru_list if its own sequence is too
   old.

  That would give a latency guarantee.

^ permalink raw reply

* [PATCH 00/16] ARM: mostly harmless gcc warnings
From: Arnd Bergmann @ 2012-10-05 14:55 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nicolas Pitre, Dave Martin, Jan Kara, Catalin Marinas,
	linux-fbdev, Dominik Brodowski, Grant Likely, Pavel Machek,
	netdev, Greg Ungerer, Christoph Lameter, Kukjin Kim,
	Mike Turquette, Wan ZongShun, linux-scsi,
	Florian Tobias Schandinat, Jochen Friedrich, linux-samsung-soc,
	Julian Anastasov, arm, Alan Stern, Mike Rapoport, Russell King,
	coreteam, Arnd Bergmann, linux-p

In my attempt to weed out all warnings in ARM defconfigs, this
is the latest installment of patches. Ideally these could all
go through the respective subsystem maintainer trees, but
for any patches I don't hear anything back on, I will just
merge them through the arm-soc tree.

	Arnd

Arnd Bergmann (16):
  ARM: warnings in arch/arm/include/asm/uaccess.h
  ARM: binfmt_flat: unused variable 'persistent'
  SCSI: ARM: ncr5380/oak uses no interrupts
  SCSI: ARM: make fas216_dumpinfo function conditional
  vfs: bogus warnings in fs/namei.c
  mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
  cgroup: fix warning when building without any subsys
  ipvs: fix ip_vs_set_timeout debug messages
  USB: EHCI: mark ehci_orion_conf_mbus_windows __devinit
  clk: don't mark clkdev_add_table as init
  pcmcia: sharpsl: don't discard sharpsl_pcmcia_ops
  video: mark nuc900fb_map_video_memory as __devinit
  ARM: be really quiet when building with 'make -s'
  ARM: pxa: armcore: fix PCI PIO warnings
  spi/s3c64xx: use correct dma_transfer_direction type
  ARM: pass -marm to gcc by default for both C and assembler

 arch/arm/Makefile               |   13 +++++++------
 arch/arm/boot/Makefile          |   10 +++++-----
 arch/arm/common/it8152.c        |   12 +++++++++---
 arch/arm/include/asm/flat.h     |    2 +-
 arch/arm/include/asm/uaccess.h  |    4 ++--
 arch/arm/tools/Makefile         |    2 +-
 drivers/clk/clkdev.c            |    2 +-
 drivers/pcmcia/pxa2xx_sharpsl.c |    2 +-
 drivers/scsi/arm/fas216.c       |    2 +-
 drivers/scsi/arm/oak.c          |    1 +
 drivers/spi/spi-s3c64xx.c       |    2 +-
 drivers/usb/host/ehci-orion.c   |    2 +-
 drivers/video/nuc900fb.c        |    2 +-
 fs/namei.c                      |    6 +++---
 include/linux/cgroup.h          |    2 +-
 mm/slob.c                       |    6 +++---
 net/netfilter/ipvs/ip_vs_ctl.c  |   10 ++++++----
 17 files changed, 45 insertions(+), 35 deletions(-)

Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ben Blum <bblum@andrew.cmu.edu>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Bryan Wu <bryan.wu@canonical.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Martin <dave.martin@linaro.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Miao <eric.y.miao@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Igor Grinberg <grinberg@compulab.co.il>
Cc: Jan Kara <jack@suse.cz>
Cc: Jochen Friedrich <jochen@scram.de>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Mike Rapoport <mike@compulab.co.il>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Pavel Machek <pavel@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Simon Horman <horms@verge.net.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wan ZongShun <mcuos.com@gmail.com>
Cc: coreteam@netfilter.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-fbdev@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-pcmcia@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: netfilter@vger.kernel.org
Cc: spi-devel-general@lists.sourceforge.net

-- 
1.7.10

^ permalink raw reply

* [PATCH 08/16] ipvs: fix ip_vs_set_timeout debug messages
From: Arnd Bergmann @ 2012-10-05 14:55 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, arm, Arnd Bergmann, David S. Miller, netdev,
	Simon Horman, Julian Anastasov, netfilter-devel, netfilter,
	coreteam
In-Reply-To: <1349448930-23976-1-git-send-email-arnd@arndb.de>

The ip_vs_set_timeout function sets timeouts for TCP and UDP, which
can be enabled independently at compile time. The debug message
always prints both timeouts that are passed into the function,
but if one is disabled, the message will show uninitialized data.

This splits the debug message into two separte IP_VS_DBG statements
that are in the same #ifdef section to ensure we only print the
text about what is actually going on.

Without this patch, building ARM ixp4xx_defconfig results in:

net/netfilter/ipvs/ip_vs_ctl.c: In function 'ip_vs_genl_set_cmd':
net/netfilter/ipvs/ip_vs_ctl.c:2238:47: warning: 't.udp_timeout' may be used uninitialized in this function [-Wuninitialized]
net/netfilter/ipvs/ip_vs_ctl.c:3322:28: note: 't.udp_timeout' was declared here
net/netfilter/ipvs/ip_vs_ctl.c:2238:47: warning: 't.tcp_fin_timeout' may be used uninitialized in this function [-Wuninitialized]
net/netfilter/ipvs/ip_vs_ctl.c:3322:28: note: 't.tcp_fin_timeout' was declared here
net/netfilter/ipvs/ip_vs_ctl.c:2238:47: warning: 't.tcp_timeout' may be used uninitialized in this function [-Wuninitialized]
net/netfilter/ipvs/ip_vs_ctl.c:3322:28: note: 't.tcp_timeout' was declared here

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: netfilter-devel@vger.kernel.org
Cc: netfilter@vger.kernel.org
Cc: coreteam@netfilter.org
---
 net/netfilter/ipvs/ip_vs_ctl.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index f51013c..f3a66c3 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2237,12 +2237,11 @@ static int ip_vs_set_timeout(struct net *net, struct ip_vs_timeout_user *u)
 	struct ip_vs_proto_data *pd;
 #endif
 
-	IP_VS_DBG(2, "Setting timeout tcp:%d tcpfin:%d udp:%d\n",
+#ifdef CONFIG_IP_VS_PROTO_TCP
+	IP_VS_DBG(2, "Setting timeout tcp:%d tcpfin:%d\n",
 		  u->tcp_timeout,
-		  u->tcp_fin_timeout,
-		  u->udp_timeout);
+		  u->tcp_fin_timeout);
 
-#ifdef CONFIG_IP_VS_PROTO_TCP
 	if (u->tcp_timeout) {
 		pd = ip_vs_proto_data_get(net, IPPROTO_TCP);
 		pd->timeout_table[IP_VS_TCP_S_ESTABLISHED]
@@ -2257,6 +2256,9 @@ static int ip_vs_set_timeout(struct net *net, struct ip_vs_timeout_user *u)
 #endif
 
 #ifdef CONFIG_IP_VS_PROTO_UDP
+	IP_VS_DBG(2, "Setting timeout udp:%d\n",
+		  u->udp_timeout);
+
 	if (u->udp_timeout) {
 		pd = ip_vs_proto_data_get(net, IPPROTO_UDP);
 		pd->timeout_table[IP_VS_UDP_S_NORMAL]
-- 
1.7.10


^ permalink raw reply related

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
From: Eric Dumazet @ 2012-10-05 15:04 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins
In-Reply-To: <1349448659.28867.27.camel@sakura.staff.proxad.net>

On Fri, 2012-10-05 at 16:50 +0200, Maxime Bizon wrote:
> On Fri, 2012-10-05 at 15:02 +0200, Eric Dumazet wrote:
> 
> > On Fri, 2012-10-05 at 14:51 +0200, Maxime Bizon wrote:
> > 
> > > > New convention would be : pass number of needed bytes after current
> > > > tail, not after current end.
> > > 
> > > Fully agree on this
> > > 
> > 
> > Here is the proposal :
> 
> Looks fine
> 
> What is your plan for the actual pskb_expand_head() code now that you
> will have absolute values for headroom & tailroom ?
> 

They stay relative values.

For example, netlink_trim() really wants to shrink the skb head.


> Because there will still be callers that have no clue of required
> tailroom (nor further headroom requirement), like skb_cow() in
> vlan_reorder_header().
> 

What I plan is to not shrink size, unless specifically asked.

Its 3.8 material anyway, so a stable fix is needed on skb_recycle() and
NET_SKB_PAD minimal value.

^ permalink raw reply

* Re: [PATCH] net, bluetooth: don't attempt to free a channel that wasn't created
From: Sasha Levin @ 2012-10-05 15:09 UTC (permalink / raw)
  To: Andrei Emeltchenko, Sasha Levin, marcel, gustavo, johan.hedberg,
	davem, davej, linux-kernel, linux-bluetooth, netdev
In-Reply-To: <20121005102159.GF12229@aemeltch-MOBL1>

On 10/05/2012 06:22 AM, Andrei Emeltchenko wrote:
> Hi Sasha,
> 
> On Thu, Oct 04, 2012 at 07:59:57PM -0400, Sasha Levin wrote:
>> We may currently attempt to free a channel which wasn't created due to
>> an error in the initialization path, this would cause a NULL ptr deref.
> 
> Please put oops dump here.

[   12.919073] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[   12.919131] IP: [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
[   12.919135] PGD 0
[   12.919138] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   12.919193] Dumping ftrace buffer:
[   12.919242]    (ftrace buffer empty)
[   12.919314] Modules linked in:
[   12.919318] CPU 1
[   12.919319] Pid: 6210, comm: krfcommd Tainted: G        W    3.6.0-next-20121004-sasha-00005-gb010653-dirty #30
[   12.919374] RIP: 0010:[<ffffffff836645c4>]  [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
[   12.919377] RSP: 0000:ffff880066933c38  EFLAGS: 00010246
[   12.919378] RAX: ffffffff8366c780 RBX: 0000000000000000 RCX: 6666666666666667
[   12.919379] RDX: 0000000000000fa0 RSI: ffffffff84d3f79e RDI: 0000000000000010
[   12.919381] RBP: ffff880066933c48 R08: ffffffff859989f8 R09: 0000000000000001
[   12.919382] R10: 0000000000000000 R11: 7fffffffffffffff R12: 0000000000000000
[   12.919383] R13: ffff88009b00a200 R14: ffff88009b00a200 R15: 0000000000000001
[   12.919385] FS:  0000000000000000(0000) GS:ffff880033600000(0000) knlGS:0000000000000000
[   12.919437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.919440] CR2: 0000000000000010 CR3: 0000000005026000 CR4: 00000000000406e0
[   12.919446] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   12.919451] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   12.919504] Process krfcommd (pid: 6210, threadinfo ffff880066932000, task ffff880065c4b000)
[   12.919506] Stack:
[   12.919510]  ffff88009b00a200 ffff880032084000 ffff880066933c68 ffffffff8366c7bc
[   12.919513]  7fffffffffffffff ffff880032084000 ffff880066933c98 ffffffff833ae0ae
[   12.919516]  ffff880066933ca8 0000000000000000 0000000000000000 ffff88009b00a200
[   12.919517] Call Trace:
[   12.919522]  [<ffffffff8366c7bc>] l2cap_sock_destruct+0x3c/0x80
[   12.919527]  [<ffffffff833ae0ae>] __sk_free+0x1e/0x1f0
[   12.919530]  [<ffffffff833ae2f7>] sk_free+0x17/0x20
[   12.919585]  [<ffffffff8366ca4e>] l2cap_sock_alloc.constprop.5+0x9e/0xd0
[   12.919591]  [<ffffffff8366cb9e>] l2cap_sock_create+0x7e/0x100
[   12.919652]  [<ffffffff83a4f32a>] ? _raw_read_lock+0x6a/0x80
[   12.919658]  [<ffffffff836402c4>] ? bt_sock_create+0x74/0x110
[   12.919660]  [<ffffffff83640308>] bt_sock_create+0xb8/0x110
[   12.919664]  [<ffffffff833aa232>] __sock_create+0x282/0x3b0
[   12.919720]  [<ffffffff833aa0b0>] ? __sock_create+0x100/0x3b0
[   12.919725]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
[   12.919779]  [<ffffffff833aa37f>] sock_create_kern+0x1f/0x30
[   12.919784]  [<ffffffff83675714>] rfcomm_l2sock_create+0x44/0x70
[   12.919787]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
[   12.919790]  [<ffffffff836785fe>] rfcomm_run+0x4e/0x1f0
[   12.919846]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
[   12.919852]  [<ffffffff81138ee3>] kthread+0xe3/0xf0
[   12.919908]  [<ffffffff8117b12e>] ? put_lock_stats.isra.14+0xe/0x40
[   12.919914]  [<ffffffff81138e00>] ? flush_kthread_work+0x1f0/0x1f0
[   12.919968]  [<ffffffff83a5077c>] ret_from_fork+0x7c/0x90
[   12.919973]  [<ffffffff81138e00>] ? flush_kthread_work+0x1f0/0x1f0
[   12.920161] Code: 83 ec 08 f6 05 ff 58 44 02 04 74 1b 8b 4f 10 48 89 fa 48 c7 c6 d9 d7 d4 84 48 c7 c7 80 9e aa 85 31 c0 e8 80
ac 3a fe 48 8d 7b 10 <f0> 83 6b 10 01 0f 94 c0 84 c0 74 05 e8 8b e0 ff ff 48 83 c4 08
[   12.920165] RIP  [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
[   12.920166]  RSP <ffff880066933c38>
[   12.920167] CR2: 0000000000000010
[   12.920417] ---[ end trace 5a9114e8a158ab84 ]---

> 
>> Introduced in commit 61d6ef3e ("Bluetooth: Make better use of l2cap_chan
>> reference counting").
>>
>> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
>> ---
>>  net/bluetooth/l2cap_sock.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
>> index 083f2bf..66c295a 100644
>> --- a/net/bluetooth/l2cap_sock.c
>> +++ b/net/bluetooth/l2cap_sock.c
>> @@ -1083,7 +1083,8 @@ static void l2cap_sock_destruct(struct sock *sk)
>>  {
>>  	BT_DBG("sk %p", sk);
>>  
>> -	l2cap_chan_put(l2cap_pi(sk)->chan);
>> +	if (l2cap_pi(sk)->chan)
>> +		l2cap_chan_put(l2cap_pi(sk)->chan);
> 
> This does not look right, I suppose you want to put somewhere missing
> chan_hold

The issue is basically kzalloc() failing in l2cap_chan_create(), this would lead to sk_free()
getting called with chan being NULL, which is why I don't think that chan_hold is relevant
at this stage.


Thanks,
Sasha

^ permalink raw reply

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
From: Maxime Bizon @ 2012-10-05 15:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins
In-Reply-To: <1349449440.21172.118.camel@edumazet-glaptop>

On Fri, 2012-10-05 at 17:04 +0200, Eric Dumazet wrote:
> 
> 
> What I plan is to not shrink size, unless specifically asked.
> 
> Its 3.8 material anyway, so a stable fix is needed on skb_recycle()
> and NET_SKB_PAD minimal value.

You think removing skb_recycle() is too big a change for stable ?

Driver change is simple, as recycling is not guaranteed today you have
this:

 if (!try_recycle(skb))
    skb = alloc_skb()
> 
we just remove the try_recycle part, we are not adding any new code
path.

I'm not the right person to give you a correct NET_SKB_PAD value, I have
a lot of out of tree patches in my kernels, some custom drivers as well
that needs quite a lot of headroom, and uncommon network setup.

-- 
Maxime

^ permalink raw reply

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
From: Eric Dumazet @ 2012-10-05 15:37 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins
In-Reply-To: <1349450151.28867.33.camel@sakura.staff.proxad.net>

On Fri, 2012-10-05 at 17:15 +0200, Maxime Bizon wrote:

> You think removing skb_recycle() is too big a change for stable ?
> 
> Driver change is simple, as recycling is not guaranteed today you have
> this:
> 
>  if (!try_recycle(skb))
>     skb = alloc_skb()
> > 
> we just remove the try_recycle part, we are not adding any new code
> path.
> 
> 
> I'm not the right person to give you a correct NET_SKB_PAD value, I have
> a lot of out of tree patches in my kernels, some custom drivers as well
> that needs quite a lot of headroom, and uncommon network setup.
> 

OK, I'll send the patch to remove skb_recycle()

^ permalink raw reply

* [PATCH] net: remove skb recycling
From: Eric Dumazet @ 2012-10-05 16:23 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev
In-Reply-To: <1349451425.21172.119.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Over time, skb recycling infrastructure got litle interest and
many bugs. Generic rx path skb allocation is now using page
fragments for efficient GRO / TCP coalescing, and recyling
a tx skb for rx path is not worth the pain.

Last identified bug is that fat skbs can be recycled
and it can endup using high order pages after few iterations.

With help from Maxime Bizon, who pointed out that commit
87151b8689d (net: allow pskb_expand_head() to get maximum tailroom)
introduced this regression for recycled skbs.

Instead of fixing this bug, lets remove skb recycling.

Drivers wanting really hot skbs should use build_skb() anyway,
to allocate/populate sk_buff right before netif_receive_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maxime Bizon <mbizon@freebox.fr>
---
 drivers/net/ethernet/calxeda/xgmac.c              |   19 ----
 drivers/net/ethernet/freescale/gianfar.c          |   27 +-----
 drivers/net/ethernet/freescale/gianfar.h          |    2 
 drivers/net/ethernet/freescale/ucc_geth.c         |   29 +------
 drivers/net/ethernet/freescale/ucc_geth.h         |    2 
 drivers/net/ethernet/marvell/mv643xx_eth.c        |   18 ----
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |    1 
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |   20 -----
 include/linux/skbuff.h                            |   24 ------
 net/core/skbuff.c                                 |   47 ------------
 10 files changed, 16 insertions(+), 173 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 2b4b4f5..16814b3 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -375,7 +375,6 @@ struct xgmac_priv {
 	unsigned int tx_tail;
 
 	void __iomem *base;
-	struct sk_buff_head rx_recycle;
 	unsigned int dma_buf_sz;
 	dma_addr_t dma_rx_phy;
 	dma_addr_t dma_tx_phy;
@@ -672,9 +671,7 @@ static void xgmac_rx_refill(struct xgmac_priv *priv)
 		p = priv->dma_rx + entry;
 
 		if (priv->rx_skbuff[entry] == NULL) {
-			skb = __skb_dequeue(&priv->rx_recycle);
-			if (skb == NULL)
-				skb = netdev_alloc_skb(priv->dev, priv->dma_buf_sz);
+			skb = netdev_alloc_skb(priv->dev, priv->dma_buf_sz);
 			if (unlikely(skb == NULL))
 				break;
 
@@ -887,17 +884,7 @@ static void xgmac_tx_complete(struct xgmac_priv *priv)
 				       desc_get_buf_len(p), DMA_TO_DEVICE);
 		}
 
-		/*
-		 * If there's room in the queue (limit it to size)
-		 * we add this skb back into the pool,
-		 * if it's the right size.
-		 */
-		if ((skb_queue_len(&priv->rx_recycle) <
-			DMA_RX_RING_SZ) &&
-			skb_recycle_check(skb, priv->dma_buf_sz))
-			__skb_queue_head(&priv->rx_recycle, skb);
-		else
-			dev_kfree_skb(skb);
+		dev_kfree_skb(skb);
 	}
 
 	if (dma_ring_space(priv->tx_head, priv->tx_tail, DMA_TX_RING_SZ) >
@@ -1016,7 +1003,6 @@ static int xgmac_open(struct net_device *dev)
 			dev->dev_addr);
 	}
 
-	skb_queue_head_init(&priv->rx_recycle);
 	memset(&priv->xstats, 0, sizeof(struct xgmac_extra_stats));
 
 	/* Initialize the XGMAC and descriptors */
@@ -1053,7 +1039,6 @@ static int xgmac_stop(struct net_device *dev)
 		napi_disable(&priv->napi);
 
 	writel(0, priv->base + XGMAC_DMA_INTR_ENA);
-	skb_queue_purge(&priv->rx_recycle);
 
 	/* Disable the MAC core */
 	xgmac_mac_disable(priv->base);
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index a1b52ec..1d03dcd 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1765,7 +1765,6 @@ static void free_skb_resources(struct gfar_private *priv)
 			  sizeof(struct rxbd8) * priv->total_rx_ring_size,
 			  priv->tx_queue[0]->tx_bd_base,
 			  priv->tx_queue[0]->tx_bd_dma_base);
-	skb_queue_purge(&priv->rx_recycle);
 }
 
 void gfar_start(struct net_device *dev)
@@ -1943,8 +1942,6 @@ static int gfar_enet_open(struct net_device *dev)
 
 	enable_napi(priv);
 
-	skb_queue_head_init(&priv->rx_recycle);
-
 	/* Initialize a bunch of registers */
 	init_registers(dev);
 
@@ -2533,16 +2530,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 
 		bytes_sent += skb->len;
 
-		/* If there's room in the queue (limit it to rx_buffer_size)
-		 * we add this skb back into the pool, if it's the right size
-		 */
-		if (skb_queue_len(&priv->rx_recycle) < rx_queue->rx_ring_size &&
-		    skb_recycle_check(skb, priv->rx_buffer_size +
-				      RXBUF_ALIGNMENT)) {
-			gfar_align_skb(skb);
-			skb_queue_head(&priv->rx_recycle, skb);
-		} else
-			dev_kfree_skb_any(skb);
+		dev_kfree_skb_any(skb);
 
 		tx_queue->tx_skbuff[skb_dirtytx] = NULL;
 
@@ -2608,7 +2596,7 @@ static void gfar_new_rxbdp(struct gfar_priv_rx_q *rx_queue, struct rxbd8 *bdp,
 static struct sk_buff *gfar_alloc_skb(struct net_device *dev)
 {
 	struct gfar_private *priv = netdev_priv(dev);
-	struct sk_buff *skb = NULL;
+	struct sk_buff *skb;
 
 	skb = netdev_alloc_skb(dev, priv->rx_buffer_size + RXBUF_ALIGNMENT);
 	if (!skb)
@@ -2621,14 +2609,7 @@ static struct sk_buff *gfar_alloc_skb(struct net_device *dev)
 
 struct sk_buff *gfar_new_skb(struct net_device *dev)
 {
-	struct gfar_private *priv = netdev_priv(dev);
-	struct sk_buff *skb = NULL;
-
-	skb = skb_dequeue(&priv->rx_recycle);
-	if (!skb)
-		skb = gfar_alloc_skb(dev);
-
-	return skb;
+	return gfar_alloc_skb(dev);
 }
 
 static inline void count_errors(unsigned short status, struct net_device *dev)
@@ -2787,7 +2768,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 			if (unlikely(!newskb))
 				newskb = skb;
 			else if (skb)
-				skb_queue_head(&priv->rx_recycle, skb);
+				dev_kfree_skb(skb);
 		} else {
 			/* Increment the number of packets */
 			rx_queue->stats.rx_packets++;
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 4141ef2..22eabc1 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -1080,8 +1080,6 @@ struct gfar_private {
 
 	u32 cur_filer_idx;
 
-	struct sk_buff_head rx_recycle;
-
 	/* RX queue filer rule set*/
 	struct ethtool_rx_list rx_list;
 	struct mutex rx_queue_access;
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 1642884..dfa0aaa 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -209,14 +209,12 @@ static struct list_head *dequeue(struct list_head *lh)
 static struct sk_buff *get_new_skb(struct ucc_geth_private *ugeth,
 		u8 __iomem *bd)
 {
-	struct sk_buff *skb = NULL;
+	struct sk_buff *skb;
 
-	skb = __skb_dequeue(&ugeth->rx_recycle);
+	skb = netdev_alloc_skb(ugeth->ndev,
+			       ugeth->ug_info->uf_info.max_rx_buf_length +
+			       UCC_GETH_RX_DATA_BUF_ALIGNMENT);
 	if (!skb)
-		skb = netdev_alloc_skb(ugeth->ndev,
-				      ugeth->ug_info->uf_info.max_rx_buf_length +
-				      UCC_GETH_RX_DATA_BUF_ALIGNMENT);
-	if (skb == NULL)
 		return NULL;
 
 	/* We need the data buffer to be aligned properly.  We will reserve
@@ -2020,8 +2018,6 @@ static void ucc_geth_memclean(struct ucc_geth_private *ugeth)
 		iounmap(ugeth->ug_regs);
 		ugeth->ug_regs = NULL;
 	}
-
-	skb_queue_purge(&ugeth->rx_recycle);
 }
 
 static void ucc_geth_set_multi(struct net_device *dev)
@@ -2230,8 +2226,6 @@ static int ucc_struct_init(struct ucc_geth_private *ugeth)
 		return -ENOMEM;
 	}
 
-	skb_queue_head_init(&ugeth->rx_recycle);
-
 	return 0;
 }
 
@@ -3274,12 +3268,7 @@ static int ucc_geth_rx(struct ucc_geth_private *ugeth, u8 rxQ, int rx_work_limit
 			if (netif_msg_rx_err(ugeth))
 				ugeth_err("%s, %d: ERROR!!! skb - 0x%08x",
 					   __func__, __LINE__, (u32) skb);
-			if (skb) {
-				skb->data = skb->head + NET_SKB_PAD;
-				skb->len = 0;
-				skb_reset_tail_pointer(skb);
-				__skb_queue_head(&ugeth->rx_recycle, skb);
-			}
+			dev_free_skb(skb);
 
 			ugeth->rx_skbuff[rxQ][ugeth->skb_currx[rxQ]] = NULL;
 			dev->stats.rx_dropped++;
@@ -3349,13 +3338,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
 
 		dev->stats.tx_packets++;
 
-		if (skb_queue_len(&ugeth->rx_recycle) < RX_BD_RING_LEN &&
-			     skb_recycle_check(skb,
-				    ugeth->ug_info->uf_info.max_rx_buf_length +
-				    UCC_GETH_RX_DATA_BUF_ALIGNMENT))
-			__skb_queue_head(&ugeth->rx_recycle, skb);
-		else
-			dev_kfree_skb(skb);
+		dev_kfree_skb(skb);
 
 		ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]] = NULL;
 		ugeth->skb_dirtytx[txQ] =
diff --git a/drivers/net/ethernet/freescale/ucc_geth.h b/drivers/net/ethernet/freescale/ucc_geth.h
index f71b3e7..75f3371 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.h
+++ b/drivers/net/ethernet/freescale/ucc_geth.h
@@ -1214,8 +1214,6 @@ struct ucc_geth_private {
 	/* index of the first skb which hasn't been transmitted yet. */
 	u16 skb_dirtytx[NUM_TX_QUEUES];
 
-	struct sk_buff_head rx_recycle;
-
 	struct ugeth_mii_info *mii_info;
 	struct phy_device *phydev;
 	phy_interface_t phy_interface;
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index 087b9e0..84c1326 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -412,7 +412,6 @@ struct mv643xx_eth_private {
 	u8 work_rx_refill;
 
 	int skb_size;
-	struct sk_buff_head rx_recycle;
 
 	/*
 	 * RX state.
@@ -673,9 +672,7 @@ static int rxq_refill(struct rx_queue *rxq, int budget)
 		struct rx_desc *rx_desc;
 		int size;
 
-		skb = __skb_dequeue(&mp->rx_recycle);
-		if (skb == NULL)
-			skb = netdev_alloc_skb(mp->dev, mp->skb_size);
+		skb = netdev_alloc_skb(mp->dev, mp->skb_size);
 
 		if (skb == NULL) {
 			mp->oom = 1;
@@ -989,14 +986,7 @@ static int txq_reclaim(struct tx_queue *txq, int budget, int force)
 				       desc->byte_cnt, DMA_TO_DEVICE);
 		}
 
-		if (skb != NULL) {
-			if (skb_queue_len(&mp->rx_recycle) <
-					mp->rx_ring_size &&
-			    skb_recycle_check(skb, mp->skb_size))
-				__skb_queue_head(&mp->rx_recycle, skb);
-			else
-				dev_kfree_skb(skb);
-		}
+		dev_kfree_skb(skb);
 	}
 
 	__netif_tx_unlock(nq);
@@ -2349,8 +2339,6 @@ static int mv643xx_eth_open(struct net_device *dev)
 
 	napi_enable(&mp->napi);
 
-	skb_queue_head_init(&mp->rx_recycle);
-
 	mp->int_mask = INT_EXT;
 
 	for (i = 0; i < mp->rxq_count; i++) {
@@ -2445,8 +2433,6 @@ static int mv643xx_eth_stop(struct net_device *dev)
 	mib_counters_update(mp);
 	del_timer_sync(&mp->mib_counters_timer);
 
-	skb_queue_purge(&mp->rx_recycle);
-
 	for (i = 0; i < mp->rxq_count; i++)
 		rxq_deinit(mp->rxq + i);
 	for (i = 0; i < mp->txq_count; i++)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index e872e1d..7d51a65 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -50,7 +50,6 @@ struct stmmac_priv {
 	unsigned int dirty_rx;
 	struct sk_buff **rx_skbuff;
 	dma_addr_t *rx_skbuff_dma;
-	struct sk_buff_head rx_recycle;
 
 	struct net_device *dev;
 	dma_addr_t dma_rx_phy;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3be8833..c6cdbc4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -747,18 +747,7 @@ static void stmmac_tx(struct stmmac_priv *priv)
 		priv->hw->ring->clean_desc3(p);
 
 		if (likely(skb != NULL)) {
-			/*
-			 * If there's room in the queue (limit it to size)
-			 * we add this skb back into the pool,
-			 * if it's the right size.
-			 */
-			if ((skb_queue_len(&priv->rx_recycle) <
-				priv->dma_rx_size) &&
-				skb_recycle_check(skb, priv->dma_buf_sz))
-				__skb_queue_head(&priv->rx_recycle, skb);
-			else
-				dev_kfree_skb(skb);
-
+			dev_kfree_skb(skb);
 			priv->tx_skbuff[entry] = NULL;
 		}
 
@@ -1169,7 +1158,6 @@ static int stmmac_open(struct net_device *dev)
 	priv->eee_enabled = stmmac_eee_init(priv);
 
 	napi_enable(&priv->napi);
-	skb_queue_head_init(&priv->rx_recycle);
 	netif_start_queue(dev);
 
 	return 0;
@@ -1222,7 +1210,6 @@ static int stmmac_release(struct net_device *dev)
 		kfree(priv->tm);
 #endif
 	napi_disable(&priv->napi);
-	skb_queue_purge(&priv->rx_recycle);
 
 	/* Free the IRQ lines */
 	free_irq(dev->irq, dev);
@@ -1388,10 +1375,7 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv)
 		if (likely(priv->rx_skbuff[entry] == NULL)) {
 			struct sk_buff *skb;
 
-			skb = __skb_dequeue(&priv->rx_recycle);
-			if (skb == NULL)
-				skb = netdev_alloc_skb_ip_align(priv->dev,
-								bfsize);
+			skb = netdev_alloc_skb_ip_align(priv->dev, bfsize);
 
 			if (unlikely(skb == NULL))
 				break;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b33a3a1..6a2c34e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -589,9 +589,6 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
-extern void skb_recycle(struct sk_buff *skb);
-extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
-
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 extern int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 extern struct sk_buff *skb_clone(struct sk_buff *skb,
@@ -2645,27 +2642,6 @@ static inline void skb_checksum_none_assert(const struct sk_buff *skb)
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
 
-static inline bool skb_is_recycleable(const struct sk_buff *skb, int skb_size)
-{
-	if (irqs_disabled())
-		return false;
-
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)
-		return false;
-
-	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
-		return false;
-
-	skb_size = SKB_DATA_ALIGN(skb_size + NET_SKB_PAD);
-	if (skb_end_offset(skb) < skb_size)
-		return false;
-
-	if (skb_shared(skb) || skb_cloned(skb))
-		return false;
-
-	return true;
-}
-
 /**
  * skb_head_is_locked - Determine if the skb->head is locked down
  * @skb: skb to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cdc2859..6e04b1f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -655,53 +655,6 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
-/**
- * 	skb_recycle - clean up an skb for reuse
- * 	@skb: buffer
- *
- * 	Recycles the skb to be reused as a receive buffer. This
- * 	function does any necessary reference count dropping, and
- * 	cleans up the skbuff as if it just came from __alloc_skb().
- */
-void skb_recycle(struct sk_buff *skb)
-{
-	struct skb_shared_info *shinfo;
-
-	skb_release_head_state(skb);
-
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-
-	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->data = skb->head + NET_SKB_PAD;
-	skb_reset_tail_pointer(skb);
-}
-EXPORT_SYMBOL(skb_recycle);
-
-/**
- *	skb_recycle_check - check if skb can be reused for receive
- *	@skb: buffer
- *	@skb_size: minimum receive buffer size
- *
- *	Checks that the skb passed in is not shared or cloned, and
- *	that it is linear and its head portion at least as large as
- *	skb_size so that it can be recycled as a receive buffer.
- *	If these conditions are met, this function does any necessary
- *	reference count dropping and cleans up the skbuff as if it
- *	just came from __alloc_skb().
- */
-bool skb_recycle_check(struct sk_buff *skb, int skb_size)
-{
-	if (!skb_is_recycleable(skb, skb_size))
-		return false;
-
-	skb_recycle(skb);
-
-	return true;
-}
-EXPORT_SYMBOL(skb_recycle_check);
-
 static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	new->tstamp		= old->tstamp;

^ permalink raw reply related

* Re: [PATCH net] ipv6: clean up tcp_v6_early_demux() icsk variable
From: Neal Cardwell @ 2012-10-05 16:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Netdev, Eric Dumazet
In-Reply-To: <1349445365.21172.74.camel@edumazet-glaptop>

OK, thanks, Eric. This sounds good. We can scrap this one, and we'll
do a full merge/clean-up when net-next is open.

neal

On Fri, Oct 5, 2012 at 9:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-10-05 at 09:21 -0400, Neal Cardwell wrote:
>> Remove an icsk variable, which by convention should refer to an
>> inet_connection_sock rather than an inet_sock. In the process, make
>> the tcp_v6_early_demux() code and formatting a bit more like
>> tcp_v4_early_demux(), to ease comparisons and maintenance.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>> ---
>>  net/ipv6/tcp_ipv6.c |    4 ++--
>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index 49c8903..491c41b 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -1740,11 +1740,11 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
>>               skb->destructor = sock_edemux;
>>               if (sk->sk_state != TCP_TIME_WAIT) {
>>                       struct dst_entry *dst = sk->sk_rx_dst;
>> -                     struct inet_sock *icsk = inet_sk(sk);
>> +
>>                       if (dst)
>>                               dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
>>                       if (dst &&
>> -                         icsk->rx_dst_ifindex == skb->skb_iif)
>> +                         inet_sk(sk)->rx_dst_ifindex == skb->skb_iif)
>>                               skb_dst_set_noref(skb, dst);
>>               }
>>       }
>
> Hi Neal
>
> I would wait net-next being opened, and do a full merge/cleanup
>
> For example ipv6 uses :
>
> if (!pskb_may_pull(skb, skb_transport_offset(skb) +
>                    sizeof(struct tcphdr)))
> th = tcp_hdr(skb);
>
> while ipv4 uses :
>
> if (!pskb_may_pull(skb, ip_hdrlen(skb) + sizeof(struct tcphdr)))
> th = (struct tcphdr *) ((char *)iph + ip_hdrlen(skb));
>
> It would be good to use the ipv6 variant (its less instructions and
> cleaner)
>
> Same for the "struct net *net = dev_net(skb->dev);" used in ipv4, while
> ipv6 doesnt need this extra net variable.
>
> Thanks
>
>

^ permalink raw reply

* Re: [PATCH] net, TTY: initialize tty->driver_data before usage
From: Greg KH @ 2012-10-05 16:25 UTC (permalink / raw)
  To: Sasha Levin
  Cc: samuel, davem, jslaby, alan, levinsasha928, davej, linux-kernel,
	netdev
In-Reply-To: <1349395281-12495-1-git-send-email-sasha.levin@oracle.com>

On Thu, Oct 04, 2012 at 08:01:21PM -0400, Sasha Levin wrote:
> Commit 9c650ffc ("TTY: ircomm_tty, add tty install") split _open() to
> _install() and _open(). It also moved the initialization of driver_data
> out of open(), but never added it to install() - causing a NULL ptr
> deref whenever the driver was used.
> 
> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
> Acked-by: Jiri Slaby <jslaby@suse.cz>
> ---
>  net/irda/ircomm/ircomm_tty.c | 2 ++
>  1 file changed, 2 insertions(+)

Nice catch, now applied.

greg k-h

^ permalink raw reply

* RE: [PATCH] net: Fix skb_under_panic oops in neigh_resolve_output
From: Ramesh Nagappa @ 2012-10-05 16:33 UTC (permalink / raw)
  To: Eric Dumazet, ramesh.nagappa@gmail.com
  Cc: davem@davemloft.net, edumazet@google.com, ebiederm@xmission.com,
	xemul@parallels.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1349418445.21172.11.camel@edumazet-glaptop>

Hi Eric,

Yes, that is a good optimization. neigh_resolve_output() also has the
__skb_pull() outside the loop, is that required ? The changes would be
like ...

neigh_resolve_output()
...
-        __skb_pull(skb, skb_network_offset(skb));


        if (!neigh_event_send(neigh, skb)) {
                int err;
                struct net_device *dev = neigh->dev;
                unsigned int seq;

                if (dev->header_ops->cache && !neigh->hh.hh_len)
                        neigh_hh_init(neigh, dst);

                do {                        
+			      __skb_pull(skb, skb_network_offset(skb));
                        seq = read_seqbegin(&neigh->ha_lock);
                        err = dev_hard_header(skb, dev, ntohs(skb->protocol),
                                              neigh->ha, NULL, skb->len);
                } while (read_seqretry(&neigh->ha_lock, seq));

Thanks,
Ramesh
 

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
> >  	do {
> >  		seq = read_seqbegin(&neigh->ha_lock);
> > +		__skb_pull(skb, skb_network_offset(skb));
> 
> This __skb_pull() could be done before the read_seqbegin() to 
> keep the protected section short.
> 
> >  		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
> >  				      neigh->ha, NULL, skb->len);
> >  	} while (read_seqretry(&neigh->ha_lock, seq));
> 
> Thanks
> 
> 
> 

^ permalink raw reply

* RE: [PATCH] net: Fix skb_under_panic oops in neigh_resolve_output
From: Eric Dumazet @ 2012-10-05 16:59 UTC (permalink / raw)
  To: Ramesh Nagappa
  Cc: ramesh.nagappa@gmail.com, davem@davemloft.net,
	edumazet@google.com, ebiederm@xmission.com, xemul@parallels.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CA3E5132752D1F49871ED0FC40D56BBA43ECF33592@EUSAACMS0701.eamcs.ericsson.se>

On Fri, 2012-10-05 at 12:33 -0400, Ramesh Nagappa wrote:
> Hi Eric,
> 
> Yes, that is a good optimization. neigh_resolve_output() also has the
> __skb_pull() outside the loop, is that required ? The changes would be
> like ...
> 
> neigh_resolve_output()
> ...
> -        __skb_pull(skb, skb_network_offset(skb));
> 
> 
>         if (!neigh_event_send(neigh, skb)) {
>                 int err;
>                 struct net_device *dev = neigh->dev;
>                 unsigned int seq;
> 
>                 if (dev->header_ops->cache && !neigh->hh.hh_len)
>                         neigh_hh_init(neigh, dst);
> 
>                 do {                        
> +			      __skb_pull(skb, skb_network_offset(skb));
>                         seq = read_seqbegin(&neigh->ha_lock);
>                         err = dev_hard_header(skb, dev, ntohs(skb->protocol),
>                                               neigh->ha, NULL, skb->len);
>                 } while (read_seqretry(&neigh->ha_lock, seq));

All similar constructions should be audited.

For example in net/decnet/dn_neigh.c , dn_neigh_output_packet()

^ permalink raw reply

* PROJECT FILE.
From: MINERALS AND ENERGY @ 2012-10-05 17:41 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 39 bytes --]

PLEASE OPEN ATTACH FILE FOR MORE DETAIL

[-- Attachment #2: BUSINESS PROPOSAL1.doc --]
[-- Type: application/msword, Size: 33792 bytes --]

^ permalink raw reply

* Filter matching problems
From: John A. Sullivan III @ 2012-10-05 17:46 UTC (permalink / raw)
  To: netdev

Hello, all.  Can anyone tell me why this filter does not match anything
when I think it should match everything:

tc filter replace dev bond3 parent ffff: protocol ip prio 1 u32 match u8
0 0 action mirred egress redirect dev ifb0

I get:
tc -s filter show dev bond3 parent ffff:
filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0
(rule hit 1216169263 success 0)
  match 00000000/00000000 at 0 (success 1216169263 )
        action order 1: mirred (Egress Redirect to device ifb0) stolen
        index 2 ref 1 bind 1 installed 11362879 sec used 11362879 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        rate 0bit 0pps backlog 0b 0p requeues 0

which I think is telling me I'm not getting any matches.  Here are more
details.

We are redirecting ingress into ifb0 and redirecting egress into ifb1
(because several interfaces, e.g., OpenVPN tun devices) need the exact
same traffic shaping.  We realize that we cannot place the ingress
filter at ffff: and the egress filter at root as those appear to be
treated as one and the same, i.e., the first match of those two filters
will be used and the other will thus never see any packets.

To get around that problem, we inserted a priomap on the interface and
apply the egress redirect filter on it's child like this:

tc qdisc replace dev bond3 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc filter replace dev bond3 parent 1:0 protocol ip prio 1 u32 match u8 0 0 flowid 1:1 action mirred egress redirect dev ifb1
tc qdisc replace dev bond3 ingress
tc filter replace dev bond3 parent ffff: protocol ip prio 1 u32 match u8 0 0 action mirred egress redirect dev ifb0

So what have I mangled? Thanks - John

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Rick Jones @ 2012-10-05 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349448747.21172.113.camel@edumazet-glaptop>

On 10/05/2012 07:52 AM, Eric Dumazet wrote:
> What we could do :
>
> 1) Use a hash to avoid expensive gro_list management and allow
>     much more concurrent flows.
>
> Use skb_get_rxhash(skb) to compute rxhash
>
> If l4_rxhash not set -> not a GRO candidate.
>
> If l4_rxhash set, use a hash lookup to immediately finds a 'same flow'
> candidates.
>
> (tcp stack could eventually use rxhash instead of its custom hash
> computation ...)
>
> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>     even if the napi never completes. Each time we process a new packet,
>     being a GRO candidate or not, we increment a napi->sequence, and we
>     flush the oldest packet in gro_lru_list if its own sequence is too
>     old.
>
>    That would give a latency guarantee.

Flushing things if N packets have come though sounds like goodness, and 
it reminds me a bit about what happens with IP fragment reassembly - 
another area where the stack is trying to guess just how long to 
hang-onto a packet before doing something else with it.  But the value 
of N to get a "decent" per-flow GRO aggregation rate will depend on the 
number of concurrent flows right?  If I want to have a good shot at 
getting 2 segments combined for 1000 active, concurrent flows entering 
my system via that interface, won't N have to approach 2000?

GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
does provide a very nice "boost" on various situations (especially 
numbers of concurrent netperfs that don't blow-out the tracking limits) 
but since it won't really know anything about the flow(s) involved (*) 
or even their number (?), it will always be guessing.  That is why it is 
really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
keeps us all beggars here).

A goodly portion of the benefit of GRO comes from the "incidental" ACK 
avoidance it causes yes?  That being the case, might that be a 
worthwhile avenue to explore?   It would then naturally scale as TCP et 
al do today.

When we go to 40 GbE will we have 4x as many flows, or the same number 
of 4x faster flows?

rick jones

* for example - does this TCP segment contain the last byte(s) of a 
pipelined http request/response and the first byte(s) of the next one 
and so should "flush" now?

^ permalink raw reply

* [PATCH] mlx4: set carrier off after register netdev
From: Min Zhang @ 2012-10-05 18:28 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

ifconfig mlx4_en port reported RUNNING even though the link was down.

mlx4_en_init_netdev didn't initialize the dev operstate properly so
the operstate stayed as default IF_OPER_UNKNOWN, then ifconfig treated
the UNKNOWN as RUNNING state for backward compatiblity per RFC2863.

The fix calls netif_carrier_off which is supposed to set operstate
after register_netdev. Calling it before register_netdev has no effect
since the dev->state is still NETREG_UNINITIALIZED

Tested by removing the physical link signal to the mellanox 10G port,
modprobe mlx4_en, then ifconfig up. Verify there is no RUNNING status.

Signed-off-by: Min Zhang <mzhang@mvista.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c96113b..8c562f7a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1306,12 +1306,12 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,

 	mdev->pndev[port] = dev;

-	netif_carrier_off(dev);
 	err = register_netdev(dev);
 	if (err) {
 		mlx4_err(mdev, "Netdev registration failed for port %d\n", port);
 		goto out;
 	}
+	netif_carrier_off(dev);

 	en_warn(priv, "Using %d TX rings\n", prof->tx_ring_num);
 	en_warn(priv, "Using %d RX rings\n", prof->rx_ring_num);
-- 
1.7.7.4

^ permalink raw reply related

* Re: [PATCH] net, bluetooth: don't attempt to free a channel that wasn't created
From: Andrei Emeltchenko @ 2012-10-05 18:53 UTC (permalink / raw)
  To: Sasha Levin
  Cc: marcel-kz+m5ild9QBg9hUCZPvPmw, gustavo-THi1TnShQwVAfugRpC6u6w,
	johan.hedberg-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, davej-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <506EF827.3060100-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Hi Sasha,

On Fri, Oct 05, 2012 at 11:09:27AM -0400, Sasha Levin wrote:
> On 10/05/2012 06:22 AM, Andrei Emeltchenko wrote:
> > Hi Sasha,
> > 
> > On Thu, Oct 04, 2012 at 07:59:57PM -0400, Sasha Levin wrote:
> >> We may currently attempt to free a channel which wasn't created due to
> >> an error in the initialization path, this would cause a NULL ptr deref.
> > 
> > Please put oops dump here.
> 
> [   12.919073] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> [   12.919131] IP: [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
> [   12.919135] PGD 0
> [   12.919138] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   12.919193] Dumping ftrace buffer:
> [   12.919242]    (ftrace buffer empty)
> [   12.919314] Modules linked in:
> [   12.919318] CPU 1
> [   12.919319] Pid: 6210, comm: krfcommd Tainted: G        W    3.6.0-next-20121004-sasha-00005-gb010653-dirty #30
> [   12.919374] RIP: 0010:[<ffffffff836645c4>]  [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
> [   12.919377] RSP: 0000:ffff880066933c38  EFLAGS: 00010246
> [   12.919378] RAX: ffffffff8366c780 RBX: 0000000000000000 RCX: 6666666666666667
> [   12.919379] RDX: 0000000000000fa0 RSI: ffffffff84d3f79e RDI: 0000000000000010
> [   12.919381] RBP: ffff880066933c48 R08: ffffffff859989f8 R09: 0000000000000001
> [   12.919382] R10: 0000000000000000 R11: 7fffffffffffffff R12: 0000000000000000
> [   12.919383] R13: ffff88009b00a200 R14: ffff88009b00a200 R15: 0000000000000001
> [   12.919385] FS:  0000000000000000(0000) GS:ffff880033600000(0000) knlGS:0000000000000000
> [   12.919437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.919440] CR2: 0000000000000010 CR3: 0000000005026000 CR4: 00000000000406e0
> [   12.919446] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   12.919451] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   12.919504] Process krfcommd (pid: 6210, threadinfo ffff880066932000, task ffff880065c4b000)
> [   12.919506] Stack:
> [   12.919510]  ffff88009b00a200 ffff880032084000 ffff880066933c68 ffffffff8366c7bc
> [   12.919513]  7fffffffffffffff ffff880032084000 ffff880066933c98 ffffffff833ae0ae
> [   12.919516]  ffff880066933ca8 0000000000000000 0000000000000000 ffff88009b00a200
> [   12.919517] Call Trace:
> [   12.919522]  [<ffffffff8366c7bc>] l2cap_sock_destruct+0x3c/0x80
> [   12.919527]  [<ffffffff833ae0ae>] __sk_free+0x1e/0x1f0
> [   12.919530]  [<ffffffff833ae2f7>] sk_free+0x17/0x20
> [   12.919585]  [<ffffffff8366ca4e>] l2cap_sock_alloc.constprop.5+0x9e/0xd0
> [   12.919591]  [<ffffffff8366cb9e>] l2cap_sock_create+0x7e/0x100
> [   12.919652]  [<ffffffff83a4f32a>] ? _raw_read_lock+0x6a/0x80
> [   12.919658]  [<ffffffff836402c4>] ? bt_sock_create+0x74/0x110
> [   12.919660]  [<ffffffff83640308>] bt_sock_create+0xb8/0x110
> [   12.919664]  [<ffffffff833aa232>] __sock_create+0x282/0x3b0
> [   12.919720]  [<ffffffff833aa0b0>] ? __sock_create+0x100/0x3b0
> [   12.919725]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
> [   12.919779]  [<ffffffff833aa37f>] sock_create_kern+0x1f/0x30
> [   12.919784]  [<ffffffff83675714>] rfcomm_l2sock_create+0x44/0x70
> [   12.919787]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
> [   12.919790]  [<ffffffff836785fe>] rfcomm_run+0x4e/0x1f0
> [   12.919846]  [<ffffffff836785b0>] ? rfcomm_process_sessions+0x17e0/0x17e0
> [   12.919852]  [<ffffffff81138ee3>] kthread+0xe3/0xf0
> [   12.919908]  [<ffffffff8117b12e>] ? put_lock_stats.isra.14+0xe/0x40
> [   12.919914]  [<ffffffff81138e00>] ? flush_kthread_work+0x1f0/0x1f0
> [   12.919968]  [<ffffffff83a5077c>] ret_from_fork+0x7c/0x90
> [   12.919973]  [<ffffffff81138e00>] ? flush_kthread_work+0x1f0/0x1f0
> [   12.920161] Code: 83 ec 08 f6 05 ff 58 44 02 04 74 1b 8b 4f 10 48 89 fa 48 c7 c6 d9 d7 d4 84 48 c7 c7 80 9e aa 85 31 c0 e8 80
> ac 3a fe 48 8d 7b 10 <f0> 83 6b 10 01 0f 94 c0 84 c0 74 05 e8 8b e0 ff ff 48 83 c4 08
> [   12.920165] RIP  [<ffffffff836645c4>] l2cap_chan_put+0x34/0x50
> [   12.920166]  RSP <ffff880066933c38>
> [   12.920167] CR2: 0000000000000010
> [   12.920417] ---[ end trace 5a9114e8a158ab84 ]---
> 
> > 
> >> Introduced in commit 61d6ef3e ("Bluetooth: Make better use of l2cap_chan
> >> reference counting").
> >>
> >> Signed-off-by: Sasha Levin <sasha.levin-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> >> ---
> >>  net/bluetooth/l2cap_sock.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
> >> index 083f2bf..66c295a 100644
> >> --- a/net/bluetooth/l2cap_sock.c
> >> +++ b/net/bluetooth/l2cap_sock.c
> >> @@ -1083,7 +1083,8 @@ static void l2cap_sock_destruct(struct sock *sk)
> >>  {
> >>  	BT_DBG("sk %p", sk);
> >>  
> >> -	l2cap_chan_put(l2cap_pi(sk)->chan);
> >> +	if (l2cap_pi(sk)->chan)
> >> +		l2cap_chan_put(l2cap_pi(sk)->chan);
> > 
> > This does not look right, I suppose you want to put somewhere missing
> > chan_hold
> 
> The issue is basically kzalloc() failing in l2cap_chan_create(), this would lead to sk_free()
> getting called with chan being NULL, which is why I don't think that chan_hold is relevant
> at this stage.

Then this looks OK. 

Best regards 
Andrei Emeltchenko 

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-05 19:00 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <506F23F6.1060704@hp.com>

On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
> O
> Flushing things if N packets have come though sounds like goodness, and 
> it reminds me a bit about what happens with IP fragment reassembly - 
> another area where the stack is trying to guess just how long to 
> hang-onto a packet before doing something else with it.  But the value 
> of N to get a "decent" per-flow GRO aggregation rate will depend on the 
> number of concurrent flows right?  If I want to have a good shot at 
> getting 2 segments combined for 1000 active, concurrent flows entering 
> my system via that interface, won't N have to approach 2000?
> 

It all depends on the max latency you can afford.

> GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
> does provide a very nice "boost" on various situations (especially 
> numbers of concurrent netperfs that don't blow-out the tracking limits) 
> but since it won't really know anything about the flow(s) involved (*) 
> or even their number (?), it will always be guessing.  That is why it is 
> really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
> keeps us all beggars here).
> 
> A goodly portion of the benefit of GRO comes from the "incidental" ACK 
> avoidance it causes yes?  That being the case, might that be a 
> worthwhile avenue to explore?   It would then naturally scale as TCP et 
> al do today.
> 
> When we go to 40 GbE will we have 4x as many flows, or the same number 
> of 4x faster flows?
> 
> rick jones
> 
> * for example - does this TCP segment contain the last byte(s) of a 
> pipelined http request/response and the first byte(s) of the next one 
> and so should "flush" now?

Some remarks :

1) I use some 40Gbe links, thats probably why I try to improve things ;)

2) benefit of GRO can be huge, and not only for the ACK avoidance
   (other tricks could be done for ACK avoidance in the stack)

3) High speeds probably need multiqueue device, and each queue has its
own GRO unit.

  For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
packets/sec)

Lets say we allow no more than 1ms of delay in GRO, this means we could
have about 400 packets in the GRO queue (assuming 1500 bytes packets)

Another idea to play with would be to extend GRO to allow packet
reorder.

^ permalink raw reply

* EG20T and Micrel KSZ9021 Gigabit Ethernet
From: James Kosin @ 2012-10-05 19:10 UTC (permalink / raw)
  To: netdev; +Cc: James Kosin

Anyone,

I'm using a Congatec QA6 Intel Atom board and need help getting the
onboard Micrel Ethernet PHY working correctly.
I've downloaded the latest stable kernel 3.5.5 from kernel.org, and
attempted to configure everything for the Debian platform I'm running
on.  Only problems are (1) the Ethernet PHY seems to be using pch_gbe
which seems to track back to an oki-semi directory under
drivers/net/ethernet ...
I'm tempted to reconfigure to remove that; but, am afraid that there is
no other driver for the network.  The only other set of drivers that
seem to be related are for the intel directory; but, these seem to be
intel specific.  I don't see support for this combination.

Anyone have an idea where I need to start?

PS:  Please reply to ALL when replying, I'm not a member of this list.

Thanks,
James

^ permalink raw reply

* Re: [PATCH] Packet mmap : allow the user to choose the offset of the tx payload.
From: pchavent @ 2012-10-05 19:21 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, edumazet, xemul, herbert, netdev, johann.baudy, uaca
In-Reply-To: <CAD6jFUTmNiRHdAkckkwcV2rEOf_kxjHmntTO+rG6ewLs-RhFfg@mail.gmail.com>

Hi

On Fri, 5 Oct 2012 16:17:12 +0200, Daniel Borkmann wrote:
> On Fri, Oct 5, 2012 at 3:10 PM, Paul Chavent <Paul.Chavent@onera.fr> 
> wrote:
>> The tx offset of packet mmap tx ring used to be :
>> (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))
>>
>> The problem is that depending on the usage of SOCK_DGRAM or
>> SOCK_RAW, the payload could be aligned or not.
>>
>> This patch allow to let the user give an offset for it's tx
>> payload if he desires.
>>
>> Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
>
> Can you provide an example when it doesn't hit TPACKET_ALIGNMENT?

When we use tx ring, the user have to write at (TPACKET_HDRLEN - 
sizeof(struct sockaddr_ll))

This adress is aligned on TPACKET_ALIGNMENT since
TPACKET_HDRLEN = (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + 
sizeof(struct sockaddr_ll))

When we use the tx ring with SOCK_RAW option, the mac header is aligned 
on TPACKET_ALIGNMENT, but not the payload (14 bytes away).

>
> On the first look, could it be that your patch currently enforces the
> use of your tp_net's offset for *all* TX_RING users, even if they
> don't care about it? So in case off==0, you probably get a negative
> offset in case of SOCK_RAW, thus it won't hit the second 
> if-statement.
> Sure, but this does not look intuitive in my opinion. Maybe, it's
> better to only enter this path if the offset *is* used by someone.

Yes, moreover i had a problem with the signed/unsigned comparison.
I've fixed all those problems for the next submission.

>
> Also, to my knowledge tp_net is currently only applied in receive
> path. So, if for whatever reason people did not explicitly set tp_net
> to 0, it might break their code, if there's a random offset in it,
> right?

I haven't found where all frames were initialized to 
TP_STATUS_AVAILABLE, so i have supposed that the frames were initialized 
to zero.

So your are right, if tp_net is not forced to zero before sending the 
packet it could break the legacy code :(

>
> Best,
> Daniel

Thank for your review.

Paul.

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Rick Jones @ 2012-10-05 19:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349463634.21172.152.camel@edumazet-glaptop>

On 10/05/2012 12:00 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
>
> Some remarks :
>
> 1) I use some 40Gbe links, thats probably why I try to improve things ;)

Path length before workarounds :)

> 2) benefit of GRO can be huge, and not only for the ACK avoidance
>     (other tricks could be done for ACK avoidance in the stack)

Just how much code path is there between NAPI and the socket?? (And I 
guess just how much combining are you hoping for?)

> 3) High speeds probably need multiqueue device, and each queue has its
> own GRO unit.
>
>    For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
> packets/sec)
>
> Lets say we allow no more than 1ms of delay in GRO,

OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
that kind of added delay anyway.  I'm not sure if that also then 
eliminates the networked storage types.

> this means we could have about 400 packets in the GRO queue (assuming
> 1500 bytes packets)

How many flows are you going to have entering via that queue?  And just 
how well "shuffled" will the segments of those flows be?  That is what 
it all comes down to right?  How many (active) flows and how well 
shuffled they are.  If the flows aren't well shuffled, you can get away 
with a smallish coalescing context.  If they are perfectly shuffled and 
greater in number than your delay allowance you get right back to square 
with all the overhead of GRO attempts with none of the benefit.

If the flow count is < 400 to allow a decent shot at a non-zero 
combining rate on well shuffled flows with the 400 packet limit, then 
that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
aggregated.  And I think you then get two segments per flow aggregated 
at a time.  Is that consistent with what you expect to be the 
characteristics of the flows entering via that queue?

rick jones

^ permalink raw reply

* Re: [PATCH] Packet mmap : allow the user to choose the offset of the tx payload.
From: Daniel Borkmann @ 2012-10-05 19:37 UTC (permalink / raw)
  To: pchavent; +Cc: davem, edumazet, xemul, herbert, netdev, johann.baudy, uaca
In-Reply-To: <7b317aa8fa2c96f3bf7ce916d18eb550@sybille.onecert.fr>

On Fri, Oct 5, 2012 at 9:21 PM, pchavent <Paul.Chavent@onera.fr> wrote:
> On Fri, 5 Oct 2012 16:17:12 +0200, Daniel Borkmann wrote:
>> On Fri, Oct 5, 2012 at 3:10 PM, Paul Chavent <Paul.Chavent@onera.fr>
>> wrote:
>>>
>>> The tx offset of packet mmap tx ring used to be :
>>> (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))
>>>
>>> The problem is that depending on the usage of SOCK_DGRAM or
>>> SOCK_RAW, the payload could be aligned or not.
>>>
>>> This patch allow to let the user give an offset for it's tx
>>> payload if he desires.
>>>
>>> Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
>>
>> Can you provide an example when it doesn't hit TPACKET_ALIGNMENT?
>
> When we use tx ring, the user have to write at (TPACKET_HDRLEN -
> sizeof(struct sockaddr_ll))
>
> This adress is aligned on TPACKET_ALIGNMENT since
> TPACKET_HDRLEN = (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + sizeof(struct
> sockaddr_ll))
>
> When we use the tx ring with SOCK_RAW option, the mac header is aligned on
> TPACKET_ALIGNMENT, but not the payload (14 bytes away).

Okay, I'm confused about your intentions, maybe I'm missing something.
The man-page of packet(7) clearly says:

The socket_type is either SOCK_RAW for raw packets *including* the
link level header or SOCK_DGRAM for cooked packets with  the link
level header *removed*.

So this is perfectly intended behavior of PF_PACKET.

Cheers,

Daniel

^ permalink raw reply

* Re: [RFC PATCH 0/5] net: socket bind to file descriptor introduced
From: J. Bruce Fields @ 2012-10-05 20:00 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: Eric W. Biederman, tglx@linutronix.de, mingo@redhat.com,
	davem@davemloft.net, hpa@zytor.com,
	thierry.reding@avionic-design.de, bfields@redhat.com,
	eric.dumazet@gmail.com, Pavel Emelianov, neilb@suse.de,
	netdev@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, paul.gortmaker@windriver.com,
	viro@zeniv.linux.org.uk, gorcunov@openvz.org,
	akpm@linux-foundation.org, tim.c.chen@linux.intel.com,
	"devel@ope
In-Reply-To: <20120904190007.GB29369@fieldses.org>

On Tue, Sep 04, 2012 at 03:00:07PM -0400, bfields wrote:
> On Mon, Aug 20, 2012 at 02:18:13PM +0400, Stanislav Kinsbursky wrote:
> > 16.08.2012 07:03, Eric W. Biederman пишет:
> > >Stanislav Kinsbursky <skinsbursky@parallels.com> writes:
> > >
> > >>This patch set introduces new socket operation and new system call:
> > >>sys_fbind(), which allows to bind socket to opened file.
> > >>File to bind to can be created by sys_mknod(S_IFSOCK) and opened by
> > >>open(O_PATH).
> > >>
> > >>This system call is especially required for UNIX sockets, which has name
> > >>lenght limitation.
> > >>
> > >>The following series implements...
> > >
> > >Hmm.  I just realized this patchset is even sillier than I thought.
> > >
> > >Stanislav is the problem you are ultimately trying to solve nfs clients
> > >in a container connecting to the wrong user space rpciod?
> > >
> > 
> > Hi, Eric.
> > The problem you mentioned was the reason why I started to think about this.
> > But currently I believe, that limitations in unix sockets connect or
> > bind should be removed, because it will be useful it least for CRIU
> > project.
> > 
> > >Aka net/sunrpc/xprtsock.c:xs_setup_local only taking an absolute path
> > >and then creating a delayed work item to actually open the unix domain
> > >socket?
> > >
> > >The straight correct and straight forward thing to do appears to be:
> > >- Capture the root from current->fs in xs_setup_local.
> > >- In xs_local_finish_connect change current->fs.root to the captured
> > >   version of root before kernel_connect, and restore current->fs.root
> > >   after kernel_connect.
> > >
> > >It might not be a bad idea to implement open on unix domain sockets in
> > >a filesystem as create(AF_LOCAL)+connect() which would allow you to
> > >replace __sock_create + kernel_connect with a simple file_open_root.
> > >
> > 
> > I like the idea of introducing new family (AF_LOCAL_AT for example)
> > and new sockaddr for connecting or binding from specified root. The
> > only thing I'm worrying is passing file descriptor to unix bind or
> > connect routine. Because this approach doesn't provide easy way to
> > use such family and sockaddr in kernel (like in NFS example).
> > 
> > >But I think the simple scheme of:
> > >struct path old_root;
> > >old_root = current->fs.root;
> > >kernel_connect(...);
> > >current->fs.root = old_root;
> > >
> > >Is more than sufficient and will remove the need for anything
> > >except a purely local change to get nfs clients to connect from
> > >containers.
> > >
> > 
> > That was my first idea.
> 
> So is this what you're planning on doing now?

What ever happened to this?

--b.

> 
> > And probably it would be worth to change all
> > fs_struct to support sockets with relative path.
> > What do you think about it?
> 
> I didn't understand the question.  Are you suggesting that changes to
> fs_struct would be required to make this work?  I don't see why.
> 
> --b.

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-05 20:06 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <506F368F.3070403@hp.com>

On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:

> Just how much code path is there between NAPI and the socket?? (And I 
> guess just how much combining are you hoping for?)
> 

When GRO correctly works, you can save about 30% of cpu cycles, it
depends...

Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
gives an improvement as well...

> > Lets say we allow no more than 1ms of delay in GRO,
> 
> OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
> that kind of added delay anyway.  I'm not sure if that also then 
> eliminates the networked storage types.
> 

I took this 1ms delay, but I never said it was a fixed value ;)

Also remember one thing, this is the _max_ delay in case your napi
handler is flooded. This almost never happen (tm)

> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 bytes packets)
> 
> How many flows are you going to have entering via that queue?  And just 
> how well "shuffled" will the segments of those flows be?  That is what 
> it all comes down to right?  How many (active) flows and how well 
> shuffled they are.  If the flows aren't well shuffled, you can get away 
> with a smallish coalescing context.  If they are perfectly shuffled and 
> greater in number than your delay allowance you get right back to square 
> with all the overhead of GRO attempts with none of the benefit.

Not sure what you mean by shuffle. We use a hash table to locate a flow,
but we also have a LRU list to get the packets ordered by their entry in
the 'GRO unit'.

If napi completes, all the LRU list content is flushed to IP stack.
( napi_gro_flush()) 

If napi doesnt complete, we would only flush 'too old' packets found in
the LRU.

Note: this selective flush can be called once per napi run from
net_rx_action(). Extra cost to get a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets)

This timestamp could be stored in napi->timestamp and done once per
n->poll(n, weight) call.

> 
> If the flow count is < 400 to allow a decent shot at a non-zero 
> combining rate on well shuffled flows with the 400 packet limit, then 
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
> aggregated.  And I think you then get two segments per flow aggregated 
> at a time.  Is that consistent with what you expect to be the 
> characteristics of the flows entering via that queue?

If a packet cant stay more than 1ms, then a flow sending less than 1000
packets per second wont benefit from GRO.

So yes, 12.5 Mbit/s would be the threshold.

By the way, when TCP timestamps are used, and hosts are linux machines
with HZ=1000, current GRO can not coalesce packets anyway because their
TCP options are different.

(So it would be not useful trying bigger sojourn time than 1ms)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox