Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 00/19] Enable various Renesas drivers on all ARM platforms
From: Laurent Pinchart @ 2013-10-29 13:22 UTC (permalink / raw)
  To: dedekind1-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-fbdev-u79uwXL29TY76Z2rM5mHXA,
	linux-sh-u79uwXL29TY76Z2rM5mHXA, Linus Walleij,
	Guennadi Liakhovetski, Thierry Reding,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-i2c-u79uwXL29TY76Z2rM5mHXA, Laurent Pinchart,
	David S. Miller, Vinod Koul, Wolfram Sang, Magnus Damm,
	Eduardo Valentin, Tomi Valkeinen,
	linux-serial-u79uwXL29TY76Z2rM5mHXA,
	linux-input-u79uwXL29TY76Z2rM5mHXA, Zhang Rui, Chris Ball,
	Jean-Christophe Plagniol-Villard,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	linux-pwm-u79uwXL29TY76Z2rM5mHXA, Samuel Ortiz,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, Ian Molton, Mark
In-Reply-To: <1383051980.29619.33.camel-Bxnoe/o8FG+Ef9UqXRslZEEOCMrvLtNR@public.gmane.org>

Hi Artem,

On Tuesday 29 October 2013 15:06:20 Artem Bityutskiy wrote:
> On Tue, 2013-10-29 at 10:12 +0100, Guennadi Liakhovetski wrote:
> > On Tue, 29 Oct 2013, Laurent Pinchart wrote:
> > > Hello,
> > > 
> > > This patch series, based on v3.12-rc7, prepares various Renesas drivers
> > > for migration to multiplatform kernels by enabling their compilation or
> > > otherwise fixing them on all ARM platforms. The patches are pretty
> > > straightforward and are described in their commit message.
> > > 
> > > I'd like to get all these patches merged in v3.14. As they will need to
> > > go through their respective subsystems' trees, I would appreciate if all
> > > maintainers involved could notify me when they merge patches from this
> > > series in their tree to help me tracking the merge status. I don't plan
> > > to send pull requests individually for these patches, and I will repost
> > > patches individually if changes are requested during review.
> > > 
> > > If you believe the issue should be solved in a different way (for
> > > instance by removing the architecture dependency completely) please
> > > reply to the cover letter to let other maintainers chime in.
> > 
> > Exactly this was my doubt. If we let these drivers build on all ARM
> > platforms... Maybe we should just let them build everywhere? Unless there
> > are real ARM dependencies. Maybe you could try to remove the restriction
> > and try to build them all on x86?
> 
> If they have never been used on anything but ARM, why would you remove
> ARM dependencies? Just for the sake of compile-checking?
> 
> Also, if ARM dependency is ever removed, all these should become 'n' by
> default in the Kconfig, in order to make sure they do not slip into
> defconfigs of different architectures.

The idea is that, if ARM is neither a compile-time nor runtime dependency, it 
should not be specified in Kconfig. However, if the IP core has never been 
used on anything but SuperH and ARM, I don't think clobbering the config 
process with drivers that can't be used on the target architecture would be a 
really good idea, especially now that we have a COMPILE_TEST Kconfig option. 
My preference does goes to SUPERH || ARM || COMPILE_TEST over no dependency at 
all.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-29 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029131149.GB20408@gmail.com>

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specificially 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Heres my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure whats 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 
Ok, well, I'll run it again in just a bit here.
Neil

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Ingo Molnar @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029130712.GA25078@neilslaptop.think-freely.org>


* Neil Horman <nhorman@tuxdriver.com> wrote:

> I'm sure it worked properly on my system here, I specificially 
> checked it, but I'll gladly run it again.  You have to give me an 
> hour as I have a meeting to run to, but I'll have results shortly.

So what I tried to react to was this observation of yours:

> > > Heres my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure whats 
> > > going on here, but doing so resulted in a 10x slowdown of the 
> > > runtime of each iteration which I can't explain. [...]

A 10x slowdown would be consistent with not running your testcase 
but 'perf bench sched messaging' by accident, or so.

But I was really just guessing wildly here.

Thanks,

	Ingo

^ permalink raw reply

* Re: [patch net-next] ipv6: allow userspace to create address with IFLA_F_TEMPORARY flag
From: Hannes Frederic Sowa @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, David Miller, jiri, vyasevich, netdev, kuznet,
	jmorris, yoshfuji, kaber, thaller, stephen, Dave Taht
In-Reply-To: <1383052163.5464.28.camel@edumazet-glaptop.roam.corp.google.com>

On Tue, Oct 29, 2013 at 06:09:23AM -0700, Eric Dumazet wrote:
> On Tue, 2013-10-29 at 13:40 +0100, Hannes Frederic Sowa wrote:
> 
> > Yes, this is a very unfortunate situation. From my experience it is not that
> > easy to get a patch merged into isc-dhcp.
> > 
> > It seems not that invasive to switch from af_packet to an udp socket with
> > SO_BROADCAST set.
> 
> Has anyone cooked and submitted such a patch ?

Not me, my patch was intended for the server side (dhcrelay correctly
choosing source addresses).

> I can visit ISC with a hammer if needed ;)

Actually that is a very big motivation for me to try to come up with a
patch at the weekend. ;-)

^ permalink raw reply

* Re: [patch net-next] ipv6: allow userspace to create address with IFLA_F_TEMPORARY flag
From: Eric Dumazet @ 2013-10-29 13:09 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, David Miller, jiri, vyasevich, netdev, kuznet,
	jmorris, yoshfuji, kaber, thaller, stephen, Dave Taht
In-Reply-To: <20131029124010.GA15762@order.stressinduktion.org>

On Tue, 2013-10-29 at 13:40 +0100, Hannes Frederic Sowa wrote:

> Yes, this is a very unfortunate situation. From my experience it is not that
> easy to get a patch merged into isc-dhcp.
> 
> It seems not that invasive to switch from af_packet to an udp socket with
> SO_BROADCAST set.

Has anyone cooked and submitted such a patch ?

I can visit ISC with a hammer if needed ;)

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-29 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029125233.GA17449@gmail.com>

On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > > 
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > > 
> > > What does '-- /root/test.sh' do?
> > > 
> > > Unless I'm missing something, the line above will run:
> > > 
> > >   perf bench sched messaging -- /root/test.sh
> > > 
> > > which should be equivalent to:
> > > 
> > >   perf bench sched messaging
> > > 
> > > i.e. /root/test.sh won't be run.
> > 
> > According to the perf man page, I'm supposed to be able to use -- 
> > to separate perf command line parameters from the command I want 
> > to run.  And it definately executed test.sh, I added an echo to 
> > stdout in there as a test run and observed them get captured in 
> > counters.txt
> 
> Well, '--' can be used to delineate the command portion for cases 
> where it's ambiguous.
> 
> Here's it's unambiguous though. This:
> 
>   perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> 
> stops parsing a valid option after the -ddd option, so in theory it 
> should execute 'perf bench sched messaging -- /root/test.sh' where 
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
> ignored.
> 
> The message output you provided seems to suggest that to be the 
> case:
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
> See how the command executed by perf stat was 'perf bench ...'.
> 
> Did you want to run:
> 
>   perf stat --repeat 20 -C 0 -ddd /root/test.sh
> 
I'm sure it worked properly on my system here, I specificially checked it, but
I'll gladly run it again.  You have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil

> ?
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply

* Re: [PATCH 00/19] Enable various Renesas drivers on all ARM platforms
From: Artem Bityutskiy @ 2013-10-29 13:06 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: linux-fbdev, linux-sh, Linus Walleij, Guennadi Liakhovetski,
	Thierry Reding, linux-mtd, linux-i2c, Laurent Pinchart,
	Vinod Koul, Joerg Roedel, Wolfram Sang, Magnus Damm,
	Eduardo Valentin, Tomi Valkeinen, linux-serial, linux-input,
	Zhang Rui, Chris Ball, Jean-Christophe Plagniol-Villard,
	linux-media, linux-pwm, Samuel Ortiz, linux-pm, Ian Molton,
	Mark Brown <b
In-Reply-To: <Pine.LNX.4.64.1310291009121.8404@axis700.grange>

On Tue, 2013-10-29 at 10:12 +0100, Guennadi Liakhovetski wrote:
> Hi Laurent
> 
> On Tue, 29 Oct 2013, Laurent Pinchart wrote:
> 
> > Hello,
> > 
> > This patch series, based on v3.12-rc7, prepares various Renesas drivers
> > for migration to multiplatform kernels by enabling their compilation or
> > otherwise fixing them on all ARM platforms. The patches are pretty
> > straightforward and are described in their commit message.
> > 
> > I'd like to get all these patches merged in v3.14. As they will need to go
> > through their respective subsystems' trees, I would appreciate if all
> > maintainers involved could notify me when they merge patches from this series
> > in their tree to help me tracking the merge status. I don't plan to send pull
> > requests individually for these patches, and I will repost patches
> > individually if changes are requested during review.
> > 
> > If you believe the issue should be solved in a different way (for instance by
> > removing the architecture dependency completely) please reply to the cover
> > letter to let other maintainers chime in.
> 
> Exactly this was my doubt. If we let these drivers build on all ARM 
> platforms... Maybe we should just let them build everywhere? Unless there 
> are real ARM dependencies. Maybe you could try to remove the restriction 
> and try to build them all on x86?

If they have never been used on anything but ARM, why would you remove
ARM dependencies? Just for the sake of compile-checking?

Also, if ARM dependency is ever removed, all these should become 'n' by
default in the Kconfig, in order to make sure they do not slip into
defconfigs of different architectures.

-- 
Best Regards,
Artem Bityutskiy

^ permalink raw reply

* [PATCH net-next] net: introduce gro_frag_list_enable sysctl
From: Eric Dumazet @ 2013-10-29 13:06 UTC (permalink / raw)
  To: Christoph Paasch, David Miller
  Cc: Herbert Xu, netdev, Jerry Chu, Michael Dalton
In-Reply-To: <20131029090849.GC5944@cpaasch-mac>

From: Eric Dumazet <edumazet@google.com>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

(Jerry is working on adding native GRO support for tunnels)

skb_segment() only deals with a frag_list chain containing MSS sized
fragments.

This patch adds support any kind of frag, and adds a new sysctl,
as clearly the GRO layer should avoid building frag_list skbs
on a router, as the segmentation is adding cpu overhead.

Note that we could try to reuse page fragments instead of doing
copy to linear skbs, but this requires a fair amount of work,
and possible truesize nightmares, as we do not track individual
(per page fragment) truesizes.

/proc/sys/net/core/gro_frag_list_enable possible values are :

0 : GRO layer is not allowed to use frag_list to extend skb capacity
1 : GRO layer is allowed to use frag_list, but skb_segment()
    automatically sets the sysctl to 0.
2 : GRO is allowed to use frag_list, and skb_segment() wont
    clear the sysctl.

Default value is 1 : automatic discovery

Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/sysctl/net.txt |   19 +++++++++++++++++++
 include/linux/netdevice.h    |    1 +
 net/core/skbuff.c            |   29 ++++++++++++++++++++---------
 net/core/sysctl_net_core.c   |   10 ++++++++++
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 9a0319a82470..8778568ae64e 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -87,6 +87,25 @@ sysctl.net.busy_read globally.
 Will increase power usage.
 Default: 0 (off)
 
+gro_frag_list_enable
+--------------------
+
+GRO layer can build full size GRO packets (~64K of payload) if it is allowed
+to extend skb using the frag_list pointer. However, this strategy is a win
+on hosts, where TCP flows are terminated. For a router, using frag_list
+skbs is not a win because we have to segment skbs before transmit,
+as most NIC drivers do not support frag_list.
+As soon as one frag_list skb has to be segmented, this sysctl is automatically
+changed from 1 to 0.
+If the value is set to 2, kernel wont change it.
+
+Choices : 0 (off),
+          1 (on, with automatic change to 0)
+          2 (on, permanent)
+
+Default: 1 (on, with automatic downgrade on a router)
+
+
 rmem_default
 ------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f62f746621..b82ff52f301e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2807,6 +2807,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		sysctl_gro_frag_list_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 bool netdev_has_any_upper_dev(struct net_device *dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32faa520f..d5b74ed1e9cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2761,7 +2761,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	unsigned int len;
 	__be16 proto;
 	bool csum;
-	int sg = !!(features & NETIF_F_SG);
+	bool sg = !!(features & NETIF_F_SG);
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2793,7 +2793,13 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				if (sysctl_gro_frag_list_enable == 1)
+					sysctl_gro_frag_list_enable = 0;
+				hsize = len;
+				sg = false;
+				goto do_linear;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
@@ -2812,6 +2818,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
+do_linear:
 			nskb = __alloc_skb(hsize + doffset + headroom,
 					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
 					   NUMA_NO_NODE);
@@ -2838,9 +2845,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2849,6 +2853,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			continue;
 		}
 
+		if (fskb != skb_shinfo(skb)->frag_list)
+			goto perform_csum_check;
+
 		frag = skb_shinfo(nskb)->frags;
 
 		skb_copy_from_linear_data_offset(skb, offset,
@@ -2944,9 +2951,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		if (nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
@@ -2977,9 +2986,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_size = headlen - offset;
 		unsigned int first_offset;
 
-		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
 			       offset;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cca444190907..2d6aaf6d5838 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -24,6 +24,7 @@
 
 static int zero = 0;
 static int one = 1;
+static int two = 2;
 static int ushort_max = USHRT_MAX;
 
 #ifdef CONFIG_RPS
@@ -360,6 +361,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "gro_frag_list_enable",
+		.data		= &sysctl_gro_frag_list_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
 	{ }
 };
 

^ permalink raw reply related

* Re: Bug in skb_segment: fskb->len != len
From: Eric Dumazet @ 2013-10-29 12:57 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Herbert Xu, netdev
In-Reply-To: <20131029090849.GC5944@cpaasch-mac>

On Tue, 2013-10-29 at 10:08 +0100, Christoph Paasch wrote:
> Ok, my router does not crash anymore with my workload.
> 
> Thanks for fixing it!
> 
> Tested-by: Christoph Paasch <christoph.paasch@uclouvain.be>

Ok fine ;)

Do you mind to test the official patch, as I added a new sysctl so that
GRO layer do not use frag_list on a router ?

You could check that /proc/sys/net/core/gro_frag_list_enable is
automatically cleared.

This way, we have optimal behavior for host or router.

Thanks !

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Ingo Molnar @ 2013-10-29 12:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029114907.GE24477@neilslaptop.think-freely.org>


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > > 
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > 
> > What does '-- /root/test.sh' do?
> > 
> > Unless I'm missing something, the line above will run:
> > 
> >   perf bench sched messaging -- /root/test.sh
> > 
> > which should be equivalent to:
> > 
> >   perf bench sched messaging
> > 
> > i.e. /root/test.sh won't be run.
> 
> According to the perf man page, I'm supposed to be able to use -- 
> to separate perf command line parameters from the command I want 
> to run.  And it definately executed test.sh, I added an echo to 
> stdout in there as a test run and observed them get captured in 
> counters.txt

Well, '--' can be used to delineate the command portion for cases 
where it's ambiguous.

Here's it's unambiguous though. This:

  perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh

stops parsing a valid option after the -ddd option, so in theory it 
should execute 'perf bench sched messaging -- /root/test.sh' where 
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
ignored.

The message output you provided seems to suggest that to be the 
case:

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

See how the command executed by perf stat was 'perf bench ...'.

Did you want to run:

  perf stat --repeat 20 -C 0 -ddd /root/test.sh

?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH net-next] ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
From: Hannes Frederic Sowa @ 2013-10-29 12:50 UTC (permalink / raw)
  To: David Miller, netdev, fweimer
In-Reply-To: <20131029120424.GD26185@order.stressinduktion.org>

On Tue, Oct 29, 2013 at 01:04:25PM +0100, Hannes Frederic Sowa wrote:
> The reason for this is, that __ip_rt_update_pmtu is called with zero mtu
> and will get clamped to min_pmtu. It is possible to tell people to also
> increase min_pmtu but I don't want to take the risk of breaking other
> applications on a DNS server given how the current semantics are.

It is actually difficult to select a reasonable mtu here given more than two
interfaces with different mtus.

Greetings,

  Hannes

^ permalink raw reply

* Re: [PATCH] bridge: pass correct vlan id to multicast code
From: Vlad Yasevich @ 2013-10-29 12:45 UTC (permalink / raw)
  To: Amos Kong, Vlad Yasevich; +Cc: netdev, shemminger, makita.toshiaki
In-Reply-To: <20131029023646.GA2795@amosk.info>

On 10/28/2013 10:36 PM, Amos Kong wrote:
> On Mon, Oct 28, 2013 at 03:45:07PM -0400, Vlad Yasevich wrote:
>> Currently multicast code attempts to extrace the vlan id from
>> the skb even when vlan filtering is disabled.  This can lead
>> to mdb entries being created with the wrong vlan id.
>> Pass the already extracted vlan id to the multicast
>> filtering code to make the correct id is used in
>> creation as well as lookup.
>
> Hi Vlad,
>
> Can we just update br_vlan_get_tag() to set vid to 0 if dev->vlan is
> disabled? I guess it would effect br_handle_local_finish().

I have another patch that refactors the br_vlan_get_tag() and that would 
address this issue.  I am testing it now.

I am hoping to simplify the patch a bit though since I am not
crazy about the function signature that br_vlan_get_tag() gets.

-vlad

>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
>> ---
>>   net/bridge/br_device.c    |  2 +-
>>   net/bridge/br_input.c     |  2 +-
>>   net/bridge/br_multicast.c | 44 +++++++++++++++++++-------------------------
>>   net/bridge/br_private.h   |  6 ++++--
>>   4 files changed, 25 insertions(+), 29 deletions(-)
>
> ...
>
>> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
>> index 8b0b610..686284f 100644
>> --- a/net/bridge/br_multicast.c
>> +++ b/net/bridge/br_multicast.c
>> @@ -947,7 +947,8 @@ void br_multicast_disable_port(struct net_bridge_port *port)
>>
>>   static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
>>   					 struct net_bridge_port *port,
>> -					 struct sk_buff *skb)
>> +					 struct sk_buff *skb,
>> +					 u16 vid)
>>   {
>>   	struct igmpv3_report *ih;
>>   	struct igmpv3_grec *grec;
>> @@ -957,12 +958,10 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
>>   	int type;
>>   	int err = 0;
>>   	__be32 group;
>> -	u16 vid = 0;
>>
>>   	if (!pskb_may_pull(skb, sizeof(*ih)))
>>   		return -EINVAL;
>>
>> -	br_vlan_get_tag(skb, &vid);
>
> After applied the patch, we always use vid in br_dev_xmit()->br_allowed_ingress(),
> is it possible that the vlan of bridge is re-enabled when other
> changed functions are called?
>
> We can just add a enabled checking before this kind of br_vlan_get_tag()?
>
> if (!br->vlan_enabled)
>      br_vlan_get_tag(skb2, &vid);
>
>
>>   	ih = igmpv3_report_hdr(skb);
>>   	num = ntohs(ih->ngrec);
>>   	len = sizeof(*ih);
>
> ...
>

^ permalink raw reply

* RE: [tipc-discussion] [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: David Laight @ 2013-10-29 12:42 UTC (permalink / raw)
  To: Andreas Bofjäll, Erik Hugne
  Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <526FACA9.9020004@ericsson.com>

>   		continue;
> +
> +unlock:
> +		tipc_node_unlock(n_ptr);
>   cont:
>   		kfree_skb(buf);
>   	}

Might be better to call the labels 'unlock_discard' and 'discard'.

	David

^ permalink raw reply

* Re: [patch net-next] ipv6: allow userspace to create address with IFLA_F_TEMPORARY flag
From: Hannes Frederic Sowa @ 2013-10-29 12:40 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, jiri, vyasevich, netdev, kuznet, jmorris, yoshfuji,
	kaber, thaller, stephen
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73BB@saturn3.aculab.com>

On Tue, Oct 29, 2013 at 09:37:06AM -0000, David Laight wrote:
> > Note that you don't even need to put the DHCP protocol core into the
> > kernel to fix the promiscuous problem.  You just have to use the
> > current kernel interfaces correctly.
> > 
> > It used to be the case a very long time ago that you couldn't even
> > receive broadcast UDP datagrams on a socket until an address was
> > configured on it.
> > 
> > So everyone turns on promiscuous mode and uses RAW sockets or
> > AF_PACKET.
> > 
> > Stupid?  yes.
> 
> Not only that, but the dhcp client could use a normal UDP socket
> to keep the lease renewed - I suspect it has only ever needed
> to use the BPF interface (I didn't think it set promiscuous)
> when acquiring the initial lease.

Yes, this is a very unfortunate situation. From my experience it is not that
easy to get a patch merged into isc-dhcp.

It seems not that invasive to switch from af_packet to an udp socket with
SO_BROADCAST set.

Greetings,

  Hannes

^ permalink raw reply

* Re: [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: Andreas Bofjäll @ 2013-10-29 12:40 UTC (permalink / raw)
  To: Erik Hugne, David Laight
  Cc: jon.maloy, netdev, tipc-discussion, davem, Paul.Gortmaker
In-Reply-To: <20131029120708.GI14038@eerihug-hybrid.rnd.ki.sw.ericsson.se>

On 10/29/2013 01:07 PM, Erik Hugne wrote:
> On Tue, Oct 29, 2013 at 11:13:09AM +0000, David Laight wrote:
>> I can only see one 'goto cont', an explicit kfree_skb() and
>> continue would be clearer.
>
> There's actually a few more 'goto cont' in the early error checks
> in this function causes it to bail early. Those are not visible
> in the patchdiff however.

If you wanted to make this a bit simpler, you could make two labels and 
move the calls to tipc_node_unlock(n_ptr). The blocks on lines 1545, 
1558 and 1614 (in the patched code) could be replaced by a simple "goto 
unlock" this way.

--- link.c-old	2013-10-29 13:34:35.804926348 +0100
+++ link.c	2013-10-29 13:39:23.991842809 +0100
@@ -1541,10 +1541,8 @@

  		/* Locate unicast link endpoint that should handle message */
  		l_ptr = n_ptr->links[b_ptr->identity];
-		if (unlikely(!l_ptr)) {
-			tipc_node_unlock(n_ptr);
-			goto cont;
-		}
+		if (unlikely(!l_ptr))
+			goto unlock;

  		/* Verify that communication with node is currently allowed */
  		if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
@@ -1554,10 +1552,8 @@
  			!msg_redundant_link(msg))
  			n_ptr->block_setup &= ~WAIT_PEER_DOWN;

-		if (n_ptr->block_setup) {
-			tipc_node_unlock(n_ptr);
-			goto cont;
-		}
+		if (n_ptr->block_setup)
+			goto unlock;

  		/* Validate message sequence number info */
  		seq_no = msg_seqno(msg);
@@ -1611,8 +1607,7 @@
  				tipc_node_unlock(n_ptr);
  				continue;
  			}
-			tipc_node_unlock(n_ptr);
-			goto cont;
+			goto unlock;
  		}

  		/* Link is now in state WORKING_WORKING */
@@ -1681,6 +1676,9 @@
  		tipc_node_unlock(n_ptr);
  		tipc_net_route_msg(buf);
  		continue;
+
+unlock:
+		tipc_node_unlock(n_ptr);
  cont:
  		kfree_skb(buf);
  	}

/Andreas

------------------------------------------------------------------------------
Android is increasing in popularity, but the open development platform that
developers love is also attractive to malware creators. Download this white
paper to learn more about secure code signing practices that can help keep
Android apps secure.
http://pubads.g.doubleclick.net/gampad/clk?id=65839951&iu=/4140/ostg.clktrk

^ permalink raw reply

* [PATCH] netfilter: xt_NFQUEUE: fix --queue-bypass regression
From: Pablo Neira Ayuso @ 2013-10-29 12:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1383048796-7964-1-git-send-email-pablo@netfilter.org>

From: Holger Eitzenberger <holger@eitzenberger.org>

V3 of the NFQUEUE target ignores the --queue-bypass flag,
causing packets to be dropped when the userspace listener
isn't running.

Regression is in since 8746ddcf12bb26 ("netfilter: xt_NFQUEUE:
introduce CPU fanout").

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_NFQUEUE.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_NFQUEUE.c b/net/netfilter/xt_NFQUEUE.c
index 1e2fae3..ed00fef 100644
--- a/net/netfilter/xt_NFQUEUE.c
+++ b/net/netfilter/xt_NFQUEUE.c
@@ -147,6 +147,7 @@ nfqueue_tg_v3(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_NFQ_info_v3 *info = par->targinfo;
 	u32 queue = info->queuenum;
+	int ret;
 
 	if (info->queues_total > 1) {
 		if (info->flags & NFQ_FLAG_CPU_FANOUT) {
@@ -157,7 +158,11 @@ nfqueue_tg_v3(struct sk_buff *skb, const struct xt_action_param *par)
 			queue = nfqueue_hash(skb, par);
 	}
 
-	return NF_QUEUE_NR(queue);
+	ret = NF_QUEUE_NR(queue);
+	if (info->flags & NFQ_FLAG_BYPASS)
+		ret |= NF_VERDICT_FLAG_QUEUE_BYPASS;
+
+	return ret;
 }
 
 static struct xt_target nfqueue_tg_reg[] __read_mostly = {
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] netfilter fix for net
From: Pablo Neira Ayuso @ 2013-10-29 12:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

This pull request contains the following netfilter fix:

* fix --queue-bypass in xt_NFQUEUE revision 3. While adding the
  revision 3 of this target, the bypass flags were not correctly
  handled anymore, thus, breaking packet bypassing if no application
  is listening from userspace, patch from Holger Eitzenberger,
  reported by Florian Westphal.

You can pull this change from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

Thanks!

----------------------------------------------------------------

The following changes since commit fecda03493646b53f53892fa3c38c75ba9310374:

  net: sctp: fix ASCONF to allow non SCTP_ADDR_SRC addresses in ipv6 (2013-10-23 16:57:14 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

for you to fetch changes up to d954777324ffcba0b2f8119c102237426c654eeb:

  netfilter: xt_NFQUEUE: fix --queue-bypass regression (2013-10-29 13:05:54 +0100)

----------------------------------------------------------------
Holger Eitzenberger (1):
      netfilter: xt_NFQUEUE: fix --queue-bypass regression

 net/netfilter/xt_NFQUEUE.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

^ permalink raw reply

* Re: [PATCH net-next 0/2] Removal of struct esp_data
From: Steffen Klassert @ 2013-10-29 12:11 UTC (permalink / raw)
  To: David Miller; +Cc: mathias.krause, netdev, herbert
In-Reply-To: <20131023.154002.1833051243694861649.davem@davemloft.net>

On Wed, Oct 23, 2013 at 03:40:02PM -0400, David Miller wrote:
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Wed, 23 Oct 2013 10:07:16 +0200
> 
> > Well, I thought to either take this as a reminder to implement the
> > missing stuff or to take the removing patches if this is really obsolete.
> > I'll do one of both once I'm back from conference week in Edinburgh.
> 
> Sounds like a good plan, thanks.

RFC 4303 recommends to use TFC padding instead of the padding field:

'As noted above, the Padding field is limited to 255 bytes in length.
This generally will not be adequate to hide traffic characteristics
relative to traffic flow confidentiality requirements.  An optional
field, within the payload data, is provided specifically to address
the TFC requirement.'

So looks like we can remove the padlen field, both patches
applied to ipsec-next. Thanks Mathias!

^ permalink raw reply

* Re: [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: Erik Hugne @ 2013-10-29 12:07 UTC (permalink / raw)
  To: David Laight; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73BC@saturn3.aculab.com>

On Tue, Oct 29, 2013 at 11:13:09AM +0000, David Laight wrote:
> I can only see one 'goto cont', an explicit kfree_skb() and
> continue would be clearer.

There's actually a few more 'goto cont' in the early error checks
in this function causes it to bail early. Those are not visible
in the patchdiff however.

//E

------------------------------------------------------------------------------
Android is increasing in popularity, but the open development platform that
developers love is also attractive to malware creators. Download this white
paper to learn more about secure code signing practices that can help keep
Android apps secure.
http://pubads.g.doubleclick.net/gampad/clk?id=65839951&iu=/4140/ostg.clktrk

^ permalink raw reply

* Re: [PATCH net-next] ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
From: Hannes Frederic Sowa @ 2013-10-29 12:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fweimer
In-Reply-To: <20131029.000844.1092862708536984032.davem@davemloft.net>

On Tue, Oct 29, 2013 at 12:08:44AM -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Sat, 26 Oct 2013 22:11:58 +0200
> 
> > Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
> > their sockets won't accept and install new path mtu information and they
> > will always use the interface mtu for outgoing packets. It is guaranteed
> > that the packet is not fragmented locally. But we won't set the DF-Flag
> > on the outgoing frames.
> > 
> > Florian Weimer had the idea to use this flag to ensure DNS servers are
> > never generating outgoing fragments. They may well be fragmented on the
> > path, but the server never stores or usees path mtu values, which could
> > well be forged in an attack.
> > 
> > (The root of the problem with path MTU discovery is that there is
> > no reliable way to authenticate ICMP Fragmentation Needed But DF Set
> > messages because they are sent from intermediate routers with their
> > source addresses, and the IMCP payload will not always contain sufficient
> > information to identify a flow.)
> 
> I do not like this reasoning.  You have several more acceptable paths to take
> to resolve this problem:

So, I hope I can convince you to merge this patch. ;-)

I really tried hard to find alternatives or even a way to enable
the protection automatically given that at least unbound does apply
IP_PMTUDISC_DONT to its sockets already. These are the reasons why I
came up with the new IP_PMTUDISC_INTERFACE value:

> 1) "I don't trust path MTU information at all"
> 
>    Just turn it off globally, end of story.  It has the same effect as your
>    new per-application mode.

I think you are referring to ipv4/ip_no_pmtu_disc?

This setting actually worsens the situation:

0.000 `echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc`
0.500 `ethtool -K tun0 ufo off`

/* check current mtu with connected socket */
1.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
1.000 connect(3, ..., ...) = 0
1.000 getsockopt(3, IPPROTO_IP, IP_MTU, [1500], [4]) = 0
1.000 close(3) = 0

/* inject frag_needed to unconnected socket */
2.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
2.000 setsockopt(3, IPPROTO_IP, IP_MTU_DISCOVER, [IP_PMTUDISC_DONT], 4) = 0
2.000 sendto(3, ..., 1000, 0, ..., ...) = 1000
2.000 > udp (1000)
/* this simulates a forged icmp packet which tries to lower the mtu */
2.100 < [udp (1000)] icmp unreachable frag_needed mtu 1400

2.5 sendto(3, ..., 1000, 0, ..., ...) = 1000
/* first fragment splitted at 522 bytes */
2.5 > udp(522)
/* didn't find a way to match for the second fragment - so error out here
 * comment out this check to check for the mtu a IP_PMTUDISC_DONT socket now has
 *
 * it really is used, otherwise we would not have seen this fragment
 */

3.000 close(3) = 0

/* verify mtu again */
3.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
3.000 connect(3, ..., ...) = 0
/* we actually have 522 as the minimal mtu - so error out here */
3.000 getsockopt(3, IPPROTO_IP, IP_MTU, [1400], [4]) = 0
3.000 close(3) = 0

5.000 `echo 0 > /proc/sys/net/ipv4/ip_no_pmtu_disc`

Given this packetdrill snippet we see it is actually very
counterproductive to tell people to globally disable pmtu discovery as
it will certainly generate more fragments.

The reason for this is, that __ip_rt_update_pmtu is called with zero mtu
and will get clamped to min_pmtu. It is possible to tell people to also
increase min_pmtu but I don't want to take the risk of breaking other
applications on a DNS server given how the current semantics are.

If we receive a pmtu event on a TCP or DCCP socket we actually
check for the pmtudisc setting and won't apply it if it is set to
IP_PMTUDISC_DONT. This is not checked on UDP and RAW sockets and it was
very hard to think about the consequences to introduce such checks now
(at least for me).

Adding a new sysctl would be a viable option but it is harder to use
by DNS server implementations and could be fiddled with by users or
scripts etc.  I dislike a globel (or even per namespace) setting. Also
Fedora e.g. installs an unbound server on Desktops if one installs
the dnssec-trigger package.  Maybe we will see more DNS servers on
workstations where we don't want to globally disable pmtu.

> 2) "I don't trust path MTU information unless the full socket ID is available
>     in the ICMP packets quoted headers"
> 
>    Then simply implement a policy as such and submit it to me.

We don't trust the MTU information in any case for that socket.

It seems easy to inject wrong path mtu information on a socket which will
be installed as a globel fnhe for 600 seconds. This is a very viable
target if an attacker is able to do cache poisoning afterwards. There
is not much protection even if we match the full socket ID on an
UDP socket. As this socket is more exposed to attacks we drop the mtu
information there. Of course, other ports could also install wrong pmtu
information causing packets being fragmented so we always use the path
mtu and don't fragment locally.  Dropping the pmtu information on these
sockets just makes it more symmetrical and causes less stress on the fnhe
hash tables if people do these attacks (they could well be running these
attacks in parallel). If the DNS port is the only one reachable externally
we won't do any pmtu processing on that server from untrusted peers, which is
a nice side effect.

All that said, I hope dns servers provide a setting to apply this option
on sockets and fail to start up if it could not be applied. So an
administrator is not in some sense of false hope that this protection
is enabled when running on an older kernel. This could only be done if we
provide a new value for IP_MTU_DISCOVER.

Greetings,

  Hannes

^ permalink raw reply

* Re: IPV6 nf defrag does not work
From: Jiri Pirko @ 2013-10-29 12:03 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, pablo, netfilter-devel, yoshfuji, kadlec, kaber
In-Reply-To: <20131029115617.GA16615@breakpoint.cc>

Tue, Oct 29, 2013 at 12:56:17PM CET, fw@strlen.de wrote:
>Jiri Pirko <jiri@resnulli.us> wrote:
>> On the current net-next if you on HOSTA do:
>> ip6tables -I INPUT -p icmpv6 -j DROP
>> ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
>> 
>> and on HOSTB you do:
>> ping6 HOSTA -s2000    (MTU is 1500)
>> 
>> Only the first ICMP echo request will be passed through, the rest is not
>> passed on HOSTA. This issue does not occur with smaller packets than MTU (where
>> fragmentation does not happen).
>>
>> I'm trying to find out where the problem is.
>
>Are you sure this is new behaviour? As far back as I can remember
>it was always like this.

Yes. This is not new.

>
>in ip6tables, the individual fragments are sent through the ruleset,
>iow. you'll need to make use of '-m conntrack' to match the fragments
>belonging to an existing connection.

Hmm. I think that it is not correct to force user (iptables user) to
make dirrerent rules because some ipv6 packets might be fragmented.
This should be handled in kernel.

>
>I don't know why this is, and I don't like this either.
>But this is how it was implemented, see
>
>net/ipv6/netfilter/nf_defrag_ipv6_hooks.c, ipv6_defrag() ->
>nf_ct_frag6_output()

Yep. I'm studying the code atm.



^ permalink raw reply

* [net-next  11/11] i40e: fix error return code in i40e_probe()
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Wei Yongjun, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Fix to return -ENOMEM in the memory alloc error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 41a79df..be15938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7204,8 +7204,10 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	 */
 	len = sizeof(struct i40e_vsi *) * pf->hw.func_caps.num_vsis;
 	pf->vsi = kzalloc(len, GFP_KERNEL);
-	if (!pf->vsi)
+	if (!pf->vsi) {
+		err = -ENOMEM;
 		goto err_switch_setup;
+	}
 
 	err = i40e_setup_pf_switch(pf);
 	if (err) {
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  10/11] ixgbevf: Add zero_base handler to network statistics
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Alexander Duyck,
	Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

This patch removes the need to keep a zero_base variable in the adapter
structure. Now we just use two different macros to set the non-zero and
zero base. This adds to readability and shortens some of the structure
initialization under 80 columns. The gathering of status for ethtool was
slightly modified to again better fit into 80 columns and become a bit
more readable.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ethtool.c | 78 ++++++++++++++++------------
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h |  1 -
 2 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ethtool.c b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
index 44f7c42..54d9ace 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
@@ -45,16 +45,27 @@
 
 struct ixgbe_stats {
 	char stat_string[ETH_GSTRING_LEN];
-	int sizeof_stat;
-	int stat_offset;
-	int base_stat_offset;
-	int saved_reset_offset;
+	struct {
+		int sizeof_stat;
+		int stat_offset;
+		int base_stat_offset;
+		int saved_reset_offset;
+	};
 };
 
-#define IXGBEVF_STAT(m, b, r)  sizeof(((struct ixgbevf_adapter *)0)->m), \
-			    offsetof(struct ixgbevf_adapter, m),         \
-			    offsetof(struct ixgbevf_adapter, b),         \
-			    offsetof(struct ixgbevf_adapter, r)
+#define IXGBEVF_STAT(m, b, r) { \
+	.sizeof_stat = FIELD_SIZEOF(struct ixgbevf_adapter, m), \
+	.stat_offset = offsetof(struct ixgbevf_adapter, m), \
+	.base_stat_offset = offsetof(struct ixgbevf_adapter, b), \
+	.saved_reset_offset = offsetof(struct ixgbevf_adapter, r) \
+}
+
+#define IXGBEVF_ZSTAT(m) { \
+	.sizeof_stat = FIELD_SIZEOF(struct ixgbevf_adapter, m), \
+	.stat_offset = offsetof(struct ixgbevf_adapter, m), \
+	.base_stat_offset = -1, \
+	.saved_reset_offset = -1 \
+}
 
 static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 	{"rx_packets", IXGBEVF_STAT(stats.vfgprc, stats.base_vfgprc,
@@ -65,22 +76,19 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 				  stats.saved_reset_vfgorc)},
 	{"tx_bytes", IXGBEVF_STAT(stats.vfgotc, stats.base_vfgotc,
 				  stats.saved_reset_vfgotc)},
-	{"tx_busy", IXGBEVF_STAT(tx_busy, zero_base, zero_base)},
+	{"tx_busy", IXGBEVF_ZSTAT(tx_busy)},
 	{"multicast", IXGBEVF_STAT(stats.vfmprc, stats.base_vfmprc,
 				   stats.saved_reset_vfmprc)},
-	{"rx_csum_offload_good", IXGBEVF_STAT(hw_csum_rx_good, zero_base,
-					      zero_base)},
-	{"rx_csum_offload_errors", IXGBEVF_STAT(hw_csum_rx_error, zero_base,
-						zero_base)},
-	{"tx_csum_offload_ctxt", IXGBEVF_STAT(hw_csum_tx_good, zero_base,
-					      zero_base)},
+	{"rx_csum_offload_good", IXGBEVF_ZSTAT(hw_csum_rx_good)},
+	{"rx_csum_offload_errors", IXGBEVF_ZSTAT(hw_csum_rx_error)},
+	{"tx_csum_offload_ctxt", IXGBEVF_ZSTAT(hw_csum_tx_good)},
 #ifdef BP_EXTENDED_STATS
-	{"rx_bp_poll_yield", IXGBEVF_STAT(bp_rx_yields, zero_base, zero_base)},
-	{"rx_bp_cleaned", IXGBEVF_STAT(bp_rx_cleaned, zero_base, zero_base)},
-	{"rx_bp_misses", IXGBEVF_STAT(bp_rx_missed, zero_base, zero_base)},
-	{"tx_bp_napi_yield", IXGBEVF_STAT(bp_tx_yields, zero_base, zero_base)},
-	{"tx_bp_cleaned", IXGBEVF_STAT(bp_tx_cleaned, zero_base, zero_base)},
-	{"tx_bp_misses", IXGBEVF_STAT(bp_tx_missed, zero_base, zero_base)},
+	{"rx_bp_poll_yield", IXGBEVF_ZSTAT(bp_rx_yields)},
+	{"rx_bp_cleaned", IXGBEVF_ZSTAT(bp_rx_cleaned)},
+	{"rx_bp_misses", IXGBEVF_ZSTAT(bp_rx_missed)},
+	{"tx_bp_napi_yield", IXGBEVF_ZSTAT(bp_tx_yields)},
+	{"tx_bp_cleaned", IXGBEVF_ZSTAT(bp_tx_cleaned)},
+	{"tx_bp_misses", IXGBEVF_ZSTAT(bp_tx_missed)},
 #endif
 };
 
@@ -398,6 +406,7 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 				      struct ethtool_stats *stats, u64 *data)
 {
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
+	char *base = (char *) adapter;
 	int i;
 #ifdef BP_EXTENDED_STATS
 	u64 rx_yields = 0, rx_cleaned = 0, rx_missed = 0,
@@ -426,18 +435,21 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 
 	ixgbevf_update_stats(adapter);
 	for (i = 0; i < IXGBE_GLOBAL_STATS_LEN; i++) {
-		char *p = (char *)adapter +
-			ixgbe_gstrings_stats[i].stat_offset;
-		char *b = (char *)adapter +
-			ixgbe_gstrings_stats[i].base_stat_offset;
-		char *r = (char *)adapter +
-			ixgbe_gstrings_stats[i].saved_reset_offset;
-		data[i] = ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)p : *(u32 *)p) -
-			  ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)b : *(u32 *)b) +
-			  ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)r : *(u32 *)r);
+		char *p = base + ixgbe_gstrings_stats[i].stat_offset;
+		char *b = base + ixgbe_gstrings_stats[i].base_stat_offset;
+		char *r = base + ixgbe_gstrings_stats[i].saved_reset_offset;
+
+		if (ixgbe_gstrings_stats[i].sizeof_stat == sizeof(u64)) {
+			if (ixgbe_gstrings_stats[i].base_stat_offset >= 0)
+				data[i] = *(u64 *)p - *(u64 *)b + *(u64 *)r;
+			else
+				data[i] = *(u64 *)p;
+		} else {
+			if (ixgbe_gstrings_stats[i].base_stat_offset >= 0)
+				data[i] = *(u32 *)p - *(u32 *)b + *(u32 *)r;
+			else
+				data[i] = *(u32 *)p;
+		}
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 78e4c51..acf3806 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -361,7 +361,6 @@ struct ixgbevf_adapter {
 	struct ixgbe_hw hw;
 	u16 msg_enable;
 	struct ixgbevf_hw_stats stats;
-	u64 zero_base;
 	/* Interrupt Throttle Rate */
 	u32 eitr_param;
 
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  09/11] ixgbevf: add BP_EXTENDED_STATS for CONFIG_NET_RX_BUSY_POLL
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch adds the extended statistics similar to the ixgbe driver. These
statistics keep track of how often the busy polling yields, as well as how many
packets are cleaned or missed by the polling routine.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ethtool.c      | 32 +++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 22 ++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  6 +++++
 3 files changed, 60 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ethtool.c b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
index 21adb1b..44f7c42 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
@@ -74,6 +74,14 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 						zero_base)},
 	{"tx_csum_offload_ctxt", IXGBEVF_STAT(hw_csum_tx_good, zero_base,
 					      zero_base)},
+#ifdef BP_EXTENDED_STATS
+	{"rx_bp_poll_yield", IXGBEVF_STAT(bp_rx_yields, zero_base, zero_base)},
+	{"rx_bp_cleaned", IXGBEVF_STAT(bp_rx_cleaned, zero_base, zero_base)},
+	{"rx_bp_misses", IXGBEVF_STAT(bp_rx_missed, zero_base, zero_base)},
+	{"tx_bp_napi_yield", IXGBEVF_STAT(bp_tx_yields, zero_base, zero_base)},
+	{"tx_bp_cleaned", IXGBEVF_STAT(bp_tx_cleaned, zero_base, zero_base)},
+	{"tx_bp_misses", IXGBEVF_STAT(bp_tx_missed, zero_base, zero_base)},
+#endif
 };
 
 #define IXGBE_QUEUE_STATS_LEN 0
@@ -391,6 +399,30 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 {
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
 	int i;
+#ifdef BP_EXTENDED_STATS
+	u64 rx_yields = 0, rx_cleaned = 0, rx_missed = 0,
+	    tx_yields = 0, tx_cleaned = 0, tx_missed = 0;
+
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		rx_yields += adapter->rx_ring[i].bp_yields;
+		rx_cleaned += adapter->rx_ring[i].bp_cleaned;
+		rx_yields += adapter->rx_ring[i].bp_yields;
+	}
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		tx_yields += adapter->tx_ring[i].bp_yields;
+		tx_cleaned += adapter->tx_ring[i].bp_cleaned;
+		tx_yields += adapter->tx_ring[i].bp_yields;
+	}
+
+	adapter->bp_rx_yields = rx_yields;
+	adapter->bp_rx_cleaned = rx_cleaned;
+	adapter->bp_rx_missed = rx_missed;
+
+	adapter->bp_tx_yields = tx_yields;
+	adapter->bp_tx_cleaned = tx_cleaned;
+	adapter->bp_tx_missed = tx_missed;
+#endif
 
 	ixgbevf_update_stats(adapter);
 	for (i = 0; i < IXGBE_GLOBAL_STATS_LEN; i++) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 85c7560..78e4c51 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -40,6 +40,7 @@
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 #include <net/busy_poll.h>
+#define BP_EXTENDED_STATS
 #endif
 
 /* wrapper around a pointer to a socket buffer,
@@ -80,6 +81,11 @@ struct ixgbevf_ring {
 	struct u64_stats_sync	syncp;
 	u64 hw_csum_rx_error;
 	u64 hw_csum_rx_good;
+#ifdef BP_EXTENDED_STATS
+	u64 bp_yields;
+	u64 bp_misses;
+	u64 bp_cleaned;
+#endif
 
 	u16 head;
 	u16 tail;
@@ -181,6 +187,9 @@ static inline bool ixgbevf_qv_lock_napi(struct ixgbevf_q_vector *q_vector)
 		WARN_ON(q_vector->state & IXGBEVF_QV_STATE_NAPI);
 		q_vector->state |= IXGBEVF_QV_STATE_NAPI_YIELD;
 		rc = false;
+#ifdef BP_EXTENDED_STATS
+		q_vector->tx.ring->bp_yields++;
+#endif
 	} else {
 		/* we don't care if someone yielded */
 		q_vector->state = IXGBEVF_QV_STATE_NAPI;
@@ -213,6 +222,9 @@ static inline bool ixgbevf_qv_lock_poll(struct ixgbevf_q_vector *q_vector)
 	if ((q_vector->state & IXGBEVF_QV_LOCKED)) {
 		q_vector->state |= IXGBEVF_QV_STATE_POLL_YIELD;
 		rc = false;
+#ifdef BP_EXTENDED_STATS
+		q_vector->rx.ring->bp_yields++;
+#endif
 	} else {
 		/* preserve yield marks */
 		q_vector->state |= IXGBEVF_QV_STATE_POLL;
@@ -358,6 +370,16 @@ struct ixgbevf_adapter {
 	unsigned int tx_ring_count;
 	unsigned int rx_ring_count;
 
+#ifdef BP_EXTENDED_STATS
+	u64 bp_rx_yields;
+	u64 bp_rx_cleaned;
+	u64 bp_rx_missed;
+
+	u64 bp_tx_yields;
+	u64 bp_tx_cleaned;
+	u64 bp_tx_missed;
+#endif
+
 	u32 link_speed;
 	bool link_up;
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index bc03a2e..b16b694 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -648,6 +648,12 @@ static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
 
 	ixgbevf_for_each_ring(ring, q_vector->rx) {
 		found = ixgbevf_clean_rx_irq(q_vector, ring, 4);
+#ifdef BP_EXTENDED_STATS
+		if (found)
+			ring->bp_cleaned += found;
+		else
+			ring->bp_misses++;
+#endif
 		if (found)
 			break;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  08/11] ixgbevf: implement CONFIG_NET_RX_BUSY_POLL
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch enables CONFIG_NET_RX_BUSY_POLL support in the VF code. This enables
sockets which have enabled the SO_BUSY_POLL socket option to use the
ndo_busy_poll_recv operation which could result in lower latency, at the cost
of higher CPU utilization, and increased power usage. This support is similar
to how the ixgbe driver works.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 109 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  68 ++++++++++++++
 2 files changed, 177 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index d7837dcc..85c7560 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -38,6 +38,10 @@
 
 #include "vf.h"
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+#include <net/busy_poll.h>
+#endif
+
 /* wrapper around a pointer to a socket buffer,
  * so a DMA handle can be stored along with the buffer */
 struct ixgbevf_tx_buffer {
@@ -145,7 +149,112 @@ struct ixgbevf_q_vector {
 	struct napi_struct napi;
 	struct ixgbevf_ring_container rx, tx;
 	char name[IFNAMSIZ + 9];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned int state;
+#define IXGBEVF_QV_STATE_IDLE		0
+#define IXGBEVF_QV_STATE_NAPI		1    /* NAPI owns this QV */
+#define IXGBEVF_QV_STATE_POLL		2    /* poll owns this QV */
+#define IXGBEVF_QV_STATE_DISABLED	4    /* QV is disabled */
+#define IXGBEVF_QV_OWNED (IXGBEVF_QV_STATE_NAPI | IXGBEVF_QV_STATE_POLL)
+#define IXGBEVF_QV_LOCKED (IXGBEVF_QV_OWNED | IXGBEVF_QV_STATE_DISABLED)
+#define IXGBEVF_QV_STATE_NAPI_YIELD	8    /* NAPI yielded this QV */
+#define IXGBEVF_QV_STATE_POLL_YIELD	16   /* poll yielded this QV */
+#define IXGBEVF_QV_YIELD (IXGBEVF_QV_STATE_NAPI_YIELD | IXGBEVF_QV_STATE_POLL_YIELD)
+#define IXGBEVF_QV_USER_PEND (IXGBEVF_QV_STATE_POLL | IXGBEVF_QV_STATE_POLL_YIELD)
+	spinlock_t lock;
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 };
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline void ixgbevf_qv_init_lock(struct ixgbevf_q_vector *q_vector)
+{
+
+	spin_lock_init(&q_vector->lock);
+	q_vector->state = IXGBEVF_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline bool ixgbevf_qv_lock_napi(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if (q_vector->state & IXGBEVF_QV_LOCKED) {
+		WARN_ON(q_vector->state & IXGBEVF_QV_STATE_NAPI);
+		q_vector->state |= IXGBEVF_QV_STATE_NAPI_YIELD;
+		rc = false;
+	} else {
+		/* we don't care if someone yielded */
+		q_vector->state = IXGBEVF_QV_STATE_NAPI;
+	}
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* returns true is someone tried to get the qv while napi had it */
+static inline bool ixgbevf_qv_unlock_napi(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = false;
+	spin_lock_bh(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBEVF_QV_STATE_POLL |
+				   IXGBEVF_QV_STATE_NAPI_YIELD));
+
+	if (q_vector->state & IXGBEVF_QV_STATE_POLL_YIELD)
+		rc = true;
+	/* reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBEVF_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* called from ixgbevf_low_latency_poll() */
+static inline bool ixgbevf_qv_lock_poll(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if ((q_vector->state & IXGBEVF_QV_LOCKED)) {
+		q_vector->state |= IXGBEVF_QV_STATE_POLL_YIELD;
+		rc = false;
+	} else {
+		/* preserve yield marks */
+		q_vector->state |= IXGBEVF_QV_STATE_POLL;
+	}
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline bool ixgbevf_qv_unlock_poll(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = false;
+	spin_lock_bh(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBEVF_QV_STATE_NAPI));
+
+	if (q_vector->state & IXGBEVF_QV_STATE_POLL_YIELD)
+		rc = true;
+	/* reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBEVF_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline bool ixgbevf_qv_busy_polling(struct ixgbevf_q_vector *q_vector)
+{
+	WARN_ON(!(q_vector->state & IXGBEVF_QV_OWNED));
+	return q_vector->state & IXGBEVF_QV_USER_PEND;
+}
+
+/* false if QV is currently owned */
+static inline bool ixgbevf_qv_disable(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if (q_vector->state & IXGBEVF_QV_OWNED)
+		rc = false;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 
 /*
  * microsecond values for various ITR rates shifted by 2 to fit itr register
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 06e29bd..bc03a2e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -310,6 +310,16 @@ static void ixgbevf_rx_skb(struct ixgbevf_q_vector *q_vector,
 			   struct sk_buff *skb, u8 status,
 			   union ixgbe_adv_rx_desc *rx_desc)
 {
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	skb_mark_napi_id(skb, &q_vector->napi);
+
+	if (ixgbevf_qv_busy_polling(q_vector)) {
+		netif_receive_skb(skb);
+		/* exit early if we busy polled */
+		return;
+	}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 	ixgbevf_receive_skb(q_vector, skb, status, rx_desc);
 }
 
@@ -563,6 +573,11 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 	ixgbevf_for_each_ring(ring, q_vector->tx)
 		clean_complete &= ixgbevf_clean_tx_irq(q_vector, ring);
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (!ixgbevf_qv_lock_napi(q_vector))
+		return budget;
+#endif
+
 	/* attempt to distribute budget to each queue fairly, but don't allow
 	 * the budget to go below 1 because we'll exit polling */
 	if (q_vector->rx.count > 1)
@@ -577,6 +592,10 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 				   < per_ring_budget);
 	adapter->flags &= ~IXGBE_FLAG_IN_NETPOLL;
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	ixgbevf_qv_unlock_napi(q_vector);
+#endif
+
 	/* If all work not completed, return budget and keep polling */
 	if (!clean_complete)
 		return budget;
@@ -611,6 +630,34 @@ void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector)
 	IXGBE_WRITE_REG(hw, IXGBE_VTEITR(v_idx), itr_reg);
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/* must be called with local_bh_disable()d */
+static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
+{
+	struct ixgbevf_q_vector *q_vector =
+			container_of(napi, struct ixgbevf_q_vector, napi);
+	struct ixgbevf_adapter *adapter = q_vector->adapter;
+	struct ixgbevf_ring  *ring;
+	int found = 0;
+
+	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
+		return LL_FLUSH_FAILED;
+
+	if (!ixgbevf_qv_lock_poll(q_vector))
+		return LL_FLUSH_BUSY;
+
+	ixgbevf_for_each_ring(ring, q_vector->rx) {
+		found = ixgbevf_clean_rx_irq(q_vector, ring, 4);
+		if (found)
+			break;
+	}
+
+	ixgbevf_qv_unlock_poll(q_vector);
+
+	return found;
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 /**
  * ixgbevf_configure_msix - Configure MSI-X hardware
  * @adapter: board private structure
@@ -1297,6 +1344,9 @@ static void ixgbevf_napi_enable_all(struct ixgbevf_adapter *adapter)
 
 	for (q_idx = 0; q_idx < q_vectors; q_idx++) {
 		q_vector = adapter->q_vector[q_idx];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		ixgbevf_qv_init_lock(adapter->q_vector[q_idx]);
+#endif
 		napi_enable(&q_vector->napi);
 	}
 }
@@ -1310,6 +1360,12 @@ static void ixgbevf_napi_disable_all(struct ixgbevf_adapter *adapter)
 	for (q_idx = 0; q_idx < q_vectors; q_idx++) {
 		q_vector = adapter->q_vector[q_idx];
 		napi_disable(&q_vector->napi);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		while (!ixgbevf_qv_disable(adapter->q_vector[q_idx])) {
+			pr_info("QV %d locked\n", q_idx);
+			usleep_range(1000, 20000);
+		}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 	}
 }
 
@@ -1960,6 +2016,9 @@ static int ixgbevf_alloc_q_vectors(struct ixgbevf_adapter *adapter)
 		q_vector->v_idx = q_idx;
 		netif_napi_add(adapter->netdev, &q_vector->napi,
 			       ixgbevf_poll, 64);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_add(&q_vector->napi);
+#endif
 		adapter->q_vector[q_idx] = q_vector;
 	}
 
@@ -1969,6 +2028,9 @@ err_out:
 	while (q_idx) {
 		q_idx--;
 		q_vector = adapter->q_vector[q_idx];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_del(&q_vector->napi);
+#endif
 		netif_napi_del(&q_vector->napi);
 		kfree(q_vector);
 		adapter->q_vector[q_idx] = NULL;
@@ -1992,6 +2054,9 @@ static void ixgbevf_free_q_vectors(struct ixgbevf_adapter *adapter)
 		struct ixgbevf_q_vector *q_vector = adapter->q_vector[q_idx];
 
 		adapter->q_vector[q_idx] = NULL;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_del(&q_vector->napi);
+#endif
 		netif_napi_del(&q_vector->napi);
 		kfree(q_vector);
 	}
@@ -3323,6 +3388,9 @@ static const struct net_device_ops ixgbevf_netdev_ops = {
 	.ndo_tx_timeout		= ixgbevf_tx_timeout,
 	.ndo_vlan_rx_add_vid	= ixgbevf_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= ixgbevf_vlan_rx_kill_vid,
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	.ndo_busy_poll		= ixgbevf_busy_poll_recv,
+#endif
 };
 
 static void ixgbevf_assign_netdev_ops(struct net_device *dev)
-- 
1.8.3.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox