Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: David Laight @ 2013-10-29 11:13 UTC (permalink / raw)
  To: Ying Xue, davem, maloy
  Cc: Paul.Gortmaker, jon.maloy, erik.hugne, tipc-discussion, netdev
In-Reply-To: <1383044471-12982-1-git-send-email-ying.xue@windriver.com>

> The message dispatching part of tipc_recv_msg() is wrapped layers of
> while/if/if/switch, causing out-of-control indentation and does not
> look very good. We reduce two indentation levels by separating the
> message dispatching from the blocks that checks link state and
> sequence numbers, allowing longer function and arg names to be
> consistently indented without wrapping. This is a cosmetic change
> that does not alter the operation of TIPC in any way.
...
> +			tipc_node_unlock(n_ptr);
> +			goto cont;
> +		}
...
>		continue;
>  cont:
>  		kfree_skb(buf);
>  	}

I can only see one 'goto cont', an explicit kfree_skb() and
continue would be clearer.

	David

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-29 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029082542.GA24625@gmail.com>

On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Heres my data for running the same test with taskset restricting 
> > execution to only cpu0.  I'm not quite sure whats going on here, 
> > but doing so resulted in a 10x slowdown of the runtime of each 
> > iteration which I can't explain.  As before however, both the 
> > parallel alu run and the prefetch run resulted in speedups, but 
> > the two together were not in any way addative.  I'm going to keep 
> > playing with the prefetch stride, unless you have an alternate 
> > theory.
> 
> Could you please cite the exact command-line you used for running 
> the test?
> 
> Thanks,
> 
> 	Ingo
> 

Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1

where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire


As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discusing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation.  As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all irq
affinity to cpu 0.

Regards
Neil

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Ingo Molnar @ 2013-10-29 11:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029112022.GA24477@neilslaptop.think-freely.org>


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
> 
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire

What does '-- /root/test.sh' do?

Unless I'm missing something, the line above will run:

  perf bench sched messaging -- /root/test.sh

which should be equivalent to:

  perf bench sched messaging

i.e. /root/test.sh won't be run.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-29 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131029113031.GA16897@gmail.com>

On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> > 
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
> 
> What does '-- /root/test.sh' do?
> 
> Unless I'm missing something, the line above will run:
> 
>   perf bench sched messaging -- /root/test.sh
> 
> which should be equivalent to:
> 
>   perf bench sched messaging
> 
> i.e. /root/test.sh won't be run.
> 
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run.  And it definately
executed test.sh, I added an echo to stdout in there as a test run and observed
them get captured in counters.txt

Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply

* Re: IPV6 nf defrag does not work
From: Florian Westphal @ 2013-10-29 11:56 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, pablo, netfilter-devel, yoshfuji, kadlec, kaber
In-Reply-To: <20131029105208.GA18526@minipsycho.orion>

Jiri Pirko <jiri@resnulli.us> wrote:
> On the current net-next if you on HOSTA do:
> ip6tables -I INPUT -p icmpv6 -j DROP
> ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
> 
> and on HOSTB you do:
> ping6 HOSTA -s2000    (MTU is 1500)
> 
> Only the first ICMP echo request will be passed through, the rest is not
> passed on HOSTA. This issue does not occur with smaller packets than MTU (where
> fragmentation does not happen).
>
> I'm trying to find out where the problem is.

Are you sure this is new behaviour? As far back as I can remember
it was always like this.

in ip6tables, the individual fragments are sent through the ruleset,
iow. you'll need to make use of '-m conntrack' to match the fragments
belonging to an existing connection.

I don't know why this is, and I don't like this either.
But this is how it was implemented, see

net/ipv6/netfilter/nf_defrag_ipv6_hooks.c, ipv6_defrag() ->
nf_ct_frag6_output()

^ permalink raw reply

* [net-next  00/11][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series contains updates to vxlan, net, ixgbe, ixgbevf, and i40e.

Joseph provides a single patch against vxlan which removes the burden
from the NIC drivers to check if the vxlan driver is enabled in the
kernel and also makes available the vxlan headrooms to the drivers.

Jacob provides majority of the patches, with patches against net, ixgbe
and ixgbevf.  His net patch adds might_sleep() call to napi_disable so
that every use of napi_disable during atomic context will be visible.
Then Jacob provides a patch to fix qv_lock_napi call in
ixgbe_napi_disable_all.  The other ixgbe patches cleanup
ixgbe_check_minimum_link function to correctly show that there are some
minor loss of encoding, even though we don't calculate it and remove
unnecessary duplication of PCIe bandwidth display.  Lastly, Jacob
provides 4 patches against ixgbevf to add ixgbevf_rx_skb in line with
how ixgbe handles the variations on how packets can be received, adds
support in order to track how many packets were cleaned during busy poll
as part of the extended statistics.

Wei Yongjun provides a fix for i40e to return -ENOMEN in the memory
allocation error handling case instead of returning 0, as done
elsewhere in this function.

The following are changes since commit cdfb97bc010d9e9d994eb68f2cbac3a8ada26104:
  net, mc: fix the incorrect comments in two mc-related functions
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Don Skidmore (1):
  ixgbevf: Add zero_base handler to network statistics

Jacob Keller (8):
  net: add might_sleep() call to napi_disable
  ixgbe: fix qv_lock_napi call in ixgbe_napi_disable_all
  ixgbe: show <2% for encoding loss on PCIe Gen3
  ixgbe: remove unnecessary duplication of PCIe bandwidth display
  ixgbevf: add ixgbevf_rx_skb
  ixgbevf: have clean_rx_irq return total_rx_packets cleaned
  ixgbevf: implement CONFIG_NET_RX_BUSY_POLL
  ixgbevf: add BP_EXTENDED_STATS for CONFIG_NET_RX_BUSY_POLL

Joseph Gasparakis (1):
  vxlan: Have the NIC drivers do less work for offloads

Wei Yongjun (1):
  i40e: fix error return code in i40e_probe()

 drivers/net/ethernet/intel/i40e/i40e_main.c       |   4 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |  48 ++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  46 +++-----
 drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  98 +++++++++++-----
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 132 +++++++++++++++++++++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 103 +++++++++++++++--
 drivers/net/vxlan.c                               |   4 -
 include/linux/netdevice.h                         |   1 +
 include/net/vxlan.h                               |  11 ++
 9 files changed, 366 insertions(+), 81 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [net-next  01/11] vxlan: Have the NIC drivers do less work for offloads
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Joseph Gasparakis, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Joseph Gasparakis <joseph.gasparakis@intel.com>

This patch removes the burden from the NIC drivers to check if the
vxlan driver is enabled in the kernel and also makes available
the vxlan headrooms to them.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/vxlan.c |  4 ----
 include/net/vxlan.h | 11 +++++++++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 75a3a74..24260ce 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -60,10 +60,6 @@
 
 #define VXLAN_N_VID	(1u << 24)
 #define VXLAN_VID_MASK	(VXLAN_N_VID - 1)
-/* IP header + UDP + VXLAN + Ethernet header */
-#define VXLAN_HEADROOM (20 + 8 + 8 + 14)
-/* IPv6 header + UDP + VXLAN + Ethernet header */
-#define VXLAN6_HEADROOM (40 + 8 + 8 + 14)
 #define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
 
 #define VXLAN_FLAGS 0x08000000	/* struct vxlanhdr.vx_flags required value. */
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 2d64d3c..6b6d180 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -36,5 +36,16 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 
 __be16 vxlan_src_port(__u16 port_min, __u16 port_max, struct sk_buff *skb);
 
+/* IP header + UDP + VXLAN + Ethernet header */
+#define VXLAN_HEADROOM (20 + 8 + 8 + 14)
+/* IPv6 header + UDP + VXLAN + Ethernet header */
+#define VXLAN6_HEADROOM (40 + 8 + 8 + 14)
+
+#if IS_ENABLED(CONFIG_VXLAN)
 void vxlan_get_rx_port(struct net_device *netdev);
+#else
+static inline void vxlan_get_rx_port(struct net_device *netdev)
+{
+}
+#endif
 #endif
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  02/11] net: add might_sleep() call to napi_disable
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem
  Cc: Jacob Keller, netdev, gospo, sassmann, Eliezer Tamir,
	Alexander Duyck, Hyong-Youb Kim, Amir Vadai, Dmitry Kravkov,
	Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

napi_disable uses an msleep() call to wait for outstanding napi work to be
finished after setting the disable bit. It does not always sleep incase there
was no outstanding work. This resulted in a rare bug in ixgbe_down operation
where a napi_disable call took place inside of a local_bh_disable()d context.
In order to enable easier detection of future sleep while atomic BUGs, this
patch adds a might_sleep() call, so that every use of napi_disable during
atomic context will be visible.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@intel.com>
Cc: Hyong-Youb Kim <hykim@myri.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f62f7..cb1d918 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -483,6 +483,7 @@ void napi_hash_del(struct napi_struct *napi);
  */
 static inline void napi_disable(struct napi_struct *n)
 {
+	might_sleep();
 	set_bit(NAPI_STATE_DISABLE, &n->state);
 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
 		msleep(1);
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  04/11] ixgbe: show <2% for encoding loss on PCIe Gen3
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch updates the ixgbe_check_minimum_link function to correctly show that
there is some minor loss of encoding, even though we don't calculate it in the
max GT/s equation. It is small enough to not bother, but is better to report it
than not.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ee90dfb..9753c8a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -245,7 +245,7 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter *adapter,
 		max_gts = 4 * width;
 		break;
 	case PCIE_SPEED_8_0GT:
-		/* 128b/130b encoding only reduces throughput by 1% */
+		/* 128b/130b encoding reduces throughput by less than 2% */
 		max_gts = 8 * width;
 		break;
 	default:
@@ -263,7 +263,7 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter *adapter,
 		   width,
 		   (speed == PCIE_SPEED_2_5GT ? "20%" :
 		    speed == PCIE_SPEED_5_0GT ? "20%" :
-		    speed == PCIE_SPEED_8_0GT ? "N/a" :
+		    speed == PCIE_SPEED_8_0GT ? "<2%" :
 		    "Unknown"));
 
 	if (max_gts < expected_gts) {
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  03/11] ixgbe: fix qv_lock_napi call in ixgbe_napi_disable_all
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem
  Cc: Jacob Keller, netdev, gospo, sassmann, Eliezer Tamir,
	Alexander Duyck, Hyong-Youb Kim, Amir Vadai, Dmitry Kravkov,
	Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

ixgbe_napi_disable_all calls napi_disable on each queue, however the busy
polling code introduced a local_bh_disable()d context around the napi_disable.
The original author did not realize that napi_disable might sleep, which would
cause a sleep while atomic BUG. In addition, on a single processor system, the
ixgbe_qv_lock_napi loop shouldn't have to mdelay. This patch adds an
ixgbe_qv_disable along with a new IXGBE_QV_STATE_DISABLED bit, which it uses to
indicate to the poll and napi routines that the q_vector has been disabled. Now
the ixgbe_napi_disable_all function will wait until all pending work has been
finished and prevent any future work from being started.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@intel.com>
Cc: Hyong-Youb Kim <hykim@myri.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      | 48 ++++++++++++++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  6 ++--
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index dc1588e..f51fd1f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -369,11 +369,13 @@ struct ixgbe_q_vector {
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	unsigned int state;
 #define IXGBE_QV_STATE_IDLE        0
-#define IXGBE_QV_STATE_NAPI	   1    /* NAPI owns this QV */
-#define IXGBE_QV_STATE_POLL	   2    /* poll owns this QV */
-#define IXGBE_QV_LOCKED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
-#define IXGBE_QV_STATE_NAPI_YIELD  4    /* NAPI yielded this QV */
-#define IXGBE_QV_STATE_POLL_YIELD  8    /* poll yielded this QV */
+#define IXGBE_QV_STATE_NAPI	   1     /* NAPI owns this QV */
+#define IXGBE_QV_STATE_POLL	   2     /* poll owns this QV */
+#define IXGBE_QV_STATE_DISABLED	   4     /* QV is disabled */
+#define IXGBE_QV_OWNED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
+#define IXGBE_QV_LOCKED (IXGBE_QV_OWNED | IXGBE_QV_STATE_DISABLED)
+#define IXGBE_QV_STATE_NAPI_YIELD  8     /* NAPI yielded this QV */
+#define IXGBE_QV_STATE_POLL_YIELD  16    /* poll yielded this QV */
 #define IXGBE_QV_YIELD (IXGBE_QV_STATE_NAPI_YIELD | IXGBE_QV_STATE_POLL_YIELD)
 #define IXGBE_QV_USER_PEND (IXGBE_QV_STATE_POLL | IXGBE_QV_STATE_POLL_YIELD)
 	spinlock_t lock;
@@ -394,7 +396,7 @@ static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
 static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
 {
 	int rc = true;
-	spin_lock(&q_vector->lock);
+	spin_lock_bh(&q_vector->lock);
 	if (q_vector->state & IXGBE_QV_LOCKED) {
 		WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
 		q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
@@ -405,7 +407,7 @@ static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
 	} else
 		/* we don't care if someone yielded */
 		q_vector->state = IXGBE_QV_STATE_NAPI;
-	spin_unlock(&q_vector->lock);
+	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
 
@@ -413,14 +415,15 @@ static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
 static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
 {
 	int rc = false;
-	spin_lock(&q_vector->lock);
+	spin_lock_bh(&q_vector->lock);
 	WARN_ON(q_vector->state & (IXGBE_QV_STATE_POLL |
 			       IXGBE_QV_STATE_NAPI_YIELD));
 
 	if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
 		rc = true;
-	q_vector->state = IXGBE_QV_STATE_IDLE;
-	spin_unlock(&q_vector->lock);
+	/* will reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBE_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
 
@@ -451,7 +454,8 @@ static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
 
 	if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
 		rc = true;
-	q_vector->state = IXGBE_QV_STATE_IDLE;
+	/* will reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBE_QV_STATE_DISABLED;
 	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
@@ -459,9 +463,23 @@ static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
 /* true if a socket is polling, even if it did not get the lock */
 static inline bool ixgbe_qv_busy_polling(struct ixgbe_q_vector *q_vector)
 {
-	WARN_ON(!(q_vector->state & IXGBE_QV_LOCKED));
+	WARN_ON(!(q_vector->state & IXGBE_QV_OWNED));
 	return q_vector->state & IXGBE_QV_USER_PEND;
 }
+
+/* false if QV is currently owned */
+static inline bool ixgbe_qv_disable(struct ixgbe_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if (q_vector->state & IXGBE_QV_OWNED)
+		rc = false;
+	q_vector->state |= IXGBE_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
+
+	return rc;
+}
+
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
 {
@@ -491,6 +509,12 @@ static inline bool ixgbe_qv_busy_polling(struct ixgbe_q_vector *q_vector)
 {
 	return false;
 }
+
+static inline bool ixgbe_qv_disable(struct ixgbe_q_vector *q_vector)
+{
+	return true;
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 #ifdef CONFIG_IXGBE_HWMON
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ce3eb60..ee90dfb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3891,15 +3891,13 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 {
 	int q_idx;
 
-	local_bh_disable(); /* for ixgbe_qv_lock_napi() */
 	for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
 		napi_disable(&adapter->q_vector[q_idx]->napi);
-		while (!ixgbe_qv_lock_napi(adapter->q_vector[q_idx])) {
+		while (!ixgbe_qv_disable(adapter->q_vector[q_idx])) {
 			pr_info("QV %d locked\n", q_idx);
-			mdelay(1);
+			usleep_range(1000, 20000);
 		}
 	}
-	local_bh_enable();
 }
 
 #ifdef CONFIG_IXGBE_DCB
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  05/11] ixgbe: remove unnecessary duplication of PCIe bandwidth display
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch removes the unnecessary display of PCIe bandwidth twice. Since the
ixgbe_check_minimum_link does a better job, and ensures accurate detection on
even complex chains, this older check is no longer necessary.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 36 ++++++++++-----------------
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9753c8a..a7d1a1c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7752,29 +7752,6 @@ skip_sriov:
 	if (ixgbe_pcie_from_parent(hw))
 		ixgbe_get_parent_bus_info(adapter);
 
-	/* print bus type/speed/width info */
-	e_dev_info("(PCI Express:%s:%s) %pM\n",
-		   (hw->bus.speed == ixgbe_bus_speed_8000 ? "8.0GT/s" :
-		    hw->bus.speed == ixgbe_bus_speed_5000 ? "5.0GT/s" :
-		    hw->bus.speed == ixgbe_bus_speed_2500 ? "2.5GT/s" :
-		    "Unknown"),
-		   (hw->bus.width == ixgbe_bus_width_pcie_x8 ? "Width x8" :
-		    hw->bus.width == ixgbe_bus_width_pcie_x4 ? "Width x4" :
-		    hw->bus.width == ixgbe_bus_width_pcie_x1 ? "Width x1" :
-		    "Unknown"),
-		   netdev->dev_addr);
-
-	err = ixgbe_read_pba_string_generic(hw, part_str, IXGBE_PBANUM_LENGTH);
-	if (err)
-		strncpy(part_str, "Unknown", IXGBE_PBANUM_LENGTH);
-	if (ixgbe_is_sfp(hw) && hw->phy.sfp_type != ixgbe_sfp_type_not_present)
-		e_dev_info("MAC: %d, PHY: %d, SFP+: %d, PBA No: %s\n",
-			   hw->mac.type, hw->phy.type, hw->phy.sfp_type,
-		           part_str);
-	else
-		e_dev_info("MAC: %d, PHY: %d, PBA No: %s\n",
-			   hw->mac.type, hw->phy.type, part_str);
-
 	/* calculate the expected PCIe bandwidth required for optimal
 	 * performance. Note that some older parts will never have enough
 	 * bandwidth due to being older generation PCIe parts. We clamp these
@@ -7790,6 +7767,19 @@ skip_sriov:
 	}
 	ixgbe_check_minimum_link(adapter, expected_gts);
 
+	err = ixgbe_read_pba_string_generic(hw, part_str, IXGBE_PBANUM_LENGTH);
+	if (err)
+		strncpy(part_str, "Unknown", IXGBE_PBANUM_LENGTH);
+	if (ixgbe_is_sfp(hw) && hw->phy.sfp_type != ixgbe_sfp_type_not_present)
+		e_dev_info("MAC: %d, PHY: %d, SFP+: %d, PBA No: %s\n",
+			   hw->mac.type, hw->phy.type, hw->phy.sfp_type,
+		           part_str);
+	else
+		e_dev_info("MAC: %d, PHY: %d, PBA No: %s\n",
+			   hw->mac.type, hw->phy.type, part_str);
+
+	e_dev_info("%pM\n", netdev->dev_addr);
+
 	/* reset the hardware with the new settings */
 	err = hw->mac.ops.start_hw(hw);
 	if (err == IXGBE_ERR_EEPROM_VERSION) {
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  06/11] ixgbevf: add ixgbevf_rx_skb
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch adds ixgbevf_rx_skb in line with how ixgbe handles the variations on
how packets can be received. It will be extended in a following patch for
CONFIG_NET_RX_BUSY_POLL support.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 87279c8a..a2cc6bb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -300,6 +300,20 @@ static void ixgbevf_receive_skb(struct ixgbevf_q_vector *q_vector,
 }
 
 /**
+ * ixgbevf_rx_skb - Helper function to determine proper Rx method
+ * @q_vector: structure containing interrupt and ring information
+ * @skb: packet to send up
+ * @status: hardware indication of status of receive
+ * @rx_desc: rx descriptor
+ **/
+static void ixgbevf_rx_skb(struct ixgbevf_q_vector *q_vector,
+			   struct sk_buff *skb, u8 status,
+			   union ixgbe_adv_rx_desc *rx_desc)
+{
+	ixgbevf_receive_skb(q_vector, skb, status, rx_desc);
+}
+
+/**
  * ixgbevf_rx_checksum - indicate in skb if hw indicated a good cksum
  * @ring: pointer to Rx descriptor ring structure
  * @status_err: hardware indication of status of receive
@@ -494,7 +508,7 @@ static bool ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 			goto next_desc;
 		}
 
-		ixgbevf_receive_skb(q_vector, skb, staterr, rx_desc);
+		ixgbevf_rx_skb(q_vector, skb, staterr, rx_desc);
 
 next_desc:
 		rx_desc->wb.upper.status_error = 0;
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  07/11] ixgbevf: have clean_rx_irq return total_rx_packets cleaned
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

Rather than return true/false indicating whether there was budget left, return
the total packets cleaned. This currently has no use, but will be used in a
following patch which enables CONFIG_NET_RX_BUSY_POLL support in order to track
how many packets were cleaned during the busy poll as part of the extended
statistics.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a2cc6bb..06e29bd 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -410,9 +410,9 @@ static inline void ixgbevf_irq_enable_queues(struct ixgbevf_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, qmask);
 }
 
-static bool ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
-				 struct ixgbevf_ring *rx_ring,
-				 int budget)
+static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
+				struct ixgbevf_ring *rx_ring,
+				int budget)
 {
 	struct ixgbevf_adapter *adapter = q_vector->adapter;
 	struct pci_dev *pdev = adapter->pdev;
@@ -540,7 +540,7 @@ next_desc:
 	q_vector->rx.total_packets += total_rx_packets;
 	q_vector->rx.total_bytes += total_rx_bytes;
 
-	return !!budget;
+	return total_rx_packets;
 }
 
 /**
@@ -572,8 +572,9 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 
 	adapter->flags |= IXGBE_FLAG_IN_NETPOLL;
 	ixgbevf_for_each_ring(ring, q_vector->rx)
-		clean_complete &= ixgbevf_clean_rx_irq(q_vector, ring,
-						       per_ring_budget);
+		clean_complete &= (ixgbevf_clean_rx_irq(q_vector, ring,
+							per_ring_budget)
+				   < per_ring_budget);
 	adapter->flags &= ~IXGBE_FLAG_IN_NETPOLL;
 
 	/* If all work not completed, return budget and keep polling */
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  08/11] ixgbevf: implement CONFIG_NET_RX_BUSY_POLL
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch enables CONFIG_NET_RX_BUSY_POLL support in the VF code. This enables
sockets which have enabled the SO_BUSY_POLL socket option to use the
ndo_busy_poll_recv operation which could result in lower latency, at the cost
of higher CPU utilization, and increased power usage. This support is similar
to how the ixgbe driver works.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 109 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  68 ++++++++++++++
 2 files changed, 177 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index d7837dcc..85c7560 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -38,6 +38,10 @@
 
 #include "vf.h"
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+#include <net/busy_poll.h>
+#endif
+
 /* wrapper around a pointer to a socket buffer,
  * so a DMA handle can be stored along with the buffer */
 struct ixgbevf_tx_buffer {
@@ -145,7 +149,112 @@ struct ixgbevf_q_vector {
 	struct napi_struct napi;
 	struct ixgbevf_ring_container rx, tx;
 	char name[IFNAMSIZ + 9];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned int state;
+#define IXGBEVF_QV_STATE_IDLE		0
+#define IXGBEVF_QV_STATE_NAPI		1    /* NAPI owns this QV */
+#define IXGBEVF_QV_STATE_POLL		2    /* poll owns this QV */
+#define IXGBEVF_QV_STATE_DISABLED	4    /* QV is disabled */
+#define IXGBEVF_QV_OWNED (IXGBEVF_QV_STATE_NAPI | IXGBEVF_QV_STATE_POLL)
+#define IXGBEVF_QV_LOCKED (IXGBEVF_QV_OWNED | IXGBEVF_QV_STATE_DISABLED)
+#define IXGBEVF_QV_STATE_NAPI_YIELD	8    /* NAPI yielded this QV */
+#define IXGBEVF_QV_STATE_POLL_YIELD	16   /* poll yielded this QV */
+#define IXGBEVF_QV_YIELD (IXGBEVF_QV_STATE_NAPI_YIELD | IXGBEVF_QV_STATE_POLL_YIELD)
+#define IXGBEVF_QV_USER_PEND (IXGBEVF_QV_STATE_POLL | IXGBEVF_QV_STATE_POLL_YIELD)
+	spinlock_t lock;
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 };
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline void ixgbevf_qv_init_lock(struct ixgbevf_q_vector *q_vector)
+{
+
+	spin_lock_init(&q_vector->lock);
+	q_vector->state = IXGBEVF_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline bool ixgbevf_qv_lock_napi(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if (q_vector->state & IXGBEVF_QV_LOCKED) {
+		WARN_ON(q_vector->state & IXGBEVF_QV_STATE_NAPI);
+		q_vector->state |= IXGBEVF_QV_STATE_NAPI_YIELD;
+		rc = false;
+	} else {
+		/* we don't care if someone yielded */
+		q_vector->state = IXGBEVF_QV_STATE_NAPI;
+	}
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* returns true is someone tried to get the qv while napi had it */
+static inline bool ixgbevf_qv_unlock_napi(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = false;
+	spin_lock_bh(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBEVF_QV_STATE_POLL |
+				   IXGBEVF_QV_STATE_NAPI_YIELD));
+
+	if (q_vector->state & IXGBEVF_QV_STATE_POLL_YIELD)
+		rc = true;
+	/* reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBEVF_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* called from ixgbevf_low_latency_poll() */
+static inline bool ixgbevf_qv_lock_poll(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if ((q_vector->state & IXGBEVF_QV_LOCKED)) {
+		q_vector->state |= IXGBEVF_QV_STATE_POLL_YIELD;
+		rc = false;
+	} else {
+		/* preserve yield marks */
+		q_vector->state |= IXGBEVF_QV_STATE_POLL;
+	}
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline bool ixgbevf_qv_unlock_poll(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = false;
+	spin_lock_bh(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBEVF_QV_STATE_NAPI));
+
+	if (q_vector->state & IXGBEVF_QV_STATE_POLL_YIELD)
+		rc = true;
+	/* reset state to idle, unless QV is disabled */
+	q_vector->state &= IXGBEVF_QV_STATE_DISABLED;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline bool ixgbevf_qv_busy_polling(struct ixgbevf_q_vector *q_vector)
+{
+	WARN_ON(!(q_vector->state & IXGBEVF_QV_OWNED));
+	return q_vector->state & IXGBEVF_QV_USER_PEND;
+}
+
+/* false if QV is currently owned */
+static inline bool ixgbevf_qv_disable(struct ixgbevf_q_vector *q_vector)
+{
+	int rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if (q_vector->state & IXGBEVF_QV_OWNED)
+		rc = false;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 
 /*
  * microsecond values for various ITR rates shifted by 2 to fit itr register
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 06e29bd..bc03a2e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -310,6 +310,16 @@ static void ixgbevf_rx_skb(struct ixgbevf_q_vector *q_vector,
 			   struct sk_buff *skb, u8 status,
 			   union ixgbe_adv_rx_desc *rx_desc)
 {
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	skb_mark_napi_id(skb, &q_vector->napi);
+
+	if (ixgbevf_qv_busy_polling(q_vector)) {
+		netif_receive_skb(skb);
+		/* exit early if we busy polled */
+		return;
+	}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 	ixgbevf_receive_skb(q_vector, skb, status, rx_desc);
 }
 
@@ -563,6 +573,11 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 	ixgbevf_for_each_ring(ring, q_vector->tx)
 		clean_complete &= ixgbevf_clean_tx_irq(q_vector, ring);
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (!ixgbevf_qv_lock_napi(q_vector))
+		return budget;
+#endif
+
 	/* attempt to distribute budget to each queue fairly, but don't allow
 	 * the budget to go below 1 because we'll exit polling */
 	if (q_vector->rx.count > 1)
@@ -577,6 +592,10 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 				   < per_ring_budget);
 	adapter->flags &= ~IXGBE_FLAG_IN_NETPOLL;
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	ixgbevf_qv_unlock_napi(q_vector);
+#endif
+
 	/* If all work not completed, return budget and keep polling */
 	if (!clean_complete)
 		return budget;
@@ -611,6 +630,34 @@ void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector)
 	IXGBE_WRITE_REG(hw, IXGBE_VTEITR(v_idx), itr_reg);
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/* must be called with local_bh_disable()d */
+static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
+{
+	struct ixgbevf_q_vector *q_vector =
+			container_of(napi, struct ixgbevf_q_vector, napi);
+	struct ixgbevf_adapter *adapter = q_vector->adapter;
+	struct ixgbevf_ring  *ring;
+	int found = 0;
+
+	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
+		return LL_FLUSH_FAILED;
+
+	if (!ixgbevf_qv_lock_poll(q_vector))
+		return LL_FLUSH_BUSY;
+
+	ixgbevf_for_each_ring(ring, q_vector->rx) {
+		found = ixgbevf_clean_rx_irq(q_vector, ring, 4);
+		if (found)
+			break;
+	}
+
+	ixgbevf_qv_unlock_poll(q_vector);
+
+	return found;
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 /**
  * ixgbevf_configure_msix - Configure MSI-X hardware
  * @adapter: board private structure
@@ -1297,6 +1344,9 @@ static void ixgbevf_napi_enable_all(struct ixgbevf_adapter *adapter)
 
 	for (q_idx = 0; q_idx < q_vectors; q_idx++) {
 		q_vector = adapter->q_vector[q_idx];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		ixgbevf_qv_init_lock(adapter->q_vector[q_idx]);
+#endif
 		napi_enable(&q_vector->napi);
 	}
 }
@@ -1310,6 +1360,12 @@ static void ixgbevf_napi_disable_all(struct ixgbevf_adapter *adapter)
 	for (q_idx = 0; q_idx < q_vectors; q_idx++) {
 		q_vector = adapter->q_vector[q_idx];
 		napi_disable(&q_vector->napi);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		while (!ixgbevf_qv_disable(adapter->q_vector[q_idx])) {
+			pr_info("QV %d locked\n", q_idx);
+			usleep_range(1000, 20000);
+		}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
 	}
 }
 
@@ -1960,6 +2016,9 @@ static int ixgbevf_alloc_q_vectors(struct ixgbevf_adapter *adapter)
 		q_vector->v_idx = q_idx;
 		netif_napi_add(adapter->netdev, &q_vector->napi,
 			       ixgbevf_poll, 64);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_add(&q_vector->napi);
+#endif
 		adapter->q_vector[q_idx] = q_vector;
 	}
 
@@ -1969,6 +2028,9 @@ err_out:
 	while (q_idx) {
 		q_idx--;
 		q_vector = adapter->q_vector[q_idx];
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_del(&q_vector->napi);
+#endif
 		netif_napi_del(&q_vector->napi);
 		kfree(q_vector);
 		adapter->q_vector[q_idx] = NULL;
@@ -1992,6 +2054,9 @@ static void ixgbevf_free_q_vectors(struct ixgbevf_adapter *adapter)
 		struct ixgbevf_q_vector *q_vector = adapter->q_vector[q_idx];
 
 		adapter->q_vector[q_idx] = NULL;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		napi_hash_del(&q_vector->napi);
+#endif
 		netif_napi_del(&q_vector->napi);
 		kfree(q_vector);
 	}
@@ -3323,6 +3388,9 @@ static const struct net_device_ops ixgbevf_netdev_ops = {
 	.ndo_tx_timeout		= ixgbevf_tx_timeout,
 	.ndo_vlan_rx_add_vid	= ixgbevf_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= ixgbevf_vlan_rx_kill_vid,
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	.ndo_busy_poll		= ixgbevf_busy_poll_recv,
+#endif
 };
 
 static void ixgbevf_assign_netdev_ops(struct net_device *dev)
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  09/11] ixgbevf: add BP_EXTENDED_STATS for CONFIG_NET_RX_BUSY_POLL
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch adds the extended statistics similar to the ixgbe driver. These
statistics keep track of how often the busy polling yields, as well as how many
packets are cleaned or missed by the polling routine.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ethtool.c      | 32 +++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 22 ++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  6 +++++
 3 files changed, 60 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ethtool.c b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
index 21adb1b..44f7c42 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
@@ -74,6 +74,14 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 						zero_base)},
 	{"tx_csum_offload_ctxt", IXGBEVF_STAT(hw_csum_tx_good, zero_base,
 					      zero_base)},
+#ifdef BP_EXTENDED_STATS
+	{"rx_bp_poll_yield", IXGBEVF_STAT(bp_rx_yields, zero_base, zero_base)},
+	{"rx_bp_cleaned", IXGBEVF_STAT(bp_rx_cleaned, zero_base, zero_base)},
+	{"rx_bp_misses", IXGBEVF_STAT(bp_rx_missed, zero_base, zero_base)},
+	{"tx_bp_napi_yield", IXGBEVF_STAT(bp_tx_yields, zero_base, zero_base)},
+	{"tx_bp_cleaned", IXGBEVF_STAT(bp_tx_cleaned, zero_base, zero_base)},
+	{"tx_bp_misses", IXGBEVF_STAT(bp_tx_missed, zero_base, zero_base)},
+#endif
 };
 
 #define IXGBE_QUEUE_STATS_LEN 0
@@ -391,6 +399,30 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 {
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
 	int i;
+#ifdef BP_EXTENDED_STATS
+	u64 rx_yields = 0, rx_cleaned = 0, rx_missed = 0,
+	    tx_yields = 0, tx_cleaned = 0, tx_missed = 0;
+
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		rx_yields += adapter->rx_ring[i].bp_yields;
+		rx_cleaned += adapter->rx_ring[i].bp_cleaned;
+		rx_yields += adapter->rx_ring[i].bp_yields;
+	}
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		tx_yields += adapter->tx_ring[i].bp_yields;
+		tx_cleaned += adapter->tx_ring[i].bp_cleaned;
+		tx_yields += adapter->tx_ring[i].bp_yields;
+	}
+
+	adapter->bp_rx_yields = rx_yields;
+	adapter->bp_rx_cleaned = rx_cleaned;
+	adapter->bp_rx_missed = rx_missed;
+
+	adapter->bp_tx_yields = tx_yields;
+	adapter->bp_tx_cleaned = tx_cleaned;
+	adapter->bp_tx_missed = tx_missed;
+#endif
 
 	ixgbevf_update_stats(adapter);
 	for (i = 0; i < IXGBE_GLOBAL_STATS_LEN; i++) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 85c7560..78e4c51 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -40,6 +40,7 @@
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 #include <net/busy_poll.h>
+#define BP_EXTENDED_STATS
 #endif
 
 /* wrapper around a pointer to a socket buffer,
@@ -80,6 +81,11 @@ struct ixgbevf_ring {
 	struct u64_stats_sync	syncp;
 	u64 hw_csum_rx_error;
 	u64 hw_csum_rx_good;
+#ifdef BP_EXTENDED_STATS
+	u64 bp_yields;
+	u64 bp_misses;
+	u64 bp_cleaned;
+#endif
 
 	u16 head;
 	u16 tail;
@@ -181,6 +187,9 @@ static inline bool ixgbevf_qv_lock_napi(struct ixgbevf_q_vector *q_vector)
 		WARN_ON(q_vector->state & IXGBEVF_QV_STATE_NAPI);
 		q_vector->state |= IXGBEVF_QV_STATE_NAPI_YIELD;
 		rc = false;
+#ifdef BP_EXTENDED_STATS
+		q_vector->tx.ring->bp_yields++;
+#endif
 	} else {
 		/* we don't care if someone yielded */
 		q_vector->state = IXGBEVF_QV_STATE_NAPI;
@@ -213,6 +222,9 @@ static inline bool ixgbevf_qv_lock_poll(struct ixgbevf_q_vector *q_vector)
 	if ((q_vector->state & IXGBEVF_QV_LOCKED)) {
 		q_vector->state |= IXGBEVF_QV_STATE_POLL_YIELD;
 		rc = false;
+#ifdef BP_EXTENDED_STATS
+		q_vector->rx.ring->bp_yields++;
+#endif
 	} else {
 		/* preserve yield marks */
 		q_vector->state |= IXGBEVF_QV_STATE_POLL;
@@ -358,6 +370,16 @@ struct ixgbevf_adapter {
 	unsigned int tx_ring_count;
 	unsigned int rx_ring_count;
 
+#ifdef BP_EXTENDED_STATS
+	u64 bp_rx_yields;
+	u64 bp_rx_cleaned;
+	u64 bp_rx_missed;
+
+	u64 bp_tx_yields;
+	u64 bp_tx_cleaned;
+	u64 bp_tx_missed;
+#endif
+
 	u32 link_speed;
 	bool link_up;
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index bc03a2e..b16b694 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -648,6 +648,12 @@ static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
 
 	ixgbevf_for_each_ring(ring, q_vector->rx) {
 		found = ixgbevf_clean_rx_irq(q_vector, ring, 4);
+#ifdef BP_EXTENDED_STATS
+		if (found)
+			ring->bp_cleaned += found;
+		else
+			ring->bp_misses++;
+#endif
 		if (found)
 			break;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  10/11] ixgbevf: Add zero_base handler to network statistics
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Alexander Duyck,
	Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

This patch removes the need to keep a zero_base variable in the adapter
structure. Now we just use two different macros to set the non-zero and
zero base. This adds to readability and shortens some of the structure
initialization under 80 columns. The gathering of status for ethtool was
slightly modified to again better fit into 80 columns and become a bit
more readable.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ethtool.c | 78 ++++++++++++++++------------
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h |  1 -
 2 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ethtool.c b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
index 44f7c42..54d9ace 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ethtool.c
@@ -45,16 +45,27 @@
 
 struct ixgbe_stats {
 	char stat_string[ETH_GSTRING_LEN];
-	int sizeof_stat;
-	int stat_offset;
-	int base_stat_offset;
-	int saved_reset_offset;
+	struct {
+		int sizeof_stat;
+		int stat_offset;
+		int base_stat_offset;
+		int saved_reset_offset;
+	};
 };
 
-#define IXGBEVF_STAT(m, b, r)  sizeof(((struct ixgbevf_adapter *)0)->m), \
-			    offsetof(struct ixgbevf_adapter, m),         \
-			    offsetof(struct ixgbevf_adapter, b),         \
-			    offsetof(struct ixgbevf_adapter, r)
+#define IXGBEVF_STAT(m, b, r) { \
+	.sizeof_stat = FIELD_SIZEOF(struct ixgbevf_adapter, m), \
+	.stat_offset = offsetof(struct ixgbevf_adapter, m), \
+	.base_stat_offset = offsetof(struct ixgbevf_adapter, b), \
+	.saved_reset_offset = offsetof(struct ixgbevf_adapter, r) \
+}
+
+#define IXGBEVF_ZSTAT(m) { \
+	.sizeof_stat = FIELD_SIZEOF(struct ixgbevf_adapter, m), \
+	.stat_offset = offsetof(struct ixgbevf_adapter, m), \
+	.base_stat_offset = -1, \
+	.saved_reset_offset = -1 \
+}
 
 static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 	{"rx_packets", IXGBEVF_STAT(stats.vfgprc, stats.base_vfgprc,
@@ -65,22 +76,19 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 				  stats.saved_reset_vfgorc)},
 	{"tx_bytes", IXGBEVF_STAT(stats.vfgotc, stats.base_vfgotc,
 				  stats.saved_reset_vfgotc)},
-	{"tx_busy", IXGBEVF_STAT(tx_busy, zero_base, zero_base)},
+	{"tx_busy", IXGBEVF_ZSTAT(tx_busy)},
 	{"multicast", IXGBEVF_STAT(stats.vfmprc, stats.base_vfmprc,
 				   stats.saved_reset_vfmprc)},
-	{"rx_csum_offload_good", IXGBEVF_STAT(hw_csum_rx_good, zero_base,
-					      zero_base)},
-	{"rx_csum_offload_errors", IXGBEVF_STAT(hw_csum_rx_error, zero_base,
-						zero_base)},
-	{"tx_csum_offload_ctxt", IXGBEVF_STAT(hw_csum_tx_good, zero_base,
-					      zero_base)},
+	{"rx_csum_offload_good", IXGBEVF_ZSTAT(hw_csum_rx_good)},
+	{"rx_csum_offload_errors", IXGBEVF_ZSTAT(hw_csum_rx_error)},
+	{"tx_csum_offload_ctxt", IXGBEVF_ZSTAT(hw_csum_tx_good)},
 #ifdef BP_EXTENDED_STATS
-	{"rx_bp_poll_yield", IXGBEVF_STAT(bp_rx_yields, zero_base, zero_base)},
-	{"rx_bp_cleaned", IXGBEVF_STAT(bp_rx_cleaned, zero_base, zero_base)},
-	{"rx_bp_misses", IXGBEVF_STAT(bp_rx_missed, zero_base, zero_base)},
-	{"tx_bp_napi_yield", IXGBEVF_STAT(bp_tx_yields, zero_base, zero_base)},
-	{"tx_bp_cleaned", IXGBEVF_STAT(bp_tx_cleaned, zero_base, zero_base)},
-	{"tx_bp_misses", IXGBEVF_STAT(bp_tx_missed, zero_base, zero_base)},
+	{"rx_bp_poll_yield", IXGBEVF_ZSTAT(bp_rx_yields)},
+	{"rx_bp_cleaned", IXGBEVF_ZSTAT(bp_rx_cleaned)},
+	{"rx_bp_misses", IXGBEVF_ZSTAT(bp_rx_missed)},
+	{"tx_bp_napi_yield", IXGBEVF_ZSTAT(bp_tx_yields)},
+	{"tx_bp_cleaned", IXGBEVF_ZSTAT(bp_tx_cleaned)},
+	{"tx_bp_misses", IXGBEVF_ZSTAT(bp_tx_missed)},
 #endif
 };
 
@@ -398,6 +406,7 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 				      struct ethtool_stats *stats, u64 *data)
 {
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
+	char *base = (char *) adapter;
 	int i;
 #ifdef BP_EXTENDED_STATS
 	u64 rx_yields = 0, rx_cleaned = 0, rx_missed = 0,
@@ -426,18 +435,21 @@ static void ixgbevf_get_ethtool_stats(struct net_device *netdev,
 
 	ixgbevf_update_stats(adapter);
 	for (i = 0; i < IXGBE_GLOBAL_STATS_LEN; i++) {
-		char *p = (char *)adapter +
-			ixgbe_gstrings_stats[i].stat_offset;
-		char *b = (char *)adapter +
-			ixgbe_gstrings_stats[i].base_stat_offset;
-		char *r = (char *)adapter +
-			ixgbe_gstrings_stats[i].saved_reset_offset;
-		data[i] = ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)p : *(u32 *)p) -
-			  ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)b : *(u32 *)b) +
-			  ((ixgbe_gstrings_stats[i].sizeof_stat ==
-			    sizeof(u64)) ? *(u64 *)r : *(u32 *)r);
+		char *p = base + ixgbe_gstrings_stats[i].stat_offset;
+		char *b = base + ixgbe_gstrings_stats[i].base_stat_offset;
+		char *r = base + ixgbe_gstrings_stats[i].saved_reset_offset;
+
+		if (ixgbe_gstrings_stats[i].sizeof_stat == sizeof(u64)) {
+			if (ixgbe_gstrings_stats[i].base_stat_offset >= 0)
+				data[i] = *(u64 *)p - *(u64 *)b + *(u64 *)r;
+			else
+				data[i] = *(u64 *)p;
+		} else {
+			if (ixgbe_gstrings_stats[i].base_stat_offset >= 0)
+				data[i] = *(u32 *)p - *(u32 *)b + *(u32 *)r;
+			else
+				data[i] = *(u32 *)p;
+		}
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 78e4c51..acf3806 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -361,7 +361,6 @@ struct ixgbevf_adapter {
 	struct ixgbe_hw hw;
 	u16 msg_enable;
 	struct ixgbevf_hw_stats stats;
-	u64 zero_base;
 	/* Interrupt Throttle Rate */
 	u32 eitr_param;
 
-- 
1.8.3.1

^ permalink raw reply related

* [net-next  11/11] i40e: fix error return code in i40e_probe()
From: Jeff Kirsher @ 2013-10-29 12:02 UTC (permalink / raw)
  To: davem; +Cc: Wei Yongjun, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1383048151-15002-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Fix to return -ENOMEM in the memory alloc error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 41a79df..be15938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7204,8 +7204,10 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	 */
 	len = sizeof(struct i40e_vsi *) * pf->hw.func_caps.num_vsis;
 	pf->vsi = kzalloc(len, GFP_KERNEL);
-	if (!pf->vsi)
+	if (!pf->vsi) {
+		err = -ENOMEM;
 		goto err_switch_setup;
+	}
 
 	err = i40e_setup_pf_switch(pf);
 	if (err) {
-- 
1.8.3.1

^ permalink raw reply related

* Re: IPV6 nf defrag does not work
From: Jiri Pirko @ 2013-10-29 12:03 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, pablo, netfilter-devel, yoshfuji, kadlec, kaber
In-Reply-To: <20131029115617.GA16615@breakpoint.cc>

Tue, Oct 29, 2013 at 12:56:17PM CET, fw@strlen.de wrote:
>Jiri Pirko <jiri@resnulli.us> wrote:
>> On the current net-next if you on HOSTA do:
>> ip6tables -I INPUT -p icmpv6 -j DROP
>> ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
>> 
>> and on HOSTB you do:
>> ping6 HOSTA -s2000    (MTU is 1500)
>> 
>> Only the first ICMP echo request will be passed through, the rest is not
>> passed on HOSTA. This issue does not occur with smaller packets than MTU (where
>> fragmentation does not happen).
>>
>> I'm trying to find out where the problem is.
>
>Are you sure this is new behaviour? As far back as I can remember
>it was always like this.

Yes. This is not new.

>
>in ip6tables, the individual fragments are sent through the ruleset,
>iow. you'll need to make use of '-m conntrack' to match the fragments
>belonging to an existing connection.

Hmm. I think that it is not correct to force user (iptables user) to
make dirrerent rules because some ipv6 packets might be fragmented.
This should be handled in kernel.

>
>I don't know why this is, and I don't like this either.
>But this is how it was implemented, see
>
>net/ipv6/netfilter/nf_defrag_ipv6_hooks.c, ipv6_defrag() ->
>nf_ct_frag6_output()

Yep. I'm studying the code atm.



^ permalink raw reply

* Re: [PATCH net-next] ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
From: Hannes Frederic Sowa @ 2013-10-29 12:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fweimer
In-Reply-To: <20131029.000844.1092862708536984032.davem@davemloft.net>

On Tue, Oct 29, 2013 at 12:08:44AM -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Sat, 26 Oct 2013 22:11:58 +0200
> 
> > Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
> > their sockets won't accept and install new path mtu information and they
> > will always use the interface mtu for outgoing packets. It is guaranteed
> > that the packet is not fragmented locally. But we won't set the DF-Flag
> > on the outgoing frames.
> > 
> > Florian Weimer had the idea to use this flag to ensure DNS servers are
> > never generating outgoing fragments. They may well be fragmented on the
> > path, but the server never stores or usees path mtu values, which could
> > well be forged in an attack.
> > 
> > (The root of the problem with path MTU discovery is that there is
> > no reliable way to authenticate ICMP Fragmentation Needed But DF Set
> > messages because they are sent from intermediate routers with their
> > source addresses, and the IMCP payload will not always contain sufficient
> > information to identify a flow.)
> 
> I do not like this reasoning.  You have several more acceptable paths to take
> to resolve this problem:

So, I hope I can convince you to merge this patch. ;-)

I really tried hard to find alternatives or even a way to enable
the protection automatically given that at least unbound does apply
IP_PMTUDISC_DONT to its sockets already. These are the reasons why I
came up with the new IP_PMTUDISC_INTERFACE value:

> 1) "I don't trust path MTU information at all"
> 
>    Just turn it off globally, end of story.  It has the same effect as your
>    new per-application mode.

I think you are referring to ipv4/ip_no_pmtu_disc?

This setting actually worsens the situation:

0.000 `echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc`
0.500 `ethtool -K tun0 ufo off`

/* check current mtu with connected socket */
1.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
1.000 connect(3, ..., ...) = 0
1.000 getsockopt(3, IPPROTO_IP, IP_MTU, [1500], [4]) = 0
1.000 close(3) = 0

/* inject frag_needed to unconnected socket */
2.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
2.000 setsockopt(3, IPPROTO_IP, IP_MTU_DISCOVER, [IP_PMTUDISC_DONT], 4) = 0
2.000 sendto(3, ..., 1000, 0, ..., ...) = 1000
2.000 > udp (1000)
/* this simulates a forged icmp packet which tries to lower the mtu */
2.100 < [udp (1000)] icmp unreachable frag_needed mtu 1400

2.5 sendto(3, ..., 1000, 0, ..., ...) = 1000
/* first fragment splitted at 522 bytes */
2.5 > udp(522)
/* didn't find a way to match for the second fragment - so error out here
 * comment out this check to check for the mtu a IP_PMTUDISC_DONT socket now has
 *
 * it really is used, otherwise we would not have seen this fragment
 */

3.000 close(3) = 0

/* verify mtu again */
3.000 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 3
3.000 connect(3, ..., ...) = 0
/* we actually have 522 as the minimal mtu - so error out here */
3.000 getsockopt(3, IPPROTO_IP, IP_MTU, [1400], [4]) = 0
3.000 close(3) = 0

5.000 `echo 0 > /proc/sys/net/ipv4/ip_no_pmtu_disc`

Given this packetdrill snippet we see it is actually very
counterproductive to tell people to globally disable pmtu discovery as
it will certainly generate more fragments.

The reason for this is, that __ip_rt_update_pmtu is called with zero mtu
and will get clamped to min_pmtu. It is possible to tell people to also
increase min_pmtu but I don't want to take the risk of breaking other
applications on a DNS server given how the current semantics are.

If we receive a pmtu event on a TCP or DCCP socket we actually
check for the pmtudisc setting and won't apply it if it is set to
IP_PMTUDISC_DONT. This is not checked on UDP and RAW sockets and it was
very hard to think about the consequences to introduce such checks now
(at least for me).

Adding a new sysctl would be a viable option but it is harder to use
by DNS server implementations and could be fiddled with by users or
scripts etc.  I dislike a globel (or even per namespace) setting. Also
Fedora e.g. installs an unbound server on Desktops if one installs
the dnssec-trigger package.  Maybe we will see more DNS servers on
workstations where we don't want to globally disable pmtu.

> 2) "I don't trust path MTU information unless the full socket ID is available
>     in the ICMP packets quoted headers"
> 
>    Then simply implement a policy as such and submit it to me.

We don't trust the MTU information in any case for that socket.

It seems easy to inject wrong path mtu information on a socket which will
be installed as a globel fnhe for 600 seconds. This is a very viable
target if an attacker is able to do cache poisoning afterwards. There
is not much protection even if we match the full socket ID on an
UDP socket. As this socket is more exposed to attacks we drop the mtu
information there. Of course, other ports could also install wrong pmtu
information causing packets being fragmented so we always use the path
mtu and don't fragment locally.  Dropping the pmtu information on these
sockets just makes it more symmetrical and causes less stress on the fnhe
hash tables if people do these attacks (they could well be running these
attacks in parallel). If the DNS port is the only one reachable externally
we won't do any pmtu processing on that server from untrusted peers, which is
a nice side effect.

All that said, I hope dns servers provide a setting to apply this option
on sockets and fail to start up if it could not be applied. So an
administrator is not in some sense of false hope that this protection
is enabled when running on an older kernel. This could only be done if we
provide a new value for IP_MTU_DISCOVER.

Greetings,

  Hannes

^ permalink raw reply

* Re: [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: Erik Hugne @ 2013-10-29 12:07 UTC (permalink / raw)
  To: David Laight; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73BC@saturn3.aculab.com>

On Tue, Oct 29, 2013 at 11:13:09AM +0000, David Laight wrote:
> I can only see one 'goto cont', an explicit kfree_skb() and
> continue would be clearer.

There's actually a few more 'goto cont' in the early error checks
in this function causes it to bail early. Those are not visible
in the patchdiff however.

//E

------------------------------------------------------------------------------
Android is increasing in popularity, but the open development platform that
developers love is also attractive to malware creators. Download this white
paper to learn more about secure code signing practices that can help keep
Android apps secure.
http://pubads.g.doubleclick.net/gampad/clk?id=65839951&iu=/4140/ostg.clktrk

^ permalink raw reply

* Re: [PATCH net-next 0/2] Removal of struct esp_data
From: Steffen Klassert @ 2013-10-29 12:11 UTC (permalink / raw)
  To: David Miller; +Cc: mathias.krause, netdev, herbert
In-Reply-To: <20131023.154002.1833051243694861649.davem@davemloft.net>

On Wed, Oct 23, 2013 at 03:40:02PM -0400, David Miller wrote:
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Wed, 23 Oct 2013 10:07:16 +0200
> 
> > Well, I thought to either take this as a reminder to implement the
> > missing stuff or to take the removing patches if this is really obsolete.
> > I'll do one of both once I'm back from conference week in Edinburgh.
> 
> Sounds like a good plan, thanks.

RFC 4303 recommends to use TFC padding instead of the padding field:

'As noted above, the Padding field is limited to 255 bytes in length.
This generally will not be adequate to hide traffic characteristics
relative to traffic flow confidentiality requirements.  An optional
field, within the payload data, is provided specifically to address
the TFC requirement.'

So looks like we can remove the padlen field, both patches
applied to ipsec-next. Thanks Mathias!

^ permalink raw reply

* [PATCH] netfilter fix for net
From: Pablo Neira Ayuso @ 2013-10-29 12:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

This pull request contains the following netfilter fix:

* fix --queue-bypass in xt_NFQUEUE revision 3. While adding the
  revision 3 of this target, the bypass flags were not correctly
  handled anymore, thus, breaking packet bypassing if no application
  is listening from userspace, patch from Holger Eitzenberger,
  reported by Florian Westphal.

You can pull this change from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

Thanks!

----------------------------------------------------------------

The following changes since commit fecda03493646b53f53892fa3c38c75ba9310374:

  net: sctp: fix ASCONF to allow non SCTP_ADDR_SRC addresses in ipv6 (2013-10-23 16:57:14 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

for you to fetch changes up to d954777324ffcba0b2f8119c102237426c654eeb:

  netfilter: xt_NFQUEUE: fix --queue-bypass regression (2013-10-29 13:05:54 +0100)

----------------------------------------------------------------
Holger Eitzenberger (1):
      netfilter: xt_NFQUEUE: fix --queue-bypass regression

 net/netfilter/xt_NFQUEUE.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

^ permalink raw reply

* [PATCH] netfilter: xt_NFQUEUE: fix --queue-bypass regression
From: Pablo Neira Ayuso @ 2013-10-29 12:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1383048796-7964-1-git-send-email-pablo@netfilter.org>

From: Holger Eitzenberger <holger@eitzenberger.org>

V3 of the NFQUEUE target ignores the --queue-bypass flag,
causing packets to be dropped when the userspace listener
isn't running.

Regression is in since 8746ddcf12bb26 ("netfilter: xt_NFQUEUE:
introduce CPU fanout").

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_NFQUEUE.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_NFQUEUE.c b/net/netfilter/xt_NFQUEUE.c
index 1e2fae3..ed00fef 100644
--- a/net/netfilter/xt_NFQUEUE.c
+++ b/net/netfilter/xt_NFQUEUE.c
@@ -147,6 +147,7 @@ nfqueue_tg_v3(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_NFQ_info_v3 *info = par->targinfo;
 	u32 queue = info->queuenum;
+	int ret;
 
 	if (info->queues_total > 1) {
 		if (info->flags & NFQ_FLAG_CPU_FANOUT) {
@@ -157,7 +158,11 @@ nfqueue_tg_v3(struct sk_buff *skb, const struct xt_action_param *par)
 			queue = nfqueue_hash(skb, par);
 	}
 
-	return NF_QUEUE_NR(queue);
+	ret = NF_QUEUE_NR(queue);
+	if (info->flags & NFQ_FLAG_BYPASS)
+		ret |= NF_VERDICT_FLAG_QUEUE_BYPASS;
+
+	return ret;
 }
 
 static struct xt_target nfqueue_tg_reg[] __read_mostly = {
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next] tipc: remove two indentation levels in tipc_recv_msg routine
From: Andreas Bofjäll @ 2013-10-29 12:40 UTC (permalink / raw)
  To: Erik Hugne, David Laight
  Cc: jon.maloy, netdev, tipc-discussion, davem, Paul.Gortmaker
In-Reply-To: <20131029120708.GI14038@eerihug-hybrid.rnd.ki.sw.ericsson.se>

On 10/29/2013 01:07 PM, Erik Hugne wrote:
> On Tue, Oct 29, 2013 at 11:13:09AM +0000, David Laight wrote:
>> I can only see one 'goto cont', an explicit kfree_skb() and
>> continue would be clearer.
>
> There's actually a few more 'goto cont' in the early error checks
> in this function causes it to bail early. Those are not visible
> in the patchdiff however.

If you wanted to make this a bit simpler, you could make two labels and 
move the calls to tipc_node_unlock(n_ptr). The blocks on lines 1545, 
1558 and 1614 (in the patched code) could be replaced by a simple "goto 
unlock" this way.

--- link.c-old	2013-10-29 13:34:35.804926348 +0100
+++ link.c	2013-10-29 13:39:23.991842809 +0100
@@ -1541,10 +1541,8 @@

  		/* Locate unicast link endpoint that should handle message */
  		l_ptr = n_ptr->links[b_ptr->identity];
-		if (unlikely(!l_ptr)) {
-			tipc_node_unlock(n_ptr);
-			goto cont;
-		}
+		if (unlikely(!l_ptr))
+			goto unlock;

  		/* Verify that communication with node is currently allowed */
  		if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
@@ -1554,10 +1552,8 @@
  			!msg_redundant_link(msg))
  			n_ptr->block_setup &= ~WAIT_PEER_DOWN;

-		if (n_ptr->block_setup) {
-			tipc_node_unlock(n_ptr);
-			goto cont;
-		}
+		if (n_ptr->block_setup)
+			goto unlock;

  		/* Validate message sequence number info */
  		seq_no = msg_seqno(msg);
@@ -1611,8 +1607,7 @@
  				tipc_node_unlock(n_ptr);
  				continue;
  			}
-			tipc_node_unlock(n_ptr);
-			goto cont;
+			goto unlock;
  		}

  		/* Link is now in state WORKING_WORKING */
@@ -1681,6 +1676,9 @@
  		tipc_node_unlock(n_ptr);
  		tipc_net_route_msg(buf);
  		continue;
+
+unlock:
+		tipc_node_unlock(n_ptr);
  cont:
  		kfree_skb(buf);
  	}

/Andreas

------------------------------------------------------------------------------
Android is increasing in popularity, but the open development platform that
developers love is also attractive to malware creators. Download this white
paper to learn more about secure code signing practices that can help keep
Android apps secure.
http://pubads.g.doubleclick.net/gampad/clk?id=65839951&iu=/4140/ostg.clktrk

^ permalink raw reply

* Re: [patch net-next] ipv6: allow userspace to create address with IFLA_F_TEMPORARY flag
From: Hannes Frederic Sowa @ 2013-10-29 12:40 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, jiri, vyasevich, netdev, kuznet, jmorris, yoshfuji,
	kaber, thaller, stephen
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73BB@saturn3.aculab.com>

On Tue, Oct 29, 2013 at 09:37:06AM -0000, David Laight wrote:
> > Note that you don't even need to put the DHCP protocol core into the
> > kernel to fix the promiscuous problem.  You just have to use the
> > current kernel interfaces correctly.
> > 
> > It used to be the case a very long time ago that you couldn't even
> > receive broadcast UDP datagrams on a socket until an address was
> > configured on it.
> > 
> > So everyone turns on promiscuous mode and uses RAW sockets or
> > AF_PACKET.
> > 
> > Stupid?  yes.
> 
> Not only that, but the dhcp client could use a normal UDP socket
> to keep the lease renewed - I suspect it has only ever needed
> to use the BPF interface (I didn't think it set promiscuous)
> when acquiring the initial lease.

Yes, this is a very unfortunate situation. From my experience it is not that
easy to get a patch merged into isc-dhcp.

It seems not that invasive to switch from af_packet to an udp socket with
SO_BROADCAST set.

Greetings,

  Hannes

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox