Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 4/7] ixgbe: protect vxlan_get_rx_port in ixgbe_service_task with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev
  Cc: jesse, Jeff Kirsher, Jesse Brandeburg, Shannon Nelson,
	Carolyn Wyborny, Don Skidmore, Bruce Allan, John Ronciak,
	Mitch Williams
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

vxlan_get_rx_port requires rtnl_lock to be held.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Shannon Nelson <shannon.nelson@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2976df77bf14f5..b2f2cf40f06a87 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7192,10 +7192,12 @@ static void ixgbe_service_task(struct work_struct *work)
 		return;
 	}
 #ifdef CONFIG_IXGBE_VXLAN
+	rtnl_lock();
 	if (adapter->flags2 & IXGBE_FLAG2_VXLAN_REREG_NEEDED) {
 		adapter->flags2 &= ~IXGBE_FLAG2_VXLAN_REREG_NEEDED;
 		vxlan_get_rx_port(adapter->netdev);
 	}
+	rtnl_unlock();
 #endif /* CONFIG_IXGBE_VXLAN */
 	ixgbe_reset_subtask(adapter);
 	ixgbe_phy_interrupt_subtask(adapter);
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 3/7] mlx4: protect mlx4_en_start_port in mlx4_en_restart with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev; +Cc: jesse, Eugenia Emantayev, Yishai Hadas
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

mlx4_en_start_port requires rtnl_lock to be held.

Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c8ca47d4..8bd143dda95d11 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1856,6 +1856,7 @@ static void mlx4_en_restart(struct work_struct *work)
 
 	en_dbg(DRV, priv, "Watchdog task called for port %d\n", priv->port);
 
+	rtnl_lock();
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
 		mlx4_en_stop_port(dev, 1);
@@ -1863,6 +1864,7 @@ static void mlx4_en_restart(struct work_struct *work)
 			en_err(priv, "Failed restarting port %d\n", priv->port);
 	}
 	mutex_unlock(&mdev->state_lock);
+	rtnl_unlock();
 }
 
 static void mlx4_en_clear_stats(struct net_device *dev)
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 2/7] fm10k: protect fm10k_open in fm10k_io_resume with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev
  Cc: jesse, Jeff Kirsher, Jesse Brandeburg, Shannon Nelson,
	Carolyn Wyborny, Don Skidmore, Bruce Allan, John Ronciak,
	Mitch Williams
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

fm10k_open requires rtnl_lock to be held.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Shannon Nelson <shannon.nelson@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index f0992950e228eb..04304d9a633935 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -2362,8 +2362,10 @@ static void fm10k_io_resume(struct pci_dev *pdev)
 	/* reset clock */
 	fm10k_ts_reset(interface);
 
+	rtnl_lock();
 	if (netif_running(netdev))
 		err = fm10k_open(netdev);
+	rtnl_unlock();
 
 	/* final check of hardware state before registering the interface */
 	err = err ? : fm10k_hw_ready(interface);
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 1/7] benet: be_resume needs to protect be_open with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev
  Cc: jesse, Sathya Perla, Ajit Khaparde, Padmanabh Ratnakar,
	Sriharsha Basavapatna, Somnath Kotur
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

be_open calls down to functions which expects rtnl lock to be held.

Cc: Sathya Perla <sathya.perla@broadcom.com>
Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
Cc: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Cc: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/emulex/benet/be_main.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 536686476369bf..ed98ef1ecac38d 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4890,11 +4890,13 @@ static int be_resume(struct be_adapter *adapter)
 	if (status)
 		return status;
 
-	if (netif_running(netdev)) {
+	rtnl_lock();
+	if (netif_running(netdev))
 		status = be_open(netdev);
-		if (status)
-			return status;
-	}
+	rtnl_unlock();
+
+	if (status)
+		return status;
 
 	netif_device_attach(netdev);
 
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 0/7] net: network drivers should not depend on geneve/vxlan
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev; +Cc: jesse

This patchset removes the dependency of network drivers on vxlan or
geneve, so those don't get autoloaded when the nic driver is loaded.

Also audited the code such that vxlan_get_rx_port and geneve_get_rx_port
are not called without rtnl lock.

Hannes Frederic Sowa (7):
  benet: be_resume needs to protect be_open with rtnl_lock
  fm10k: protect fm10k_open in fm10k_io_resume with rtnl_lock
  mlx4: protect mlx4_en_start_port in mlx4_en_restart with rtnl_lock
  ixgbe: protect vxlan_get_rx_port in ixgbe_service_task with rtnl_lock
  qlcnic: protect qlicnic_attach_func with rtnl_lock
  vxlan: break dependency with netdev drivers
  geneve: break dependency with netdev drivers

 drivers/net/ethernet/emulex/benet/be_main.c      | 10 +++++---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c     |  2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  2 ++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  2 ++
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 10 ++++++--
 drivers/net/geneve.c                             | 31 +++++++++++++++++++++---
 drivers/net/vxlan.c                              | 14 +++++++----
 include/linux/netdevice.h                        |  2 ++
 include/net/geneve.h                             |  6 ++---
 include/net/vxlan.h                              |  6 ++---
 10 files changed, 63 insertions(+), 22 deletions(-)

-- 
2.5.5

^ permalink raw reply

* [PATCH net-next v2 2/2] bpf: add event output helper for notifications/sampling/logging
From: Daniel Borkmann @ 2016-04-18 19:01 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, netdev, linux-kernel, Daniel Borkmann,
	Alexei Starovoitov
In-Reply-To: <cover.1461005139.git.daniel@iogearbox.net>

This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").

The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.

User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.

While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.

Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h      |  2 ++
 kernel/bpf/core.c        |  7 +++++++
 kernel/trace/bpf_trace.c | 27 +++++++++++++++++++++++++++
 net/core/filter.c        |  2 ++
 4 files changed, 38 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5fb3c61..f63afdc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -169,7 +169,9 @@ u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 void bpf_fd_array_map_clear(struct bpf_map *map);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
+
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
+const struct bpf_func_proto *bpf_get_event_output_proto(void);

 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index be0abf6..e4248fe 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -764,14 +764,21 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
 const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
+
 const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
+
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
 	return NULL;
 }

+const struct bpf_func_proto * __weak bpf_get_event_output_proto(void)
+{
+	return NULL;
+}
+
 /* Always built-in helper functions. */
 const struct bpf_func_proto bpf_tail_call_proto = {
 	.func		= NULL,
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6bfe55c..75cb154 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -277,6 +277,33 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = {
 	.arg5_type	= ARG_CONST_STACK_SIZE,
 };

+static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
+
+static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
+
+	perf_fetch_caller_regs(regs);
+
+	return bpf_perf_event_output((long)regs, r2, flags, r4, size);
+}
+
+static const struct bpf_func_proto bpf_event_output_proto = {
+	.func		= bpf_event_output,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_STACK,
+	.arg5_type	= ARG_CONST_STACK_SIZE,
+};
+
+const struct bpf_func_proto *bpf_get_event_output_proto(void)
+{
+	return &bpf_event_output_proto;
+}
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5d2ac2b..218e5de 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2039,6 +2039,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 		return &bpf_redirect_proto;
 	case BPF_FUNC_get_route_realm:
 		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_perf_event_output:
+		return bpf_get_event_output_proto();
 	default:
 		return sk_filter_func_proto(func_id);
 	}
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v2 1/2] bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
From: Daniel Borkmann @ 2016-04-18 19:01 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, netdev, linux-kernel, Daniel Borkmann,
	Alexei Starovoitov
In-Reply-To: <cover.1461005139.git.daniel@iogearbox.net>

Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
per-CPU ring buffers and the eBPF program pushes the data into the current
CPU's ring buffer which saves us an extra helper function call in eBPF.
Also, make sure to properly reserve the remaining flags which are not used.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h | 4 ++++
 kernel/trace/bpf_trace.c | 7 ++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70eda5a..b7b0fb1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -347,6 +347,10 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
 #define BPF_F_DONT_FRAGMENT		(1ULL << 2)
 
+/* BPF_FUNC_perf_event_output flags. */
+#define BPF_F_INDEX_MASK		0xffffffffULL
+#define BPF_F_CURRENT_CPU		BPF_F_INDEX_MASK
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6855878..6bfe55c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,11 +225,12 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 {
 	struct pt_regs *regs = (struct pt_regs *) (long) r1;
 	struct bpf_map *map = (struct bpf_map *) (long) r2;
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u64 index = flags & BPF_F_INDEX_MASK;
 	void *data = (void *) (long) r4;
 	struct perf_sample_data sample_data;
 	struct perf_event *event;
@@ -239,6 +240,10 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 		.data = data,
 	};
 
+	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+		return -EINVAL;
+	if (index == BPF_F_CURRENT_CPU)
+		index = raw_smp_processor_id();
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v2 0/2] BPF updates
From: Daniel Borkmann @ 2016-04-18 19:01 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, tgraf, netdev, linux-kernel, Daniel Borkmann

This minor set adds a new helper bpf_event_output() for eBPF cls/act
program types which allows to pass events to user space applications.
For details, please see individual patches.

Thanks!

v1 -> v2:
  - Address kbuild bot found compile issue in patch 2
  - Rest as is

Daniel Borkmann (2):
  bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
  bpf: add event output helper for notifications/sampling/logging

 include/linux/bpf.h      |  2 ++
 include/uapi/linux/bpf.h |  4 ++++
 kernel/bpf/core.c        |  7 +++++++
 kernel/trace/bpf_trace.c | 34 +++++++++++++++++++++++++++++++++-
 net/core/filter.c        |  2 ++
 5 files changed, 48 insertions(+), 1 deletion(-)

-- 
1.9.3

^ permalink raw reply

* Re: [PATCH net-next] macvlan: fix failure during registration
From: Eric W. Biederman @ 2016-04-18 18:48 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <1460992811-46992-1-git-send-email-fruggeri@arista.com>

Francesco Ruggeri <fruggeri@arista.com> writes:

> Resending, did not include netdev the first time ...
>
> If a macvlan/macvtap creation fails in register_netdevice in
> call_netdevice_notifiers(NETDEV_REGISTER) then while cleaning things up in
> rollback_registered_many it invokes macvlan_uninit. This results in
> port->count being decremented twice (in macvlan_uninit and in
> macvlan_common_newlink).
> A similar problem may exist in the ipvlan driver.
> This patch adds priv_flags to struct macvlan_dev and a flag that
> macvlan_uninit can use to determine if it is invoked in the context of a
> failed newlink.
> In macvtap_device_event(NETDEV_UNREGISTER) it also avoids cleaning up
> /dev/tapNN in case creation of the char device had previously failed.
> Tested in 3.18.

These interactions all seem a little bit funny.  At a quick skim it
would make more sense to increment the port count in macvlan_init,
and completely remove the need to mess with port counts anywhere except
macvlan_init and macvlan_uninit.

If for some reason that can't be done the code can easily look at
dev->reg_state.  If dev->reg_state == NETREG_UNITIALIZED it should
be exactly the same as your new flag being set.

> diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
> index a4ccc31..7cf82d8 100644
> --- a/include/linux/if_macvlan.h
> +++ b/include/linux/if_macvlan.h
> @@ -48,6 +48,7 @@ struct macvlan_dev {
>  	netdev_features_t	set_features;
>  	enum macvlan_mode	mode;
>  	u16			flags;
> +	u16			priv_flags;
>  	/* This array tracks active taps. */
>  	struct macvtap_queue	__rcu *taps[MAX_MACVTAP_QUEUES];
>  	/* This list tracks all taps (both enabled and disabled) */
> @@ -63,6 +64,8 @@ struct macvlan_dev {
>  	unsigned int		macaddr_count;
>  };
>  
> +#define MACVLAN_PRIV_FLAG_REGISTERING	1
> +
>  static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
>  				    unsigned int len, bool success,
>  				    bool multicast)

^ permalink raw reply

* Re: [PATCH net-next] enic: set netdev->vlan_features
From: David Miller @ 2016-04-18 18:53 UTC (permalink / raw)
  To: gvaradar; +Cc: netdev, benve, mkubecek, _govind
In-Reply-To: <1460747443-14993-1-git-send-email-gvaradar@cisco.com>

From: Govindarajulu Varadarajan <gvaradar@cisco.com>
Date: Sat, 16 Apr 2016 00:40:43 +0530

> From: Govindarajulu Varadarajan <_govind@gmx.com>
> 
> Driver sets vlan_feature to netdev->features as hardware supports all of
> them on vlan interface.
> 
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>

Applied.

^ permalink raw reply

* Re: [PATCH 0/3] fec: ethtool: move to new api {get|set}_link_ksettings
From: David Miller @ 2016-04-18 18:45 UTC (permalink / raw)
  To: tremyfr; +Cc: decot, f.fainelli, fugang.duan, linux-kernel, netdev
In-Reply-To: <1460673301-10599-1-git-send-email-tremyfr@gmail.com>

From: Philippe Reynes <tremyfr@gmail.com>
Date: Fri, 15 Apr 2016 00:34:58 +0200

> Ethtool has a new api {get|set}_link_ksettings that deprecate
> the old api {get|set}_settings. We update the fec driver to use
> this new ethtool api.
> 
> For this first version, I've converted old u32 value in phy structure
> to link_modes structure. Another way would be to replace u32 in
> phy structure to use DECLARE_LINK_MODE_MASK for advertising, ....

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] netfilter: ctnetlink: add more #ifdef around unused code
From: Pablo Neira Ayuso @ 2016-04-18 18:43 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Daniel Borkmann, Ken-ichirou MATSUZAWA, netfilter-devel, coreteam,
	netdev, linux-kernel
In-Reply-To: <4187960.DvZepuNvUx@wuerfel>

On Mon, Apr 18, 2016 at 08:33:15PM +0200, Arnd Bergmann wrote:
> On Monday 18 April 2016 20:16:59 Pablo Neira Ayuso wrote:
> > On Sat, Apr 16, 2016 at 10:17:43PM +0200, Arnd Bergmann wrote:
> > > A recent patch removed many 'inline' annotations for static
> > > functions in this file, which has caused warnings for functions
> > > that are not used in a given configuration, in particular when
> > > CONFIG_NF_CONNTRACK_EVENTS is disabled:
> > > 
> > > nf_conntrack_netlink.c:572:15: 'ctnetlink_timestamp_size' defined but not used
> > > nf_conntrack_netlink.c:546:15: 'ctnetlink_acct_size' defined but not used
> > > nf_conntrack_netlink.c:339:12: 'ctnetlink_label_size' defined but not used
> > 
> > Arnd, thanks for the fix.
> > 
> > I'm planning to push this though:
> > 
> > http://patchwork.ozlabs.org/patch/610820/
> > 
> > This is restoring the inlines for the size calculation functions, but
> > I think that's ok. They are rather small and they're called from the
> > event notification path (ie. packet path), so the compiler just place
> > them out of the way when not needed and we calm down the gcc warning.
> 
> Looks good. I'll put this in my randconfig builder to replace my own
> patch and will let you know if you missed something.

Thanks, will wait for your ack then.

^ permalink raw reply

* Re: [PATCH net-next] tun: don't require serialization lock on tx
From: David Miller @ 2016-04-18 18:36 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, mst, hannes, ebiederm, gkurz, jasowang, edumazet
In-Reply-To: <cc7ac4631f0a32c4d61b73f4f28b52a05ab8651d.1460651429.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 14 Apr 2016 18:39:39 +0200

> The current tun_net_xmit() implementation don't need any external
> lock since it relies on rcu protection for the tun data structure
> and on socket queue lock for skb queuing.
> 
> This patch set the NETIF_F_LLTX feature bit in the tun device, so
> that on xmit, in absence of qdisc, no serialization lock is acquired
> by the caller.
> 
> The user space can remove the default tun qdisc with:
> 
> tc qdisc replace dev <tun device name> root noqueue
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH nf] netfilter: ipv6: Orphan skbs in nf_ct_frag6_gather()
From: Pablo Neira Ayuso @ 2016-04-18 18:35 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Florian Westphal, David Laight, netfilter-devel@vger.kernel.org,
	diproiettod@vmware.com, netdev@vger.kernel.org
In-Reply-To: <CAPWQB7EV-boR1_yaBeY8tQtHgc09+14jQ_q7yk=bpwrtqL+20Q@mail.gmail.com>

On Thu, Apr 14, 2016 at 05:35:39PM -0700, Joe Stringer wrote:
> On 14 April 2016 at 03:35, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Thu, Apr 14, 2016 at 10:40:15AM +0200, Florian Westphal wrote:
> >> David Laight <David.Laight@ACULAB.COM> wrote:
> >> > From: Joe Stringer
> >> > > Sent: 13 April 2016 19:10
> >> > > This is the IPv6 equivalent of commit 8282f27449bf ("inet: frag: Always
> >> > > orphan skbs inside ip_defrag()").
> >> > >
> >> > > Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
> >> > > clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
> >> > > cloned (implicitly orphaning) prior to queueing for reassembly. As such,
> >> > > when the IPv6 message is eventually reassembled, the skb->sk for all
> >> > > fragments would be NULL. After that commit was introduced, rather than
> >> > > cloning, the original skbs were queued directly without orphaning. The
> >> > > end result is that all frags except for the first and last may have a
> >> > > socket attached.
> >> >
> >> > I'd have thought that the queued fragments would still want to be
> >> > resource-counted against the socket (I think that is what skb->sk is for).
> >>
> >> No, ipv4/ipv6 reasm has its own accouting.
> >>
> >> > Although I can't imagine why IPv6 reassembly is happening on skb
> >> > associated with a socket.
> >>
> >> Right, thats a much more interesting question -- both ipv4 and
> >> ipv6 orphan skbs before NF_HOOK prerouting trip.
> >>
> >> (That being said, I don't mind the patch, I'm just be curious how this
> >>  can happen).
> >
> > If this change is specific to get this working in ovs and its
> > conntrack support, then I don't think this belong to core
> > infrastructure. This should be fixed in ovs instead.
> 
> I admit I've only been able to reproduce it with OVS. My main reason
> for proposing the fix this way was just because this is what the IPv4
> code does, so I figured IPv6 should be consistent with that.

You mean that this is what you did in 029f7f3b8701 to fix this, right?

But we shouldn't add code to the core that is OVS specific for no
reason. We don't need this orphan from ipv4 and ipv6 as Florian
indicated.

Is there any chance you can fix this from OVS and its conntrack glue
code? Thanks.

^ permalink raw reply

* Re: [PATCH] netfilter: ctnetlink: add more #ifdef around unused code
From: Arnd Bergmann @ 2016-04-18 18:33 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Daniel Borkmann, Ken-ichirou MATSUZAWA, netfilter-devel, coreteam,
	netdev, linux-kernel
In-Reply-To: <20160418181659.GA2427@salvia>

On Monday 18 April 2016 20:16:59 Pablo Neira Ayuso wrote:
> On Sat, Apr 16, 2016 at 10:17:43PM +0200, Arnd Bergmann wrote:
> > A recent patch removed many 'inline' annotations for static
> > functions in this file, which has caused warnings for functions
> > that are not used in a given configuration, in particular when
> > CONFIG_NF_CONNTRACK_EVENTS is disabled:
> > 
> > nf_conntrack_netlink.c:572:15: 'ctnetlink_timestamp_size' defined but not used
> > nf_conntrack_netlink.c:546:15: 'ctnetlink_acct_size' defined but not used
> > nf_conntrack_netlink.c:339:12: 'ctnetlink_label_size' defined but not used
> 
> Arnd, thanks for the fix.
> 
> I'm planning to push this though:
> 
> http://patchwork.ozlabs.org/patch/610820/
> 
> This is restoring the inlines for the size calculation functions, but
> I think that's ok. They are rather small and they're called from the
> event notification path (ie. packet path), so the compiler just place
> them out of the way when not needed and we calm down the gcc warning.

Looks good. I'll put this in my randconfig builder to replace my own
patch and will let you know if you missed something.

	Arnd

^ permalink raw reply

* Re: Poorer networking performance in later kernels?
From: Rick Jones @ 2016-04-18 18:22 UTC (permalink / raw)
  To: Butler, Peter, netdev@vger.kernel.org
In-Reply-To: <BY2PR0301MB1990F3095CFA07EBBE600F9DD66B0@BY2PR0301MB1990.namprd03.prod.outlook.com>

On 04/18/2016 04:27 AM, Butler, Peter wrote:
> Hi Rick
>
> Thanks for the reply.
>
> Here is some hardware information, as requested (the two systems are
> identical, and are communicating with one another over a 10GB
> full-duplex Ethernet backplane):
>
> - processor type: Intel(R) Xeon(R) CPU C5528  @ 2.13GHz
> - NIC: Intel 82599EB 10GB XAUI/BX4
> - NIC driver: ixgbe version 4.2.1-k (part of 4.4.0 kernel)
>
> As for the buffer sizes, those rather large ones work fine for us
> with the 3.4.2 kernel.  However, for the sake of being complete, I
> have re-tried the tests with the 'standard' 4.4.0 kernel parameters
> for all /proc/sys/net/* values, and the results still were extremely
> poor in comparison to the 3.4.2 kernel.
>
> Our MTU is actually just the standard 1500 bytes, however the message
> size was chosen to mimic actual traffic which will be segmented.
>
> I ran ethtool -k (indeed I checked all ethtool parameters, not just
> those via -k) and the only real difference I could find was in
> "large-receive-offload" which was ON in 3.4.2 but OFF in 4.4.0 - so I
> used ethtool to change this to match the 3.4.2 settings and re-ran
> the tests.  Didn't help :-(   It's possible of course that I have
> missed a parameter here or there in comparing the 3.4.2 setup to the
> 4.4.0 setup.  I also tried running the ethtool config with the latest
> and greatest ethtool version (4.5) on the 4.4.0 kernel, as compared
> to the old 3.1 version on our 3.4.2 kernel.

So it would seem the stateless offloads are still enabled.  My next 
question would be to wonder if they are still "effective."  To that end, 
you could run a netperf test specifying a particular port number in the 
test-specific portion:

netperf ...   -- -P ,12345

and while that is running something like

tcpdump -s 96 -c 200000 -w /tmp/foo.pcap -i <interface> port 12345

then post-processed with the likes of:

tcpdump -n -r /tmp/foo.pcap | grep -v "length 0" | awk '{sum += 
$NF}END{print "average",sum/NR}'

the intent behind that is to see what the average post-GRO segment size 
happens to be on the receiver and then to compare it between the two 
kernels.  Grepping-away the "length 0" is to avoid counting ACKs and 
look only at data segments.  The specific port number is to avoid 
including any other connections which might happen to have traffic 
passing through at the time.

You could I suspect do the same comparison on the sending side.

There might I suppose be an easier way to get the average segment size - 
perhaps something from looking at ethtool stats - but the stone knives 
and bear skins of tcpdump above would have the added benefit of having a 
packet trace or three for someone to look at if they felt the need.  And 
for that, I would actually suggest starting the capture *before* the 
netperf test so the connection establishment is included.

> I performed the TCP_RR test as requested and in that case, the
> results are much more comparable.  The old kernel is still better,
> but now only around 10% better as opposed to 2-3x better.

Did the service demand change by 10% or just the transaction rate?

> However I still contend that the *_STREAM tests are giving us more
> pertinent data, since our product application is only getting 1/3 to
> 1/2 half of the performance on the 4.4.0 kernel, and this is the same
> thing I see when I use netperf to test.
>
> One other note: I tried running our 3.4.2 and 4.4.0 kernels in a VM
> environment on my workstation, so as to take the 'real' production
> hardware out of the equation.  When I perform the tests in this setup
> the 3.4.2 and 4.4.0 kernels perform identically - just as you would
> expect.

Running in a VM will likely change things massively and could I suppose 
mask other behaviour changes.

happy benchmarking,

rick jones
raj@tardy:~$ cat signatures/toppost
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

:)

>
> Any other ideas?  What can I be missing here?
>
> Peter
>
>
>
>
> -----Original Message-----
> From: Rick Jones [mailto:rick.jones2@hpe.com]
> Sent: April-15-16 6:37 PM
> To: Butler, Peter <pbutler@sonusnet.com>; netdev@vger.kernel.org
> Subject: Re: Poorer networking performance in later kernels?
>
> On 04/15/2016 02:02 PM, Butler, Peter wrote:
>> (Please keep me CC'd to all comments/responses)
>>
>> I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop
>> in networking performance.  Nothing was changed on the test systems,
>> other than the kernel itself (and kernel modules).  The identical
>> .config used to build the 3.4.2 kernel was brought over into the
>> 4.4.0 kernel source tree, and any configuration differences (e.g. new
>> parameters, etc.) were taken as default values.
>>
>> The testing was performed on the same actual hardware for both kernel
>> versions (i.e. take the existing 3.4.2 physical setup, simply boot
>> into the (new) kernel and run the same test).  The netperf utility was
>> used for benchmarking and the testing was always performed on idle
>> systems.
>>
>> TCP testing yielded the following results, where the 4.4.0 kernel only
>> got about 1/2 of the throughput:
>>
>
>>         Recv     Send       Send                          Utilization       Service Demand
>>         Socket   Socket     Message Elapsed               Send     Recv     Send    Recv
>>         Size     Size       Size    Time       Throughput local    remote   local   remote
>>         bytes    bytes      bytes   secs.      10^6bits/s % S      % S      us/KB   us/KB
>>
>> 3.4.2 13631488 13631488   8952    30.01      9370.29    10.14    6.50     0.709   0.454
>> 4.4.0 13631488 13631488   8952    30.02      5314.03    9.14     14.31    1.127   1.765
>>
>> SCTP testing yielded the following results, where the 4.4.0 kernel only got about 1/3 of the throughput:
>>
>>         Recv     Send       Send                          Utilization       Service Demand
>>         Socket   Socket     Message Elapsed               Send     Recv     Send    Recv
>>         Size     Size       Size    Time       Throughput local    remote   local   remote
>>         bytes    bytes      bytes   secs.      10^6bits/s  % S     % S      us/KB   us/KB
>>
>> 3.4.2 13631488 13631488   8952    30.00      2306.22    13.87    13.19    3.941   3.747
>> 4.4.0 13631488 13631488   8952    30.01       882.74    16.86    19.14    12.516  14.210
>>
>> The same tests were performed a multitude of time, and are always
>> consistent (within a few percent).  I've also tried playing with
>> various run-time kernel parameters (/proc/sys/kernel/net/...) on the
>> 4.4.0 kernel to alleviate the issue but have had no success at all.
>>
>> I'm at a loss as to what could possibly account for such a discrepancy...
>>
>
> I suspect I am not alone in being curious about the CPU(s) present in the systems and the model/whatnot of the NIC being used.  I'm also curious as to why you have what at first glance seem like absurdly large socket buffer sizes.
>
> That said, it looks like you have some Really Big (tm) increases in service demand.  Many more CPU cycles being consumed per KB of data transferred.
>
> Your message size makes me wonder if you were using a 9000 byte MTU.
>
> Perhaps in the move from 3.4.2 to 4.4.0 you lost some or all of the stateless offloads for your NIC(s)?  Running ethtool -k <interface> on both ends under both kernels might be good.
>
> Also, if you did have a 9000 byte MTU under 3.4.2 are you certain you still had it under 4.4.0?
>
> It would (at least to me) also be interesting to run a TCP_RR test comparing the two kernels.  TCP_RR (at least with the default request/response size of one byte) doesn't really care about stateless offloads or MTUs and could show how much difference there is in basic path length (or I suppose in interrupt coalescing behaviour if the NIC in question has a mildly dodgy heuristic for such things).
>
> happy benchmarking,
>
> rick jones
>

^ permalink raw reply

* Re: [PATCH] netfilter: ctnetlink: add more #ifdef around unused code
From: Pablo Neira Ayuso @ 2016-04-18 18:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Daniel Borkmann, Ken-ichirou MATSUZAWA, netfilter-devel, coreteam,
	netdev, linux-kernel
In-Reply-To: <1460837916-1241019-1-git-send-email-arnd@arndb.de>

On Sat, Apr 16, 2016 at 10:17:43PM +0200, Arnd Bergmann wrote:
> A recent patch removed many 'inline' annotations for static
> functions in this file, which has caused warnings for functions
> that are not used in a given configuration, in particular when
> CONFIG_NF_CONNTRACK_EVENTS is disabled:
> 
> nf_conntrack_netlink.c:572:15: 'ctnetlink_timestamp_size' defined but not used
> nf_conntrack_netlink.c:546:15: 'ctnetlink_acct_size' defined but not used
> nf_conntrack_netlink.c:339:12: 'ctnetlink_label_size' defined but not used

Arnd, thanks for the fix.

I'm planning to push this though:

http://patchwork.ozlabs.org/patch/610820/

This is restoring the inlines for the size calculation functions, but
I think that's ok. They are rather small and they're called from the
event notification path (ie. packet path), so the compiler just place
them out of the way when not needed and we calm down the gcc warning.

Thanks!

^ permalink raw reply

* Re: Poorer networking performance in later kernels?
From: Eric Dumazet @ 2016-04-18 18:05 UTC (permalink / raw)
  To: Butler, Peter; +Cc: netdev@vger.kernel.org
In-Reply-To: <BY2PR0301MB1990B79988EF1A3FB461882FD66B0@BY2PR0301MB1990.namprd03.prod.outlook.com>

On Mon, 2016-04-18 at 16:27 +0000, Butler, Peter wrote:
> Hi Eric
> 
> Thanks for your response.  My apologies for being late in getting back
> to you - I wasn't able to have access to the lab hardware on the
> weekend.
> 
> I performed your test as suggested - I've provided a side-by-side diff
> of the nstat output below for the SCTP test only (not the TCP test).
> Note that the fields that are output are somewhat different for the
> two kernels - i.e. some fields exist in one but not the other
> (presumably this comes from the kernel internals?).
> 
> Other than seeing 'larger' throughput numbers in this output I'm not
> sure what to take from it - I'm certainly not a networking
> expert :-(   Let me know if there's anything that speaks to you.
> 
> Note that this test was again done on a clean, freshly rebooted and
> idle system.  Let me know if there's any issues with the output format
> of this data in the email.
> 
> Thanks,
> 
> Peter


OK, please do not top-post on netdev.


I can not really comment on SCTP, could you please post numbers with
TCP ?

Thanks !

^ permalink raw reply

* Greetings!!!
From: andreas111 @ 2016-04-18 17:49 UTC (permalink / raw)
  To: Mr. Eric

Hi, how are you? My name is J Eric Denials, External Financial Auditor at Lloyds Banking Group plc., London. It is a pleasure to contact you at this time through this medium. I have a cool and legitimate deal to do with you as you're a foreigner, it will be mutually beneficial to both. If you’re interested, urgently, get back to me for more explanation about the deal.
Regards
Eric

^ permalink raw reply

* Re: switchdev fib offload issues
From: David Miller @ 2016-04-18 17:52 UTC (permalink / raw)
  To: jiri
  Cc: netdev, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
	john.fastabend, rami.rosen, gospo, stephen, sfeldma, dsa,
	f.fainelli, andrew, vivien.didelot, tgraf, aduyck
In-Reply-To: <20160418154757.GA2059@nanopsycho.amit.cz>

From: Jiri Pirko <jiri@resnulli.us>
Date: Mon, 18 Apr 2016 17:47:57 +0200

> However, if for any reason the switchdev add operation fails, there is an
> abort function called (switchdev_fib_ipv4_abort). This function does two
> things which are both unfortunate in many usecases:
> 1) evicts all fib entries from HW leaving all processing done in kernel
>     - For Spectrum ASIC this means that all traffic running at 100G between
>       all ports is immediately downgraded to ~1-3Gbits
>     - Also this happens silently, user knows nothing about anything went wrong,
>       only forwarding performance suddenly sucks.
> 
> 2) sets net->ipv4.fib_offload_disabled = true
>     - That results in no other fib entry being offloaded, forever,
>       until net is removed and added again, machine reboot is required
>       in case if init_ns
> 
> These 2 issues makes fib offload completely unusable. So I propose
> to start thinking about fixing this.
> 
> I believe that although the current behaviour might be good for default,
> user should be able to change it by setting a different policy. This
> policy will allow to propagate offload error to user.

There were many length discussions about this.

It is extremely hard to load a partial table into the chip and
have it work correctly.  This is because with longest matching
prefix you have to pull out the least specific routing entires
and process them in software.

Also, there is no communication about what makes an entry not be
insertable or not.  So it may be the case that the shorter prefixes
all fit into the table, because those can be compressed and take
up less table space in the chip.

So it's extremely hard to know when "room" is available again.  Room
for what?  One 16-bit prefixed route?  Or room for one arbitrarily
prefixed route?  Which is it?

The user shouldn't need to know anything about this, and I will
be strongly against any design which puts the onus on the user
to configure a table that will fit into the chip.

^ permalink raw reply

* Re: [PATCH V2] net: ethernet: mellanox: correct page conversion
From: Sinan Kaya @ 2016-04-18 17:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eli Cohen, linux-rdma@vger.kernel.org, timur@codeaurora.org,
	cov@codeaurora.org, Yishai Hadas, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20160418154058.GA4228@infradead.org>

On 4/18/2016 11:40 AM, Christoph Hellwig wrote:
> On Mon, Apr 18, 2016 at 11:21:12AM -0400, Sinan Kaya wrote:
>> I was looking at the code. I don't see how removing virt_to_page + vmap 
>> would solve the issue.
>>
>> The code is trying to access the buffer space with direct.buf member
>> from the CPU side. This member would become NULL, when this code is 
>> removed and also in mlx4_en_map_buffer. 
>>
>> ...
>>
>> What am I missing?
> 
> As mentioned before you'll also need to enforce you hit the nbufs = 1
> case for these.  In fact most callers should simply switch to a plain
> dma_zalloc_coherent call without all these wrappers.  If we have a case
> where we really want multiple buffers that don't have to be contiguous
> (maybe the MTT case) I'd rather opencode that instead of building this
> confusing interface on top of it.
> 

I hit the first problem with CQE. The alloc routine is allocating pages
but CQE code is trying to do linear access with direct buf member. 


I see that this code implements page_list support. I'd like to do the same
thing for CQE. Let me know if I'm in the right path.

static struct mlx4_eqe *get_eqe(struct mlx4_eq *eq, u32 entry, u8 eqe_factor,
				u8 eqe_size)


-- 
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

^ permalink raw reply

* [PATCH net-next 0/3] qede: Bug fixes
From: Manish Chopra @ 2016-04-18 17:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

Hi David,

This series fixes -

* various memory allocation failure flows for fastpath
* issues with respect to driver GRO packets handling

Please note that I have intentionally targeted this for "net-next"
to avoid merge conflicts when net-next changes gets merged to net

Please consider applying this series to "net-next"

Thanks,
Manish

Manish Chopra (3):
  qede: Fix various memory allocation error flows for fastpath
  qede: Fix setting Skb network header
  qede: Fix single MTU sized packet from firmware GRO flow

 drivers/net/ethernet/qlogic/qede/qede_main.c | 157 +++++++++++++++++----------
 1 file changed, 100 insertions(+), 57 deletions(-)

-- 
2.7.2

^ permalink raw reply

* [PATCH net-next 1/3] qede: Fix various memory allocation error flows for fastpath
From: Manish Chopra @ 2016-04-18 17:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1460999167-3746-1-git-send-email-manish.chopra@qlogic.com>

This patch handles memory allocation failures for fastpath
gracefully in the driver.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 140 ++++++++++++++++-----------
 1 file changed, 85 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index e5dc35a..4d76ac0 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -774,6 +774,12 @@ static bool qede_has_tx_work(struct qede_fastpath *fp)
 	return false;
 }
 
+static inline void qede_rx_bd_ring_consume(struct qede_rx_queue *rxq)
+{
+	qed_chain_consume(&rxq->rx_bd_ring);
+	rxq->sw_rx_cons++;
+}
+
 /* This function reuses the buffer(from an offset) from
  * consumer index to producer index in the bd ring
  */
@@ -797,6 +803,21 @@ static inline void qede_reuse_page(struct qede_dev *edev,
 	curr_cons->data = NULL;
 }
 
+/* In case of allocation failures reuse buffers
+ * from consumer index to produce buffers for firmware
+ */
+static void qede_recycle_rx_bd_ring(struct qede_rx_queue *rxq,
+				    struct qede_dev *edev, u8 count)
+{
+	struct sw_rx_data *curr_cons;
+
+	for (; count > 0; count--) {
+		curr_cons = &rxq->sw_rx_ring[rxq->sw_rx_cons & NUM_RX_BDS_MAX];
+		qede_reuse_page(edev, rxq, curr_cons);
+		qede_rx_bd_ring_consume(rxq);
+	}
+}
+
 static inline int qede_realloc_rx_buffer(struct qede_dev *edev,
 					 struct qede_rx_queue *rxq,
 					 struct sw_rx_data *curr_cons)
@@ -805,8 +826,14 @@ static inline int qede_realloc_rx_buffer(struct qede_dev *edev,
 	curr_cons->page_offset += rxq->rx_buf_seg_size;
 
 	if (curr_cons->page_offset == PAGE_SIZE) {
-		if (unlikely(qede_alloc_rx_buffer(edev, rxq)))
+		if (unlikely(qede_alloc_rx_buffer(edev, rxq))) {
+			/* Since we failed to allocate new buffer
+			 * current buffer can be used again.
+			 */
+			curr_cons->page_offset -= rxq->rx_buf_seg_size;
+
 			return -ENOMEM;
+		}
 
 		dma_unmap_page(&edev->pdev->dev, curr_cons->mapping,
 			       PAGE_SIZE, DMA_FROM_DEVICE);
@@ -928,7 +955,10 @@ static int qede_fill_frag_skb(struct qede_dev *edev,
 			   len_on_bd);
 
 	if (unlikely(qede_realloc_rx_buffer(edev, rxq, current_bd))) {
-		tpa_info->agg_state = QEDE_AGG_STATE_ERROR;
+		/* Incr page ref count to reuse on allocation failure
+		 * so that it doesn't get freed while freeing SKB.
+		 */
+		atomic_inc(&current_bd->data->_count);
 		goto out;
 	}
 
@@ -942,6 +972,8 @@ static int qede_fill_frag_skb(struct qede_dev *edev,
 	return 0;
 
 out:
+	tpa_info->agg_state = QEDE_AGG_STATE_ERROR;
+	qede_recycle_rx_bd_ring(rxq, edev, 1);
 	return -ENOMEM;
 }
 
@@ -993,8 +1025,9 @@ static void qede_tpa_start(struct qede_dev *edev,
 	tpa_info->skb = netdev_alloc_skb(edev->ndev,
 					 le16_to_cpu(cqe->len_on_first_bd));
 	if (unlikely(!tpa_info->skb)) {
+		DP_NOTICE(edev, "Failed to allocate SKB for gro\n");
 		tpa_info->agg_state = QEDE_AGG_STATE_ERROR;
-		return;
+		goto cons_buf;
 	}
 
 	skb_put(tpa_info->skb, le16_to_cpu(cqe->len_on_first_bd));
@@ -1017,6 +1050,7 @@ static void qede_tpa_start(struct qede_dev *edev,
 	/* This is needed in order to enable forwarding support */
 	qede_set_gro_params(edev, tpa_info->skb, cqe);
 
+cons_buf: /* We still need to handle bd_len_list to consume buffers */
 	if (likely(cqe->ext_bd_len_list[0]))
 		qede_fill_frag_skb(edev, rxq, cqe->tpa_agg_index,
 				   le16_to_cpu(cqe->ext_bd_len_list[0]));
@@ -1313,17 +1347,17 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget)
 				  "CQE in CONS = %u has error, flags = %x, dropping incoming packet\n",
 				  sw_comp_cons, parse_flag);
 			rxq->rx_hw_errors++;
-			qede_reuse_page(edev, rxq, sw_rx_data);
-			goto next_rx;
+			qede_recycle_rx_bd_ring(rxq, edev, fp_cqe->bd_num);
+			goto next_cqe;
 		}
 
 		skb = netdev_alloc_skb(edev->ndev, QEDE_RX_HDR_SIZE);
 		if (unlikely(!skb)) {
 			DP_NOTICE(edev,
 				  "Build_skb failed, dropping incoming packet\n");
-			qede_reuse_page(edev, rxq, sw_rx_data);
+			qede_recycle_rx_bd_ring(rxq, edev, fp_cqe->bd_num);
 			rxq->rx_alloc_errors++;
-			goto next_rx;
+			goto next_cqe;
 		}
 
 		/* Copy data into SKB */
@@ -1357,11 +1391,22 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget)
 			if (unlikely(qede_realloc_rx_buffer(edev, rxq,
 							    sw_rx_data))) {
 				DP_ERR(edev, "Failed to allocate rx buffer\n");
+				/* Incr page ref count to reuse on allocation
+				 * failure so that it doesn't get freed while
+				 * freeing SKB.
+				 */
+
+				atomic_inc(&sw_rx_data->data->_count);
 				rxq->rx_alloc_errors++;
+				qede_recycle_rx_bd_ring(rxq, edev,
+							fp_cqe->bd_num);
+				dev_kfree_skb_any(skb);
 				goto next_cqe;
 			}
 		}
 
+		qede_rx_bd_ring_consume(rxq);
+
 		if (fp_cqe->bd_num != 1) {
 			u16 pkt_len = le16_to_cpu(fp_cqe->pkt_len);
 			u8 num_frags;
@@ -1372,18 +1417,27 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget)
 			     num_frags--) {
 				u16 cur_size = pkt_len > rxq->rx_buf_size ?
 						rxq->rx_buf_size : pkt_len;
+				if (unlikely(!cur_size)) {
+					DP_ERR(edev,
+					       "Still got %d BDs for mapping jumbo, but length became 0\n",
+					       num_frags);
+					qede_recycle_rx_bd_ring(rxq, edev,
+								num_frags);
+					dev_kfree_skb_any(skb);
+					goto next_cqe;
+				}
 
-				WARN_ONCE(!cur_size,
-					  "Still got %d BDs for mapping jumbo, but length became 0\n",
-					  num_frags);
-
-				if (unlikely(qede_alloc_rx_buffer(edev, rxq)))
+				if (unlikely(qede_alloc_rx_buffer(edev, rxq))) {
+					qede_recycle_rx_bd_ring(rxq, edev,
+								num_frags);
+					dev_kfree_skb_any(skb);
 					goto next_cqe;
+				}
 
-				rxq->sw_rx_cons++;
 				sw_rx_index = rxq->sw_rx_cons & NUM_RX_BDS_MAX;
 				sw_rx_data = &rxq->sw_rx_ring[sw_rx_index];
-				qed_chain_consume(&rxq->rx_bd_ring);
+				qede_rx_bd_ring_consume(rxq);
+
 				dma_unmap_page(&edev->pdev->dev,
 					       sw_rx_data->mapping,
 					       PAGE_SIZE, DMA_FROM_DEVICE);
@@ -1399,7 +1453,7 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget)
 				pkt_len -= cur_size;
 			}
 
-			if (pkt_len)
+			if (unlikely(pkt_len))
 				DP_ERR(edev,
 				       "Mapped all BDs of jumbo, but still have %d bytes\n",
 				       pkt_len);
@@ -1418,10 +1472,6 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget)
 		skb_record_rx_queue(skb, fp->rss_id);
 
 		qede_skb_receive(edev, fp, skb, le16_to_cpu(fp_cqe->vlan_tag));
-
-		qed_chain_consume(&rxq->rx_bd_ring);
-next_rx:
-		rxq->sw_rx_cons++;
 next_rx_only:
 		rx_pkt++;
 
@@ -2432,7 +2482,7 @@ static void qede_free_sge_mem(struct qede_dev *edev,
 		struct qede_agg_info *tpa_info = &rxq->tpa_info[i];
 		struct sw_rx_data *replace_buf = &tpa_info->replace_buf;
 
-		if (replace_buf) {
+		if (replace_buf->data) {
 			dma_unmap_page(&edev->pdev->dev,
 				       dma_unmap_addr(replace_buf, mapping),
 				       PAGE_SIZE, DMA_FROM_DEVICE);
@@ -2552,7 +2602,7 @@ err:
 static int qede_alloc_mem_rxq(struct qede_dev *edev,
 			      struct qede_rx_queue *rxq)
 {
-	int i, rc, size, num_allocated;
+	int i, rc, size;
 
 	rxq->num_rx_buffers = edev->q_num_rx_buffers;
 
@@ -2569,6 +2619,7 @@ static int qede_alloc_mem_rxq(struct qede_dev *edev,
 	rxq->sw_rx_ring = kzalloc(size, GFP_KERNEL);
 	if (!rxq->sw_rx_ring) {
 		DP_ERR(edev, "Rx buffers ring allocation failed\n");
+		rc = -ENOMEM;
 		goto err;
 	}
 
@@ -2596,26 +2647,16 @@ static int qede_alloc_mem_rxq(struct qede_dev *edev,
 	/* Allocate buffers for the Rx ring */
 	for (i = 0; i < rxq->num_rx_buffers; i++) {
 		rc = qede_alloc_rx_buffer(edev, rxq);
-		if (rc)
-			break;
-	}
-	num_allocated = i;
-	if (!num_allocated) {
-		DP_ERR(edev, "Rx buffers allocation failed\n");
-		goto err;
-	} else if (num_allocated < rxq->num_rx_buffers) {
-		DP_NOTICE(edev,
-			  "Allocated less buffers than desired (%d allocated)\n",
-			  num_allocated);
+		if (rc) {
+			DP_ERR(edev,
+			       "Rx buffers allocation failed at index %d\n", i);
+			goto err;
+		}
 	}
 
-	qede_alloc_sge_mem(edev, rxq);
-
-	return 0;
-
+	rc = qede_alloc_sge_mem(edev, rxq);
 err:
-	qede_free_mem_rxq(edev, rxq);
-	return -ENOMEM;
+	return rc;
 }
 
 static void qede_free_mem_txq(struct qede_dev *edev,
@@ -2698,10 +2739,8 @@ static int qede_alloc_mem_fp(struct qede_dev *edev,
 	}
 
 	return 0;
-
 err:
-	qede_free_mem_fp(edev, fp);
-	return -ENOMEM;
+	return rc;
 }
 
 static void qede_free_mem_load(struct qede_dev *edev)
@@ -2724,22 +2763,13 @@ static int qede_alloc_mem_load(struct qede_dev *edev)
 		struct qede_fastpath *fp = &edev->fp_array[rss_id];
 
 		rc = qede_alloc_mem_fp(edev, fp);
-		if (rc)
-			break;
-	}
-
-	if (rss_id != QEDE_RSS_CNT(edev)) {
-		/* Failed allocating memory for all the queues */
-		if (!rss_id) {
+		if (rc) {
 			DP_ERR(edev,
-			       "Failed to allocate memory for the leading queue\n");
-			rc = -ENOMEM;
-		} else {
-			DP_NOTICE(edev,
-				  "Failed to allocate memory for all of RSS queues\n Desired: %d queues, allocated: %d queues\n",
-				  QEDE_RSS_CNT(edev), rss_id);
+			       "Failed to allocate memory for fastpath - rss id = %d\n",
+			       rss_id);
+			qede_free_mem_load(edev);
+			return rc;
 		}
-		edev->num_rss = rss_id;
 	}
 
 	return 0;
-- 
2.7.2

^ permalink raw reply related

* [PATCH net-next 3/3] qede: Fix single MTU sized packet from firmware GRO flow
From: Manish Chopra @ 2016-04-18 17:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1460999167-3746-1-git-send-email-manish.chopra@qlogic.com>

In firmware assisted GRO flow there could be a single MTU sized
segment arriving due to firmware aggregation timeout/last segment
in an aggregation flow, which is not expected to be an actual gro
packet. So If a skb has zero frags from the GRO flow then simply
push it in the stack as non gso skb.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 8208e1f..197ef85 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1096,6 +1096,17 @@ static void qede_gro_receive(struct qede_dev *edev,
 			     struct sk_buff *skb,
 			     u16 vlan_tag)
 {
+	/* FW can send a single MTU sized packet from gro flow
+	 * due to aggregation timeout/last segment etc. which
+	 * is not expected to be a gro packet. If a skb has zero
+	 * frags then simply push it in the stack as non gso skb.
+	 */
+	if (unlikely(!skb->data_len)) {
+		skb_shinfo(skb)->gso_type = 0;
+		skb_shinfo(skb)->gso_size = 0;
+		goto send_skb;
+	}
+
 #ifdef CONFIG_INET
 	if (skb_shinfo(skb)->gso_size) {
 		skb_set_network_header(skb, 0);
@@ -1114,6 +1125,8 @@ static void qede_gro_receive(struct qede_dev *edev,
 		}
 	}
 #endif
+
+send_skb:
 	skb_record_rx_queue(skb, fp->rss_id);
 	qede_skb_receive(edev, fp, skb, vlan_tag);
 }
-- 
2.7.2

^ permalink raw reply related

* [PATCH net-next 2/3] qede: Fix setting Skb network header
From: Manish Chopra @ 2016-04-18 17:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1460999167-3746-1-git-send-email-manish.chopra@qlogic.com>

Skb's network header needs to be set before extracting IPv4/IPv6
headers from it.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 4d76ac0..8208e1f 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1068,7 +1068,6 @@ static void qede_gro_ip_csum(struct sk_buff *skb)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct tcphdr *th;
 
-	skb_set_network_header(skb, 0);
 	skb_set_transport_header(skb, sizeof(struct iphdr));
 	th = tcp_hdr(skb);
 
@@ -1083,7 +1082,6 @@ static void qede_gro_ipv6_csum(struct sk_buff *skb)
 	struct ipv6hdr *iph = ipv6_hdr(skb);
 	struct tcphdr *th;
 
-	skb_set_network_header(skb, 0);
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 	th = tcp_hdr(skb);
 
@@ -1100,6 +1098,8 @@ static void qede_gro_receive(struct qede_dev *edev,
 {
 #ifdef CONFIG_INET
 	if (skb_shinfo(skb)->gso_size) {
+		skb_set_network_header(skb, 0);
+
 		switch (skb->protocol) {
 		case htons(ETH_P_IP):
 			qede_gro_ip_csum(skb);
-- 
2.7.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox