Netdev List

* Re: [PATCH bpf-next v5 0/7] bpf: implement BPF_TASK_FD_QUERY
From: Daniel Borkmann @ 2018-05-25  0:27 UTC
  To: Yonghong Song, peterz, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180524182111.454612-1-yhs@fb.com>

On 05/24/2018 08:21 PM, Yonghong Song wrote:
> Currently, suppose a userspace application has loaded a bpf program
> and attached it to a tracepoint/kprobe/uprobe, and a bpf
> introspection tool, e.g., bpftool, wants to show which bpf program
> is attached to which tracepoint/kprobe/uprobe. Such attachment
> information will be really useful to understand the overall bpf
> deployment in the system.
> 
> There is a name field (16 bytes) for each program, which could
> be used to encode the attachment point. There are some drawbacks
> to this approach. First, a bpftool user (e.g., an admin) may not
> really understand the association between the name and the
> attachment point. Second, if one program is attached to multiple
> places, encoding a proper name which can imply all these
> attachments becomes difficult.
> 
> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
> Given a pid and fd, this command will return bpf related information
> to user space. Right now it only supports tracepoint/kprobe/uprobe
> perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
>    . prog_id
>    . tracepoint name, or
>    . k[ret]probe funcname + offset or kernel addr, or
>    . u[ret]probe filename + offset
> to the userspace.
> The user can use "bpftool prog" to find more information about
> bpf program itself with prog_id.
> 
> Patch #1 adds function perf_get_event() in kernel/events/core.c.
> Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
> Patch #3 syncs the tools bpf.h header and also adds bpf_task_fd_query()
> in the libbpf library for samples/selftests/bpftool to use.
> Patch #4 adds the ksym_get_addr() utility function.
> Patch #5 adds a test in samples/bpf for querying k[ret]probes and
> u[ret]probes.
> Patch #6 adds a test in tools/testing/selftests/bpf for querying
> raw_tracepoint and tracepoint.
> Patch #7 adds a new subcommand "perf" to bpftool.
> 
> Changelogs:
>   v4 -> v5:
>      . return strlen(buf) instead of strlen(buf) + 1 
>        in the attr.buf_len. As long as user provides
>        non-empty buffer, it will be filled with an empty
>        string, a truncated string, or the full string
>        based on the buffer size and the length of
>        to-be-copied string.
>   v3 -> v4:
>      . made attr buf_len input/output. The length of
>        actual buffer is written to buf_len so user space knows
>        what is actually needed. If user provides a buffer
>        with length >= 1 but less than required, do partial
>        copy and return -ENOSPC.
>      . code simplification with put_user.
>      . changed query result attach_info to fd_type.
>      . add tests at selftests/bpf to test zero len, null buf and
>        insufficient buf.
>   v2 -> v3:
>      . made perf_get_event() return perf_event pointer const.
>        this was to ensure that event fields are not meddled.
>      . detect whether the new BPF_TASK_FD_QUERY is supported or
>        not in "bpftool perf" and warn users if it is not.
>   v1 -> v2:
>      . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
>        to BPF_TASK_FD_QUERY.
>      . fixed various "bpftool perf" issues and added documentation
>        and auto-completion.
> 
> Yonghong Song (7):
>   perf/core: add perf_get_event() to return perf_event given a struct
>     file
>   bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
>   tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in
>     libbpf
>   tools/bpf: add ksym_get_addr() in trace_helpers
>   samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
>   tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
>   tools/bpftool: add perf subcommand
> 
>  include/linux/perf_event.h                       |   5 +
>  include/linux/trace_events.h                     |  17 +
>  include/uapi/linux/bpf.h                         |  26 ++
>  kernel/bpf/syscall.c                             | 131 ++++++++
>  kernel/events/core.c                             |   8 +
>  kernel/trace/bpf_trace.c                         |  48 +++
>  kernel/trace/trace_kprobe.c                      |  29 ++
>  kernel/trace/trace_uprobe.c                      |  22 ++
>  samples/bpf/Makefile                             |   4 +
>  samples/bpf/task_fd_query_kern.c                 |  19 ++
>  samples/bpf/task_fd_query_user.c                 | 382 +++++++++++++++++++++++
>  tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +++++
>  tools/bpf/bpftool/Documentation/bpftool.rst      |   5 +-
>  tools/bpf/bpftool/bash-completion/bpftool        |   9 +
>  tools/bpf/bpftool/main.c                         |   3 +-
>  tools/bpf/bpftool/main.h                         |   1 +
>  tools/bpf/bpftool/perf.c                         | 246 +++++++++++++++
>  tools/include/uapi/linux/bpf.h                   |  26 ++
>  tools/lib/bpf/bpf.c                              |  23 ++
>  tools/lib/bpf/bpf.h                              |   3 +
>  tools/testing/selftests/bpf/test_progs.c         | 158 ++++++++++
>  tools/testing/selftests/bpf/trace_helpers.c      |  12 +
>  tools/testing/selftests/bpf/trace_helpers.h      |   1 +
>  23 files changed, 1257 insertions(+), 2 deletions(-)
>  create mode 100644 samples/bpf/task_fd_query_kern.c
>  create mode 100644 samples/bpf/task_fd_query_user.c
>  create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
>  create mode 100644 tools/bpf/bpftool/perf.c

LGTM, series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
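
For readers following the v4 -> v5 changelog quoted above, here is a minimal
userspace sketch of the buffer rules as described there: buf_len is
input/output, the reported length is strlen() of the full string, and a
too-small buffer gets a truncated, NUL-terminated copy plus -ENOSPC. The
function name model_fill_buf() is hypothetical, and returning 0 for a NULL
or zero-length buffer is an assumption of this model, not something taken
from the kernel code.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical model of the buffer rules in the changelog; not kernel code.
 * On entry *buf_len is the size of buf; on return it holds strlen(src).
 */
static int model_fill_buf(const char *src, char *buf, unsigned int *buf_len)
{
	unsigned int len = (unsigned int)strlen(src);
	unsigned int input_len = *buf_len;

	/* Always report the full string length, without the trailing NUL. */
	*buf_len = len;

	/* Assumption: with no usable buffer, only the length is reported. */
	if (!buf || input_len == 0)
		return 0;

	if (input_len >= len + 1) {
		/* The buffer holds the whole string and its NUL. */
		memcpy(buf, src, len + 1);
		return 0;
	}

	/* Partial copy, always NUL-terminated, flagged with -ENOSPC. */
	memcpy(buf, src, input_len - 1);
	buf[input_len - 1] = '\0';
	return -ENOSPC;
}
```

With a 4-byte buffer and the string "sys_enter", this model stores "sys",
sets buf_len to 9 and returns -ENOSPC, so the caller knows that a 10-byte
buffer is needed for the full string.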

* Re: [PATCH net] vhost: synchronize IOTLB message with dev cleanup
From: Michael S. Tsirkin @ 2018-05-25  0:33 UTC
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1526990337-24892-1-git-send-email-jasowang@redhat.com>

On Tue, May 22, 2018 at 07:58:57PM +0800, Jason Wang wrote:
> DaeRyong Jeong reports a race between vhost_dev_cleanup() and
> vhost_process_iotlb_msg():
> 
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====						=====
> 						vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 						dev->iotlb = NULL;
> 
> The reason is that we don't synchronize between them; fix this by
> protecting vhost_process_iotlb_msg() with the dev mutex.
> 
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

We should think of a way to have a per-vq lock here, but for now:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/vhost/vhost.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..f0be5f3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  {
>  	int ret = 0;
>  
> +	mutex_lock(&dev->mutex);
>  	vhost_dev_lock_vqs(dev);
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
> @@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  	}
>  
>  	vhost_dev_unlock_vqs(dev);
> +	mutex_unlock(&dev->mutex);
> +
>  	return ret;
>  }
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> -- 
> 2.7.4
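
The idea behind the fix can be sketched in userspace with a pthread mutex.
This is only a model: struct model_dev and both functions are hypothetical
stand-ins, and it assumes the cleanup path runs under the same mutex as the
message handler, which is what makes the NULL check and the use of the
pointer atomic with respect to the teardown.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device: one mutex guarding one pointer. */
struct model_dev {
	pthread_mutex_t mutex;
	int *iotlb;
};

/* Message path: the check and the use happen under the same lock. */
static int model_process_msg(struct model_dev *dev, int value)
{
	int ret = 0;

	pthread_mutex_lock(&dev->mutex);
	if (!dev->iotlb)
		ret = -EFAULT;		/* already torn down */
	else
		*dev->iotlb = value;	/* cannot be freed while we hold the lock */
	pthread_mutex_unlock(&dev->mutex);

	return ret;
}

/* Cleanup path: free and clear the pointer under the same lock. */
static void model_cleanup(struct model_dev *dev)
{
	pthread_mutex_lock(&dev->mutex);
	free(dev->iotlb);
	dev->iotlb = NULL;
	pthread_mutex_unlock(&dev->mutex);
}
```

Without the lock in model_process_msg(), the cleanup could free the object
between the NULL check and the store, which is the interleaving shown in
the report.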

* Re: [PATCH v3 net] stmmac: Added support for 802.1ad S-TAG stripping
From: Toshiaki Makita @ 2018-05-25  0:34 UTC
  To: Elad Nachman, Jose Abreu, Florian Fainelli, David Miller
  Cc: netdev, peppe.cavallaro, alexandre.torgue
In-Reply-To: <c9c605d9-dff6-4909-e90f-e3b7e179edb6@gmail.com>

On 2018/05/25 1:56, Elad Nachman wrote:
> stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before calling napi_gro_receive().
> 
> The function assumes VLAN tagged frames are always tagged with 802.1Q protocol,
> and assigns ETH_P_8021Q to the skb by hard-coding the parameter in the call to __vlan_hwaccel_put_tag().
> 
> This causes packets not to be passed to the VLAN slave if it was created with 802.1AD protocol
> (ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).
> 
> This fix passes the protocol from the VLAN header into __vlan_hwaccel_put_tag()
> instead of using the hard-coded value of ETH_P_8021Q.
> NETIF_F_HW_VLAN_STAG_RX was added to the net device features to reflect this new support.
> 
> Signed-off-by: Elad Nachman <eladn@gilat.com>
> 
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index b65e2d1..2d2f37f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3293,17 +3293,19 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct ethhdr *ehdr;
> +	struct vlan_ethhdr *veth;
>  	u16 vlanid;
> +	__be16 vlan_proto;
>  
> -	if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
> -	    NETIF_F_HW_VLAN_CTAG_RX &&
> +	if ((dev->features & (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX)) ==
> +	    (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX) &&

This condition is not correct: with it, CTAG is not stripped when
HW_VLAN_STAG_RX is disabled, even if HW_VLAN_CTAG_RX is enabled.

The correct behavior is stripping CTAG when CTAG_RX is enabled and
stripping STAG when STAG_RX is enabled, so this code cannot be
protocol-agnostic. I suggested handling only CTAG in this driver because
I thought adding STAG support would make this unnecessarily complicated.

But I have now noticed that this driver does not seem able to toggle
CTAG_RX or STAG_RX, because hw_features does not include them. So this
code should work even if the condition is wrong. But then why do we need
to check whether dev->features includes CTAG_RX here in the first place?
It's always included. Removing this check seems sufficient.
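
To illustrate the per-protocol rule described above, here is a minimal
standalone sketch. The MODEL_F_* bits are made-up stand-ins for the
NETIF_F_HW_VLAN_*_RX feature flags, model_should_strip() is a hypothetical
helper, and the TPID is assumed to be in host byte order; only the two
EtherType values are the standard ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel feature bits (values are made up). */
#define MODEL_F_CTAG_RX		(1u << 0)
#define MODEL_F_STAG_RX		(1u << 1)

/* Standard VLAN EtherTypes: 802.1Q C-TAG and 802.1ad S-TAG. */
#define MODEL_ETH_P_8021Q	0x8100
#define MODEL_ETH_P_8021AD	0x88A8

/* Strip a tag only if the offload matching its protocol is enabled. */
static bool model_should_strip(uint32_t features, uint16_t tpid)
{
	switch (tpid) {
	case MODEL_ETH_P_8021Q:
		return (features & MODEL_F_CTAG_RX) != 0;
	case MODEL_ETH_P_8021AD:
		return (features & MODEL_F_STAG_RX) != 0;
	default:
		return false;	/* not a VLAN tag this model knows about */
	}
}
```

With only MODEL_F_CTAG_RX set, a C-TAG frame is stripped and an S-TAG frame
is left alone, whereas a check requiring both bits would strip neither.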

>  	    !__vlan_get_tag(skb, &vlanid)) {
>  		/* pop the vlan tag */
> -		ehdr = (struct ethhdr *)skb->data;
> -		memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
> +		veth = (struct vlan_ethhdr *)skb->data;
> +		vlan_proto = veth->h_vlan_proto;
> +		memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
>  		skb_pull(skb, VLAN_HLEN);
> -		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
> +		__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
>  	}
>  }
>  
> @@ -4344,7 +4346,7 @@ int stmmac_dvr_probe(struct device *device,
>  	ndev->watchdog_timeo = msecs_to_jiffies(watchdog);
>  #ifdef STMMAC_VLAN_TAG_USED
>  	/* Both mac100 and gmac support receive VLAN tag detection */
> -	ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> +	ndev->features |= (NETIF_F_HW_VLAN_CTAG_RX|NETIF_F_HW_VLAN_STAG_RX);
>  #endif
>  	priv->msg_enable = netif_msg_init(debug, default_msg_level);
>  
> 

-- 
Toshiaki Makita

* Re: [Bridge] [PATCH net-next] net: bridge: add support for port isolation
From: Toshiaki Makita @ 2018-05-25  0:47 UTC
  To: Nikolay Aleksandrov, netdev; +Cc: roopa, bridge, davem
In-Reply-To: <20180524085648.5934-1-nikolay@cumulusnetworks.com>

On 2018/05/24 17:56, Nikolay Aleksandrov wrote:
> This patch adds support for a new port flag - BR_ISOLATED. If it is set
> then isolated ports cannot communicate with each other, but they can
> still communicate with non-isolated ports. The same can be achieved via
> ACLs but they can't scale with a large number of ports and also the
> complexity of the rules grows. This feature can be used to achieve
> isolated vlan functionality (similar to pvlan) as well, though currently
> it will be port-wide (for all vlans on the port). The new test in
> should_deliver uses data that is already cache hot and the new boolean
> is used to avoid an additional source port test in should_deliver.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Sometimes I need this kind of configuration and have used vlans for such
cases. I guess that does not scale in your case, so you added this feature.
I wonder whether this kind of feature is common in hardware switches.

FWIW,

Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
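
For reference, the forwarding rule in the commit message reduces to a small
truth table: traffic is blocked only when both the source and destination
ports are isolated. A minimal sketch, using hypothetical helper names and
plain booleans in place of the port flags:

```c
#include <stdbool.h>

/* Blocked only when both ends are isolated (hypothetical helper). */
static bool model_isolated_drop(bool src_isolated, bool dst_isolated)
{
	return src_isolated && dst_isolated;
}

/* Forwarding is allowed in the other three cases. */
static bool model_may_forward(bool src_isolated, bool dst_isolated)
{
	return !model_isolated_drop(src_isolated, dst_isolated);
}
```

So isolated -> non-isolated and non-isolated -> isolated traffic still
flows; only isolated -> isolated is dropped.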

> ---
>  include/linux/if_bridge.h    | 1 +
>  include/uapi/linux/if_link.h | 1 +
>  net/bridge/br_forward.c      | 3 ++-
>  net/bridge/br_input.c        | 1 +
>  net/bridge/br_netlink.c      | 9 ++++++++-
>  net/bridge/br_private.h      | 9 +++++++++
>  net/bridge/br_sysfs_if.c     | 2 ++
>  7 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
> index 585d27182425..7843b98e1c6e 100644
> --- a/include/linux/if_bridge.h
> +++ b/include/linux/if_bridge.h
> @@ -50,6 +50,7 @@ struct br_ip_list {
>  #define BR_VLAN_TUNNEL		BIT(13)
>  #define BR_BCAST_FLOOD		BIT(14)
>  #define BR_NEIGH_SUPPRESS	BIT(15)
> +#define BR_ISOLATED		BIT(16)
>  
>  #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
>  
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index b85266420bfb..cf01b6824244 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -333,6 +333,7 @@ enum {
>  	IFLA_BRPORT_BCAST_FLOOD,
>  	IFLA_BRPORT_GROUP_FWD_MASK,
>  	IFLA_BRPORT_NEIGH_SUPPRESS,
> +	IFLA_BRPORT_ISOLATED,
>  	__IFLA_BRPORT_MAX
>  };
>  #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index 7a7fd672ccf2..9019f326fe81 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -30,7 +30,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>  	vg = nbp_vlan_group_rcu(p);
>  	return ((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev) &&
>  		br_allowed_egress(vg, skb) && p->state == BR_STATE_FORWARDING &&
> -		nbp_switchdev_allowed_egress(p, skb);
> +		nbp_switchdev_allowed_egress(p, skb) &&
> +		!br_skb_isolated(p, skb);
>  }
>  
>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 7f98a7d25866..72074276c088 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -114,6 +114,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>  		goto drop;
>  
>  	BR_INPUT_SKB_CB(skb)->brdev = br->dev;
> +	BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);
>  
>  	if (IS_ENABLED(CONFIG_INET) &&
>  	    (skb->protocol == htons(ETH_P_ARP) ||
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 015f465c514b..9f5eb05b0373 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -139,6 +139,7 @@ static inline size_t br_port_info_size(void)
>  		+ nla_total_size(1)	/* IFLA_BRPORT_PROXYARP_WIFI */
>  		+ nla_total_size(1)	/* IFLA_BRPORT_VLAN_TUNNEL */
>  		+ nla_total_size(1)	/* IFLA_BRPORT_NEIGH_SUPPRESS */
> +		+ nla_total_size(1)	/* IFLA_BRPORT_ISOLATED */
>  		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_ROOT_ID */
>  		+ nla_total_size(sizeof(struct ifla_bridge_id))	/* IFLA_BRPORT_BRIDGE_ID */
>  		+ nla_total_size(sizeof(u16))	/* IFLA_BRPORT_DESIGNATED_PORT */
> @@ -213,7 +214,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>  							BR_VLAN_TUNNEL)) ||
>  	    nla_put_u16(skb, IFLA_BRPORT_GROUP_FWD_MASK, p->group_fwd_mask) ||
>  	    nla_put_u8(skb, IFLA_BRPORT_NEIGH_SUPPRESS,
> -		       !!(p->flags & BR_NEIGH_SUPPRESS)))
> +		       !!(p->flags & BR_NEIGH_SUPPRESS)) ||
> +	    nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!(p->flags & BR_ISOLATED)))
>  		return -EMSGSIZE;
>  
>  	timerval = br_timer_value(&p->message_age_timer);
> @@ -660,6 +662,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
>  	[IFLA_BRPORT_VLAN_TUNNEL] = { .type = NLA_U8 },
>  	[IFLA_BRPORT_GROUP_FWD_MASK] = { .type = NLA_U16 },
>  	[IFLA_BRPORT_NEIGH_SUPPRESS] = { .type = NLA_U8 },
> +	[IFLA_BRPORT_ISOLATED]	= { .type = NLA_U8 },
>  };
>  
>  /* Change the state of the port and notify spanning tree */
> @@ -810,6 +813,10 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
>  	if (err)
>  		return err;
>  
> +	err = br_set_port_flag(p, tb, IFLA_BRPORT_ISOLATED, BR_ISOLATED);
> +	if (err)
> +		return err;
> +
>  	br_port_flags_change(p, old_flags ^ p->flags);
>  	return 0;
>  }
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 742f40aefdaf..11520ed528b0 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -423,6 +423,7 @@ struct br_input_skb_cb {
>  #endif
>  
>  	bool proxyarp_replied;
> +	bool src_port_isolated;
>  
>  #ifdef CONFIG_BRIDGE_VLAN_FILTERING
>  	bool vlan_filtered;
> @@ -574,6 +575,14 @@ int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
>  void br_flood(struct net_bridge *br, struct sk_buff *skb,
>  	      enum br_pkt_type pkt_type, bool local_rcv, bool local_orig);
>  
> +/* return true if both source port and dest port are isolated */
> +static inline bool br_skb_isolated(const struct net_bridge_port *to,
> +				   const struct sk_buff *skb)
> +{
> +	return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
> +	       (to->flags & BR_ISOLATED);
> +}
> +
>  /* br_if.c */
>  void br_port_carrier_check(struct net_bridge_port *p, bool *notified);
>  int br_add_bridge(struct net *net, const char *name);
> diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
> index fd31ad83ec7b..f99c5bf5c906 100644
> --- a/net/bridge/br_sysfs_if.c
> +++ b/net/bridge/br_sysfs_if.c
> @@ -192,6 +192,7 @@ BRPORT_ATTR_FLAG(proxyarp_wifi, BR_PROXYARP_WIFI);
>  BRPORT_ATTR_FLAG(multicast_flood, BR_MCAST_FLOOD);
>  BRPORT_ATTR_FLAG(broadcast_flood, BR_BCAST_FLOOD);
>  BRPORT_ATTR_FLAG(neigh_suppress, BR_NEIGH_SUPPRESS);
> +BRPORT_ATTR_FLAG(isolated, BR_ISOLATED);
>  
>  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
>  static ssize_t show_multicast_router(struct net_bridge_port *p, char *buf)
> @@ -243,6 +244,7 @@ static const struct brport_attribute *brport_attrs[] = {
>  	&brport_attr_broadcast_flood,
>  	&brport_attr_group_fwd_mask,
>  	&brport_attr_neigh_suppress,
> +	&brport_attr_isolated,
>  	NULL
>  };
>  
> 

-- 
Toshiaki Makita

* [PATCH net-next v5 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC
  To: netdev, pshelar; +Cc: Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
VMs and containers that reside in the same namespace share the same
conntrack table, and the total number of conntrack entries for all of
them is limited by nf_conntrack_max.  In this case, if one VM or
container abuses its usage of conntrack entries, it blocks the others
from committing valid conntrack entries into the conntrack table.  Even
if we put the VMs in different network namespaces, the current
nf_conntrack_max configuration is rigid: we cannot give different
VMs/containers different limits on the number of conntrack entries.

To address the aforementioned issue, this patch proposes a
fine-grained mechanism that can further limit the number of conntrack
entries per zone.  For example, we can assign a different zone to each
VM and set a conntrack limit on each zone.  With this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone and
does not influence other well-behaved VMs.  Moreover, users can
set a different conntrack limit for each zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS will return ENOMEM to
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero, which means no limit; this is backward compatible with
the behavior without this patch.
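
The limit semantics above can be sketched as a small standalone model.
All names here are hypothetical, a linear array stands in for whatever
lookup structure the implementation uses, and "connections" stands for the
count reported by the counting backend for that zone.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_UNLIMITED 0u	/* a limit of zero means "no limit" */

/* Hypothetical per-zone entry and table. */
struct model_zone_limit {
	uint16_t zone;
	uint32_t limit;
};

struct model_limits {
	uint32_t default_limit;			/* used for unconfigured zones */
	const struct model_zone_limit *zones;
	size_t nzones;
};

/* Return the configured limit for a zone, or the default limit. */
static uint32_t model_limit_get(const struct model_limits *l, uint16_t zone)
{
	for (size_t i = 0; i < l->nzones; i++)
		if (l->zones[i].zone == zone)
			return l->zones[i].limit;

	return l->default_limit;
}

/* Return 0 if the zone is within its limit, -ENOMEM otherwise. */
static int model_check_limit(const struct model_limits *l, uint16_t zone,
			     uint32_t connections)
{
	uint32_t limit = model_limit_get(l, zone);

	if (limit == MODEL_UNLIMITED)
		return 0;

	return connections > limit ? -ENOMEM : 0;
}
```

With a default limit of zero and a limit of 2 on zone 1, a third connection
in zone 1 is rejected with -ENOMEM while every other zone stays unlimited.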

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v4->v5:
  - Addresses comments from Pravin, which include logging an error msg in
    ovs_ct_limit_init(), handling deletion of the default limit, and
    adding a common helper to get the zone limit.
  - Rebases to master.

v3->v4:
  - Addresses comments from Pravin, which include simplifying the netlink
    API and removing unnecessary RCU locking.
  - Rebases to master.

v2->v3:
  - Addresses comments from Pravin, which include using static keys to check
    if the ovs_ct_limit feature is used, only checking ct_limit when a ct
    entry is unconfirmed, and reporting rate-limited warning messages when
    the ct limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  28 ++
 net/openvswitch/Kconfig          |   3 +-
 net/openvswitch/conntrack.c      | 551 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h      |   9 +-
 net/openvswitch/datapath.c       |   7 +-
 net/openvswitch/datapath.h       |   3 +
 6 files changed, 595 insertions(+), 6 deletions(-)

-- 
2.7.4

* [PATCH net-next v5 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1527209803-48274-1-git-send-email-yihung.wei@gmail.com>

Define netlink messages and attributes to support user-kernel
communication for the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 include/uapi/linux/openvswitch.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..863aabaa5cc9 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,32 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+	OVS_CT_LIMIT_CMD_UNSPEC,
+	OVS_CT_LIMIT_CMD_SET,		/* Add or modify ct limit. */
+	OVS_CT_LIMIT_CMD_DEL,		/* Delete ct limit. */
+	OVS_CT_LIMIT_CMD_GET		/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+	OVS_CT_LIMIT_ATTR_UNSPEC,
+	OVS_CT_LIMIT_ATTR_ZONE_LIMIT,	/* Nested struct ovs_zone_limit. */
+	__OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+#define OVS_ZONE_LIMIT_DEFAULT_ZONE -1
+
+struct ovs_zone_limit {
+	int zone_id;
+	__u32 limit;
+	__u32 count;
+};
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

* [PATCH net-next v5 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-25  0:56 UTC
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1527209803-48274-1-git-send-email-yihung.wei@gmail.com>

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
VMs and containers that reside in the same namespace share the same
conntrack table, and the total number of conntrack entries for all of
them is limited by nf_conntrack_max.  In this case, if one VM or
container abuses its usage of conntrack entries, it blocks the others
from committing valid conntrack entries into the conntrack table.  Even
if we put the VMs in different network namespaces, the current
nf_conntrack_max configuration is rigid: we cannot give different
VMs/containers different limits on the number of conntrack entries.

To address the aforementioned issue, this patch proposes a
fine-grained mechanism that can further limit the number of conntrack
entries per zone.  For example, we can assign a different zone to each
VM and set a conntrack limit on each zone.  With this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone and
does not influence other well-behaved VMs.  Moreover, users can
set a different conntrack limit for each zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS will return ENOMEM to
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero, which means no limit; this is backward compatible with
the behavior without this patch.

The following high-level APIs are provided to userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 net/openvswitch/Kconfig     |   3 +-
 net/openvswitch/conntrack.c | 551 +++++++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   3 +
 5 files changed, 567 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 				     (!NF_NAT || NF_NAT) && \
 				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..284aca2a252d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/sctp.h>
+#include <linux/static_key.h>
 #include <net/ip.h>
+#include <net/genetlink.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED	0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+	/* Elements in ovs_ct_limit_info->limits hash table */
+	struct hlist_node hlist_node;
+	struct rcu_head rcu;
+	u16 zone;
+	u32 limit;
+};
+
+struct ovs_ct_limit_info {
+	u32 default_limit;
+	struct hlist_head *limits;
+	struct nf_conncount_data *data;
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+	[OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 	return false;
 }
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+	const struct ovs_ct_limit_info *info, u16 zone)
+{
+	return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+			 struct ovs_ct_limit *new_ct_limit)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == new_ct_limit->zone) {
+			hlist_replace_rcu(&ct_limit->hlist_node,
+					  &new_ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+
+	hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	struct hlist_node *n;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_safe(ct_limit, n, head, hlist_node) {
+		if (ct_limit->zone == zone) {
+			hlist_del_rcu(&ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone)
+			return ct_limit->limit;
+	}
+
+	return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+			      const struct ovs_conntrack_info *info,
+			      const struct nf_conntrack_tuple *tuple)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	u32 per_zone_limit, connections;
+	u32 conncount_key;
+
+	conncount_key = info->zone.id;
+
+	per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+	if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED)
+		return 0;
+
+	connections = nf_conncount_count(net, ct_limit_info->data,
+					 &conncount_key, tuple, &info->zone);
+	if (connections > per_zone_limit)
+		return -ENOMEM;
+
+	return 0;
+}
+#endif
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
@@ -1054,6 +1165,21 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 	if (!ct)
 		return 0;
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	if (static_branch_unlikely(&ovs_ct_limit_enabled)) {
+		if (!nf_ct_is_confirmed(ct)) {
+			err = ovs_ct_check_limit(net, info,
+				&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+			if (err) {
+				net_warn_ratelimited("openvswitch: zone: %u "
+					"exceeds conntrack limit\n",
+					info->zone.id);
+				return err;
+			}
+		}
+	}
+#endif
+
 	/* Set the conntrack event mask if given.  NEW and DELETE events have
 	 * their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
 	 * typically would receive many kinds of updates.  Setting the event
@@ -1655,7 +1781,420 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
 		nf_ct_tmpl_free(ct_info->ct);
 }
 
-void ovs_ct_init(struct net *net)
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+	int i, err;
+
+	ovs_net->ct_limit_info = kmalloc(sizeof(*ovs_net->ct_limit_info),
+					 GFP_KERNEL);
+	if (!ovs_net->ct_limit_info)
+		return -ENOMEM;
+
+	ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+	ovs_net->ct_limit_info->limits =
+		kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+			      GFP_KERNEL);
+	if (!ovs_net->ct_limit_info->limits) {
+		kfree(ovs_net->ct_limit_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+		INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+	ovs_net->ct_limit_info->data =
+		nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+	if (IS_ERR(ovs_net->ct_limit_info->data)) {
+		err = PTR_ERR(ovs_net->ct_limit_info->data);
+		kfree(ovs_net->ct_limit_info->limits);
+		kfree(ovs_net->ct_limit_info);
+		pr_err("openvswitch: failed to init nf_conncount %d\n", err);
+		return err;
+	}
+	return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+	const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+	int i;
+
+	nf_conncount_destroy(net, NFPROTO_INET, info->data);
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		struct hlist_head *head = &info->limits[i];
+		struct ovs_ct_limit *ct_limit;
+
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+			kfree_rcu(ct_limit, rcu);
+	}
+	kfree(ovs_net->ct_limit_info->limits);
+	kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+			     struct ovs_header **ovs_reply_header)
+{
+	struct ovs_header *ovs_header = info->userhdr;
+	struct sk_buff *skb;
+
+	skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+					info->snd_seq,
+					&dp_ct_limit_genl_family, 0, cmd);
+
+	if (!*ovs_reply_header) {
+		nlmsg_free(skb);
+		return ERR_PTR(-EMSGSIZE);
+	}
+	(*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+	return skb;
+}
+
+static bool check_zone_id(int zone_id, u16 *pzone)
+{
+	if (zone_id >= 0 && zone_id <= 65535) {
+		*pzone = (u16)zone_id;
+		return true;
+	}
+	return false;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			ovs_lock();
+			info->default_limit = zone_limit->limit;
+			ovs_unlock();
+		} else if (unlikely(!check_zone_id(
+				zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			struct ovs_ct_limit *ct_limit;
+
+			ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+			if (!ct_limit)
+				return -ENOMEM;
+
+			ct_limit->zone = zone;
+			ct_limit->limit = zone_limit->limit;
+
+			ovs_lock();
+			ct_limit_set(info, ct_limit);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "set zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			ovs_lock();
+			info->default_limit = OVS_CT_LIMIT_DEFAULT;
+			ovs_unlock();
+		} else if (unlikely(!check_zone_id(
+				zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			ovs_lock();
+			ct_limit_del(info, zone);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "del zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+					  struct sk_buff *reply)
+{
+	struct ovs_zone_limit zone_limit;
+	int err;
+
+	zone_limit.zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE;
+	zone_limit.limit = info->default_limit;
+	err = nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int __ovs_ct_limit_get_zone_limit(struct net *net,
+					 struct nf_conncount_data *data,
+					 u16 zone_id, u32 limit,
+					 struct sk_buff *reply)
+{
+	struct nf_conntrack_zone ct_zone;
+	struct ovs_zone_limit zone_limit;
+	u32 conncount_key = zone_id;
+
+	zone_limit.zone_id = zone_id;
+	zone_limit.limit = limit;
+	nf_ct_zone_init(&ct_zone, zone_id, NF_CT_DEFAULT_ZONE_DIR, 0);
+
+	zone_limit.count = nf_conncount_count(net, data, &conncount_key, NULL,
+					      &ct_zone);
+	return nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+				       struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info,
+				       struct sk_buff *reply)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem, err;
+	u32 limit;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id ==
+				OVS_ZONE_LIMIT_DEFAULT_ZONE)) {
+			err = ovs_ct_limit_get_default_limit(info, reply);
+			if (err)
+				return err;
+		} else if (unlikely(!check_zone_id(zone_limit->zone_id,
+							&zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			rcu_read_lock();
+			limit = ct_limit_get(info, zone);
+			rcu_read_unlock();
+
+			err = __ovs_ct_limit_get_zone_limit(
+				net, info->data, zone, limit, reply);
+			if (err)
+				return err;
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "get zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_all_zone_limit(struct net *net,
+					   struct ovs_ct_limit_info *info,
+					   struct sk_buff *reply)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	int i, err = 0;
+
+	err = ovs_ct_limit_get_default_limit(info, reply);
+	if (err)
+		return err;
+
+	rcu_read_lock();
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		head = &info->limits[i];
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+			err = __ovs_ct_limit_get_zone_limit(net, info->data,
+				ct_limit->zone, ct_limit->limit, reply);
+			if (err)
+				goto exit_err;
+		}
+	}
+
+exit_err:
+	rcu_read_unlock();
+	return err;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	static_branch_enable(&ovs_ct_limit_enabled);
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct nlattr *nla_reply;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct net *net = sock_net(skb->sk);
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
+
+	if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = ovs_ct_limit_get_zone_limit(
+			net, a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT], ct_limit_info,
+			reply);
+		if (err)
+			goto exit_err;
+	} else {
+		err = ovs_ct_limit_get_all_zone_limit(net, ct_limit_info,
+						      reply);
+		if (err)
+			goto exit_err;
+	}
+
+	nla_nest_end(reply, nla_reply);
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+	{ .cmd = OVS_CT_LIMIT_CMD_SET,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_set,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_DEL,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_del,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_GET,
+		.flags = 0,		  /* OK for unprivileged users. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_get,
+	},
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+	.name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+	.hdrsize = sizeof(struct ovs_header),
+	.name = OVS_CT_LIMIT_FAMILY,
+	.version = OVS_CT_LIMIT_VERSION,
+	.maxattr = OVS_CT_LIMIT_ATTR_MAX,
+	.netnsok = true,
+	.parallel_ops = true,
+	.ops = ct_limit_genl_ops,
+	.n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+	.mcgrps = &ovs_ct_limit_multicast_group,
+	.n_mcgrps = 1,
+	.module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
 {
 	unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2205,22 @@ void ovs_ct_init(struct net *net)
 	} else {
 		ovs_net->xt_label = true;
 	}
+
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	return ovs_ct_limit_init(net, ovs_net);
+#else
+	return 0;
+#endif
 }
 
 void ovs_ct_exit(struct net *net)
 {
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	ovs_ct_limit_exit(net, ovs_net);
+#endif
+
 	if (ovs_net->xt_label)
 		nf_connlabels_put(net);
 }
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
 #include "flow.h"
 
 struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
 enum ovs_key_attr;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
 void ovs_ct_exit(struct net *);
 bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
 int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
 #else
 #include <linux/errno.h>
 
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
 
 static inline void ovs_ct_exit(struct net *net) { }
 
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
 
 #define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
 	&dp_flow_genl_family,
 	&dp_packet_genl_family,
 	&dp_meter_genl_family,
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	&dp_ct_limit_genl_family,
+#endif
 };
 
 static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
 
 	INIT_LIST_HEAD(&ovs_net->dps);
 	INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
-	ovs_ct_init(net);
-	return 0;
+	return ovs_ct_init(net);
 }
 
 static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..c9eb267c6f7e 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,9 @@ struct dp_upcall_info {
 struct ovs_net {
 	struct list_head dps;
 	struct work_struct dp_notify_work;
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	struct ovs_ct_limit_info *ct_limit_info;
+#endif
 
 	/* Module reference for configuring conntrack. */
 	bool xt_label;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: Daniel Borkmann @ 2018-05-25  0:59 UTC (permalink / raw)
  To: Andrey Ignatov, netdev; +Cc: davem, kafai, ast, kernel-team
In-Reply-To: <ba05de3cd3b6af6d3300d9c5623976f4aec161b0.1527031931.git.rdna@fb.com>

On 05/23/2018 01:40 AM, Andrey Ignatov wrote:
[...]
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ff4d4ba..a1f9ba2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -900,6 +900,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  {
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct udp_sock *up = udp_sk(sk);
> +	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
>  	struct flowi4 fl4_stack;
>  	struct flowi4 *fl4;
>  	int ulen = len;
> @@ -954,8 +955,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	/*
>  	 *	Get and verify the address.
>  	 */
> -	if (msg->msg_name) {
> -		DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
> +	if (usin) {
>  		if (msg->msg_namelen < sizeof(*usin))
>  			return -EINVAL;
>  		if (usin->sin_family != AF_INET) {
> @@ -1009,6 +1009,22 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		rcu_read_unlock();
>  	}
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> +					    (struct sockaddr *)usin, &ipc.addr);
> +		if (err)
> +			goto out_free;
> +		if (usin) {
> +			if (usin->sin_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_free;
> +			}
> +			daddr = usin->sin_addr.s_addr;
> +			dport = usin->sin_port;
> +		}
> +	}
> +
>  	saddr = ipc.addr;
>  	ipc.addr = faddr = daddr;
>  
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2839c1b..67c44b5 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1315,6 +1315,29 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		fl6.saddr = np->saddr;
>  	fl6.fl6_sport = inet->inet_sport;
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
> +					   (struct sockaddr *)sin6, &fl6.saddr);
> +		if (err)
> +			goto out_no_dst;
> +		if (sin6) {
> +			if (ipv6_addr_v4mapped(&sin6->sin6_addr)) {
> +				/* BPF program rewrote IPv6-only by IPv4-mapped
> +				 * IPv6. It's currently unsupported.
> +				 */
> +				err = -ENOTSUPP;
> +				goto out_no_dst;
> +			}
> +			if (sin6->sin6_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_no_dst;
> +			}
> +			fl6.fl6_dport = sin6->sin6_port;
> +			fl6.daddr = sin6->sin6_addr;
> +		}

Hmm, this extra work here and in v4 case should probably all be done under
the static key? Otherwise we'll do the extra work for checking sin6 and
setting up fl6 twice? Also, when not enabled, couldn't we run into the case
of ipv6_addr_v4mapped() as well? If I'm spotting this right, then we would
bail out though we shouldn't normally?

> +	}
> +
>  	final_p = fl6_update_dst(&fl6, opt, &final);
>  	if (final_p)
>  		connected = false;
> @@ -1394,6 +1417,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  
>  out:
>  	dst_release(dst);
> +out_no_dst:
>  	fl6_sock_release(flowlabel);
>  	txopt_put(opt_to_free);
>  	if (!err)
> 

^ permalink raw reply

* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Jakub Kicinski @ 2018-05-25  1:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
	alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
	oss-drivers, Christoph Hellwig
In-Reply-To: <20180524235748.GD15320@bhelgaas-glaptop.roam.corp.google.com>

Hi Bjorn!

On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > Some user space depends on enabling sriov_totalvfs number of VFs
> > to not fail, e.g.:
> > 
> > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > 
> > For devices whose VF support depends on the loaded FW we have the
> > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > to 0.  Remove the special values completely and simply initialize
> > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > Add a helper for drivers to reset the VF limit back to total.  
> 
> I still can't really make sense out of the changelog.
>
> I think part of the reason it's confusing is because there are two
> things going on:
> 
>   1) You want this:
>   
>        pci_sriov_set_totalvfs(dev, 0);
>        x = pci_sriov_get_totalvfs(dev) 
> 
>      to return 0 instead of total_VFs.  That seems to connect with
>      your subject line.  It means "sriov_totalvfs" in sysfs could be
>      0, but I don't know how that is useful (I'm sure it is; just
>      educate me :))

Let me just quote the bug report that got filed on our internal bug
tracker :)

  When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
  errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
  then tries to set that as the sriov_numvfs parameter.

  For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
  but it's set to max.  When FW is switched to flower*, the correct 
  sriov_totalvfs value is presented.

* flower is a project name

My understanding is OpenStack uses sriov_totalvfs to determine how many
VFs can be enabled, looks like this is the code:

http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464

>   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
>      sure what you intend for this.  Is *every* driver supposed to
>      call it in .remove()?  Could/should this be done in the core
>      somehow instead of depending on every driver?

Good question, I was just thinking yesterday we may want to call it
from the core, but I don't think it's strictly necessary nor always
sufficient (we may reload FW without re-probing).

We have a device which supports a different number of VFs depending on
the FW loaded.  Some legacy FWs do not inform the driver how many VFs they
can support, because they support the max.  So the flow in our driver is this:

load_fw(dev);
...
max_vfs = ask_fw_for_max_vfs(dev);
if (max_vfs >= 0)
	return pci_sriov_set_totalvfs(dev, max_vfs);
else /* FW didn't tell us, assume max */
	return pci_sriov_reset_totalvfs(dev); 

We also reset the max on device remove, but that's not strictly
necessary.

Other users of pci_sriov_set_totalvfs() always know the value to set
the total to (either always get it from FW or it's a constant).

If you prefer we can work out the correct max for those legacy cases in
the driver as well, although it seemed cleaner to just ask the core,
since it already has total_VFs value handy :)
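
To make the intended semantics concrete, here is a minimal user-space
model of the flow above.  It is only a sketch: struct sriov_model and the
model_* helpers are hypothetical stand-ins, not the PCI core's code.  The
one assumption taken from the changelog is that driver_max_VFs starts out
equal to total_VFs, so that 0 becomes an ordinary limit instead of "unset".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device SR-IOV state kept by the core. */
struct sriov_model {
	uint16_t total_vfs;      /* TotalVFs advertised by the device */
	uint16_t driver_max_vfs; /* limit requested by the driver */
};

/* Assumed core init: the limit starts at total_vfs, not at a special 0. */
void model_sriov_init(struct sriov_model *m, uint16_t total_vfs)
{
	m->total_vfs = total_vfs;
	m->driver_max_vfs = total_vfs;
}

/* Models pci_sriov_set_totalvfs(): 0 is accepted like any other limit. */
int model_set_totalvfs(struct sriov_model *m, int numvfs)
{
	if (numvfs < 0 || numvfs > m->total_vfs)
		return -EINVAL;
	m->driver_max_vfs = (uint16_t)numvfs;
	return 0;
}

/* Models pci_sriov_get_totalvfs(): always reports the driver limit. */
int model_get_totalvfs(const struct sriov_model *m)
{
	return m->driver_max_vfs;
}

/* Models the proposed reset helper: back to the hardware total. */
int model_reset_totalvfs(struct sriov_model *m)
{
	m->driver_max_vfs = m->total_vfs;
	return 0;
}

/* Driver flow from this mail; fw_max_vfs < 0 means "FW did not tell us". */
int model_driver_probe(struct sriov_model *m, int fw_max_vfs)
{
	if (fw_max_vfs >= 0)
		return model_set_totalvfs(m, fw_max_vfs);
	return model_reset_totalvfs(m);
}

/* Self-check covering the SR-IOV incapable FW and the legacy FW cases. */
void model_sriov_selfcheck(void)
{
	struct sriov_model m;

	model_sriov_init(&m, 64);
	assert(model_driver_probe(&m, 0) == 0 && model_get_totalvfs(&m) == 0);
	assert(model_driver_probe(&m, -1) == 0 && model_get_totalvfs(&m) == 64);
}
```

With this model, user space that copies the reported total into
sriov_numvfs would ask for 0 VFs on SR-IOV incapable FW rather than for
the hardware maximum.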

> I'm also having a hard time connecting your user-space command example
> with the rest of this.  Maybe it will make more sense to me tomorrow
> after some coffee.

OpenStack assumes it will always be able to set sriov_numvfs to
sriov_totalvfs, see this 'if':

http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512

I tried to morph that into a concise bash command, but clearly failed.
Sorry about the lack of clarity! :(

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/7] bpf: implement BPF_TASK_FD_QUERY
From: Alexei Starovoitov @ 2018-05-25  1:32 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song, peterz, netdev; +Cc: kernel-team
In-Reply-To: <fde7c055-9409-e7de-1576-acd17e65dae1@iogearbox.net>

On 5/24/18 5:27 PM, Daniel Borkmann wrote:
> On 05/24/2018 08:21 PM, Yonghong Song wrote:
>> Currently, suppose a userspace application has loaded a bpf program
>> and attached it to a tracepoint/kprobe/uprobe, and a bpf
>> introspection tool, e.g., bpftool, wants to show which bpf program
>> is attached to which tracepoint/kprobe/uprobe. Such attachment
>> information will be really useful to understand the overall bpf
>> deployment in the system.
>>
>> There is a name field (16 bytes) for each program, which could
>> be used to encode the attachment point. There are some drawbacks
>> to this approach. First, a bpftool user (e.g., an admin) may not
>> really understand the association between the name and the
>> attachment point. Second, if one program is attached to multiple
>> places, encoding a proper name which can imply all these
>> attachments becomes difficult.
>>
>> This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
>> Given a pid and fd, this command will return bpf related information
>> to user space. Right now it only supports tracepoint/kprobe/uprobe
>> perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
>>    . prog_id
>>    . tracepoint name, or
>>    . k[ret]probe funcname + offset or kernel addr, or
>>    . u[ret]probe filename + offset
>> to the userspace.
>> The user can use "bpftool prog" to find more information about
>> bpf program itself with prog_id.
>>
>> Patch #1 adds function perf_get_event() in kernel/events/core.c.
>> Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
>> Patch #3 syncs tools bpf.h header and also add bpf_task_fd_query()
>> in the libbpf library for samples/selftests/bpftool to use.
>> Patch #4 adds ksym_get_addr() utility function.
>> Patch #5 add a test in samples/bpf for querying k[ret]probes and
>> u[ret]probes.
>> Patch #6 add a test in tools/testing/selftests/bpf for querying
>> raw_tracepoint and tracepoint.
>> Patch #7 add a new subcommand "perf" to bpftool.
>>
>> Changelogs:
>>   v4 -> v5:
>>      . return strlen(buf) instead of strlen(buf) + 1
>>        in the attr.buf_len. As long as user provides
>>        non-empty buffer, it will be filled with empty
>>        string, truncated string, or full string
>>        based on the buffer size and the length of
>>        to-be-copied string.
>>   v3 -> v4:
>>      . made attr buf_len input/output. The length of
>>        actual buffer is written to buf_len so user space knows
>>        what is actually needed. If user provides a buffer
>>        with length >= 1 but less than required, do partial
>>        copy and return -ENOSPC.
>>      . code simplification with put_user.
>>      . changed query result attach_info to fd_type.
>>      . add tests at selftests/bpf to test zero len, null buf and
>>        insufficient buf.
>>   v2 -> v3:
>>      . made perf_get_event() return perf_event pointer const.
>>        this was to ensure that event fields are not meddled.
>>      . detect whether newly BPF_TASK_FD_QUERY is supported or
>>        not in "bpftool perf" and warn users if it is not.
>>   v1 -> v2:
>>      . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
>>        to BPF_TASK_FD_QUERY.
>>      . fixed various "bpftool perf" issues and added documentation
>>        and auto-completion.
>>
>> Yonghong Song (7):
>>   perf/core: add perf_get_event() to return perf_event given a struct
>>     file
>>   bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
>>   tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in
>>     libbpf
>>   tools/bpf: add ksym_get_addr() in trace_helpers
>>   samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
>>   tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
>>   tools/bpftool: add perf subcommand
>>
>>  include/linux/perf_event.h                       |   5 +
>>  include/linux/trace_events.h                     |  17 +
>>  include/uapi/linux/bpf.h                         |  26 ++
>>  kernel/bpf/syscall.c                             | 131 ++++++++
>>  kernel/events/core.c                             |   8 +
>>  kernel/trace/bpf_trace.c                         |  48 +++
>>  kernel/trace/trace_kprobe.c                      |  29 ++
>>  kernel/trace/trace_uprobe.c                      |  22 ++
>>  samples/bpf/Makefile                             |   4 +
>>  samples/bpf/task_fd_query_kern.c                 |  19 ++
>>  samples/bpf/task_fd_query_user.c                 | 382 +++++++++++++++++++++++
>>  tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +++++
>>  tools/bpf/bpftool/Documentation/bpftool.rst      |   5 +-
>>  tools/bpf/bpftool/bash-completion/bpftool        |   9 +
>>  tools/bpf/bpftool/main.c                         |   3 +-
>>  tools/bpf/bpftool/main.h                         |   1 +
>>  tools/bpf/bpftool/perf.c                         | 246 +++++++++++++++
>>  tools/include/uapi/linux/bpf.h                   |  26 ++
>>  tools/lib/bpf/bpf.c                              |  23 ++
>>  tools/lib/bpf/bpf.h                              |   3 +
>>  tools/testing/selftests/bpf/test_progs.c         | 158 ++++++++++
>>  tools/testing/selftests/bpf/trace_helpers.c      |  12 +
>>  tools/testing/selftests/bpf/trace_helpers.h      |   1 +
>>  23 files changed, 1257 insertions(+), 2 deletions(-)
>>  create mode 100644 samples/bpf/task_fd_query_kern.c
>>  create mode 100644 samples/bpf/task_fd_query_user.c
>>  create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
>>  create mode 100644 tools/bpf/bpftool/perf.c
>
> LGTM, series:
>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Applied to bpf-next, Thanks everyone.
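
For anyone reading the v3 -> v4 and v4 -> v5 changelog entries quoted
above, the buf_len behaviour can be modelled in user space roughly as
follows.  This is a sketch based only on that description, not the kernel
code: model_query_copy() is a hypothetical name, and plain memcpy() stands
in for the copy to user space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the copy-out step.
 * src:     string the kernel wants to report (may be NULL)
 * ubuf:    caller's buffer (may be NULL)
 * buf_len: in = size of ubuf, out = strlen(src), without the NUL
 */
int model_query_copy(const char *src, char *ubuf, size_t *buf_len)
{
	size_t len = src ? strlen(src) : 0;
	size_t input_len = *buf_len;
	int err = 0;

	/* v5: report strlen(buf), not strlen(buf) + 1. */
	*buf_len = len;

	if (input_len && ubuf) {
		if (!len) {
			/* Nothing to copy: hand back an empty string. */
			ubuf[0] = '\0';
		} else if (input_len >= len + 1) {
			/* Buffer holds the full string plus terminator. */
			memcpy(ubuf, src, len + 1);
		} else {
			/* Too small: truncated, terminated copy and -ENOSPC. */
			memcpy(ubuf, src, input_len - 1);
			ubuf[input_len - 1] = '\0';
			err = -ENOSPC;
		}
	}
	return err;
}

/* Self-check: a 4-byte buffer gets "sys" and learns it needs 9 + 1. */
void model_query_selfcheck(void)
{
	char small[4];
	size_t len = sizeof(small);

	assert(model_query_copy("sys_enter", small, &len) == -ENOSPC);
	assert(len == 9 && strcmp(small, "sys") == 0);
}
```

A caller that gets -ENOSPC can retry with a buffer of buf_len + 1 bytes.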

^ permalink raw reply

* linux-next: manual merge of the scsi tree with the net-next tree
From: Mark Brown @ 2018-05-25  1:38 UTC (permalink / raw)
  To: James Bottomley, Chad Dupuis, Martin K. Petersen, linux-scsi,
	David S. Miller, netdev
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1443 bytes --]

Hi James,

Today's linux-next merge of the scsi tree got a conflict in:

  drivers/scsi/qedf/qedf.h

between commit:

  8673daf4f55bf3b91 ("qedf: Add get_generic_tlv_data handler.")

from the net-next tree and commit:

  4b9b7fabb39b3e9d7 ("scsi: qedf: Improve firmware debug dump handling")

from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

diff --cc drivers/scsi/qedf/qedf.h
index cabb6af60fb8,2372a40326f8..000000000000
--- a/drivers/scsi/qedf/qedf.h
+++ b/drivers/scsi/qedf/qedf.h
@@@ -501,9 -499,8 +504,10 @@@ extern int qedf_post_io_req(struct qedf
  extern void qedf_process_seq_cleanup_compl(struct qedf_ctx *qedf,
  	struct fcoe_cqe *cqe, struct qedf_ioreq *io_req);
  extern int qedf_send_flogi(struct qedf_ctx *qedf);
 +extern void qedf_get_protocol_tlv_data(void *dev, void *data);
  extern void qedf_fp_io_handler(struct work_struct *work);
 +extern void qedf_get_generic_tlv_data(void *dev, struct qed_generic_tlvs *data);
+ extern void qedf_wq_grcdump(struct work_struct *work);
  
  #define FCOE_WORD_TO_BYTE  4
  #define QEDF_MAX_TASK_NUM	0xFFFF

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: Check for kvzalloc allocation failure
From: YueHaibing @ 2018-05-25  1:39 UTC (permalink / raw)
  To: David Miller; +Cc: ganeshgr, linux-kernel, netdev
In-Reply-To: <20180524.110743.522760687215216591.davem@davemloft.net>

On 2018/5/24 23:07, David Miller wrote:
> From: YueHaibing <yuehaibing@huawei.com>
> Date: Tue, 22 May 2018 15:07:18 +0800
> 
>> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> index 130d1ee..019cffe 100644
>> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> @@ -4135,6 +4135,10 @@ static int adap_init0(struct adapter *adap)
>>  		 * card
>>  		 */
>>  		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
>> +		if (!card_fw) {
>> +			ret = -ENOMEM;
>> +			goto bye;
>> +		}
>>  
> 
> On error, this leaks fw_info.

Hi David,

I checked: fw_info is an element of fw_info_array, so none of the members of struct fw_info need to be freed.

It looks like this:

static struct fw_info fw_info_array[] = {
	{
		.chip = CHELSIO_T4,
		.fs_name = FW4_CFNAME,
		.fw_mod_name = FW4_FNAME,
		.fw_hdr = {
			.chip = FW_HDR_CHIP_T4,
			.fw_ver = __cpu_to_be32(FW_VERSION(T4)),
			.intfver_nic = FW_INTFVER(T4, NIC),
			.intfver_vnic = FW_INTFVER(T4, VNIC),
			.intfver_ri = FW_INTFVER(T4, RI),
			.intfver_iscsi = FW_INTFVER(T4, ISCSI),
			.intfver_fcoe = FW_INTFVER(T4, FCOE),
		},
	}, {
		........

Am I missing something?
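
The ownership argument can be shown with a small stand-alone sketch.
Everything here is hypothetical (struct fw_info_model, the chip ids, the
placeholder file names, and find_fw_info_model()); it only illustrates
that a lookup returning a pointer into a static table hands out memory
that nobody allocated, so there is nothing to free on the error path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct fw_info. */
struct fw_info_model {
	int chip;
	const char *fw_mod_name;
};

/* Static table: storage lives for the lifetime of the program. */
static const struct fw_info_model fw_info_array_model[] = {
	{ .chip = 4, .fw_mod_name = "example-t4.bin" }, /* placeholder name */
	{ .chip = 5, .fw_mod_name = "example-t5.bin" }, /* placeholder name */
};

#define MODEL_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Returns a pointer INTO the static table, or NULL if chip is unknown. */
const struct fw_info_model *find_fw_info_model(int chip)
{
	size_t i;

	for (i = 0; i < MODEL_ARRAY_SIZE(fw_info_array_model); i++) {
		if (fw_info_array_model[i].chip == chip)
			return &fw_info_array_model[i];
	}
	return NULL;
}

/* True if p points at an element of the static table (never heap memory). */
int fw_info_model_is_static(const struct fw_info_model *p)
{
	size_t i;

	for (i = 0; i < MODEL_ARRAY_SIZE(fw_info_array_model); i++) {
		if (p == &fw_info_array_model[i])
			return 1;
	}
	return 0;
}

/* Self-check: the looked-up entry is a table element, so no free is due. */
void fw_info_model_selfcheck(void)
{
	const struct fw_info_model *fi = find_fw_info_model(4);

	assert(fi != NULL && fw_info_model_is_static(fi));
}
```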
> 
> .
> 

^ permalink raw reply

* Re: [bpf-next V5 PATCH 0/8] xdp: introduce bulking for ndo_xdp_xmit API
From: Alexei Starovoitov @ 2018-05-25  1:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Christoph Hellwig, Björn Töpel,
	John Fastabend, Magnus Karlsson, makita.toshiaki
In-Reply-To: <152717306303.4777.4205616217877503311.stgit@firesoul>

On Thu, May 24, 2018 at 04:45:41PM +0200, Jesper Dangaard Brouer wrote:
> This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames.
> 
> When the kernel is compiled with CONFIG_RETPOLINE, every indirect function
> pointer (branch) call hurts performance. For XDP this has a huge
> negative performance impact.
> 
> This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but
> also prepares for further optimizations.  The DMA APIs' use of indirect
> function pointer calls is the primary source of the regression.  It is
> left for a followup patchset to use bulking calls towards the DMA API
> (via the scatter-gather calls).
> 
> The other advantage of this API change is that drivers can easier
> amortize the cost of any sync/locking scheme, over the bulk of
> packets.  The assumption of the current API is that the driver
> implementing the NDO will also allocate a dedicated XDP TX queue for
> every CPU in the system.  This is not always possible or practical to
> configure. E.g. ixgbe cannot load an XDP program on a machine with
> more than 96 CPUs, due to limited hardware TX queues.  E.g. virtio_net
> is hard to configure as it requires manually increasing the
> queues. E.g. tun driver chooses to use a per XDP frame producer lock
> modulo smp_processor_id over avail queues.
> 
> I have considered adding 'flags' to ndo_xdp_xmit, but it's not part of
> this patchset.  This will be a followup patchset, once we know if this
> will be needed (e.g. for non-map xdp_redirect flush-flag, and if
> AF_XDP chooses to use ndo_xdp_xmit for TX).
> 
> ---
> V5: Fixed up issues spotted by Daniel and John
> 
> V4: Split out the patches from 4 to 8 patches.  I cannot split the
> driver changes from the NDO change, but I've tried to isolate the NDO
> change together with the driver change as much as possible.

The patch 6/8 would have benefited from a review from Intel folks,
but the series has been pending for too long already, hence
Applied to bpf-next.
Please address any follow up reviews if/when they come.
Thanks Jesper.
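
The locking argument in the cover letter can be illustrated with a small
userspace sketch. Everything in it is hypothetical (struct toy_txq,
toy_xmit_one, toy_xmit_bulk); it is not the ndo_xdp_xmit signature from the
patchset, only a model that counts how often a shared TX queue lock would be
taken in a per-frame API versus a bulk API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct toy_frame {
	int id;
};

struct toy_txq {
	unsigned long lock_acquisitions;	/* lock/unlock round trips */
	unsigned long frames_sent;
	int last_id;				/* id of the last frame queued */
};

/* Per-frame style: every frame pays for its own lock round trip. */
static int toy_xmit_one(struct toy_txq *q, struct toy_frame *f)
{
	q->lock_acquisitions++;		/* models lock + unlock */
	q->last_id = f->id;
	q->frames_sent++;
	return 0;
}

/* Bulk style: one lock round trip covers the whole array of frames.
 * Returns the number of frames queued.
 */
static int toy_xmit_bulk(struct toy_txq *q, int n, struct toy_frame **frames)
{
	int i;

	q->lock_acquisitions++;		/* models lock + unlock, taken once */
	for (i = 0; i < n; i++) {
		q->last_id = frames[i]->id;
		q->frames_sent++;
	}
	return n;
}
```

For a bulk of 16 frames, the per-frame path takes the lock 16 times and the
bulk path takes it once. The same amortization applies to the indirect call
itself, which is made once per bulk instead of once per frame.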

* Re: [PATCH net-next] cxgb4: Check for kvzalloc allocation failure
From: David Miller @ 2018-05-25  1:52 UTC (permalink / raw)
  To: yuehaibing; +Cc: ganeshgr, linux-kernel, netdev
In-Reply-To: <f26c9a69-2d48-ddc7-4f12-53d0cf65b4f1@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Fri, 25 May 2018 09:39:20 +0800

> On 2018/5/24 23:07, David Miller wrote:
>> From: YueHaibing <yuehaibing@huawei.com>
>> Date: Tue, 22 May 2018 15:07:18 +0800
>> 
>>> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>>> index 130d1ee..019cffe 100644
>>> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>>> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>>> @@ -4135,6 +4135,10 @@ static int adap_init0(struct adapter *adap)
>>>  		 * card
>>>  		 */
>>>  		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
>>> +		if (!card_fw) {
>>> +			ret = -ENOMEM;
>>> +			goto bye;
>>> +		}
>>>  
>> 
>> On error, this leaks fw_info.
> 
> Hi David,
> 
> I checked: fw_info is an element of fw_info_array, and none of the members of struct fw_info need to be freed.

Aha, I misread the code, sorry.

Applied, thanks.

* Re: [PATCH net] packet: fix reserve calculation
From: David Miller @ 2018-05-25  1:56 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20180524221030.158150-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 24 May 2018 18:10:30 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> Commit b84bbaf7a6c8 ("packet: in packet_snd start writing at link
> layer allocation") ensures that packet_snd always starts writing
> the link layer header in reserved headroom allocated for this
> purpose.
> 
> This is needed because packets may be shorter than hard_header_len,
> in which case the space up to hard_header_len may be zeroed. But
> that necessary padding is not accounted for in skb->len.
> 
> The fix, however, is buggy. It calls skb_push, which grows skb->len
> when moving skb->data back. But in this case packet length should not
> change.
> 
> Instead, call skb_reserve, which moves both skb->data and skb->tail
> back, without changing length.
> 
> Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation")
> Reported-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Applied and queued up for -stable.
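
The difference between the two calls described in the commit message can be
sketched in plain C. This is a toy model, not kernel code: struct toy_skb,
toy_push and toy_reserve are hypothetical names, and offsets are kept as plain
integers. The two helpers follow the behaviour the commit message attributes
to skb_push and skb_reserve.

```c
#include <assert.h>

/* Toy stand-in for struct sk_buff: offsets into a single buffer. */
struct toy_skb {
	long data;	/* offset of the first byte of packet data */
	long tail;	/* offset one past the last byte of packet data */
	long len;	/* bytes accounted as packet length */
};

/* Modelled on skb_push: move data back by n and grow len by n. */
static void toy_push(struct toy_skb *skb, long n)
{
	skb->data -= n;
	skb->len += n;
}

/* Modelled on skb_reserve: shift data and tail together by delta.
 * A negative delta moves both back; len is left untouched.
 */
static void toy_reserve(struct toy_skb *skb, long delta)
{
	skb->data += delta;
	skb->tail += delta;
}
```

Starting from an empty buffer with data == tail == 16 and a 14-byte link
layer reserve, toy_push(&skb, 14) leaves len at 14 although nothing has been
written yet, while toy_reserve(&skb, -14) moves both pointers to offset 2 and
keeps len at 0.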

* Re: [pull request][net 0/2] Mellanox, mlx5 fixes 2018-05-24
From: David Miller @ 2018-05-25  2:02 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20180524215313.7605-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 24 May 2018 14:53:11 -0700

> This series includes two mlx5 fixes.
> 
> 1) add FCS data to checksum complete when required, from Eran Ben
> Elisha.
> 
> 2) Fix a race in IPSec sandbox QP commands, from Yossi Kuperman.
> 
> Please pull and let me know if there's any problem.

Pulled.

> for -stable v4.15
> ("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation")

Queued up.

* pull-request: bpf-next 2018-05-24
From: Alexei Starovoitov @ 2018-05-25  2:03 UTC (permalink / raw)
  To: davem; +Cc: daniel, kernel-team, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).

2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.

3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.

4) Jiong Wang adds support for indirect and arithmetic shifts to NFP.

5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.

6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers that allow
   editing/growing/shrinking an SRH and applying generic SRv6 actions to a packet.

7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.

8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.

9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit b9f672af148bf7a08a6031743156faffd58dbc7e:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-05-16 22:47:11 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 10f678683e4026e43524b0492068a371d00fdeed:

  Merge branch 'xdp_xmit-bulking' (2018-05-24 18:36:16 -0700)

----------------------------------------------------------------
Alexei Starovoitov (2):
      Merge branch 'bpf-task-fd-query'
      Merge branch 'xdp_xmit-bulking'

Björn Töpel (11):
      xsk: clean up SPDX headers
      xsk: remove newline at end of file
      xsk: fixed some cases of unnecessary parentheses
      xsk: proper '=' alignment
      xsk: remove rebind support
      xsk: fill hole in struct sockaddr_xdp
      xsk: remove explicit ring structure from uapi
      samples/bpf: adapt xdpsock to the new uapi
      xsk: add missing write- and data-dependency barrier
      xsk: simplified umem setup
      xsk: convert atomic_t to refcount_t

Daniel Borkmann (8):
      Merge branch 'bpf-af-xdp-cleanups'
      Merge branch 'bpf-nfp-shift-insns'
      Merge branch 'bpf-sk-msg-fields'
      Merge branch 'bpf-af-xdp-cleanups'
      Merge branch 'bpf-fib-mtu-check'
      Merge branch 'btf-uapi-cleanups'
      Merge branch 'bpf-multi-prog-improvements'
      Merge branch 'bpf-ipv6-seg6-bpf-action'

David Ahern (3):
      net/ipv4: Add helper to return path MTU based on fib result
      net/ipv6: Add helper to return path MTU based on fib result
      bpf: Add mtu checking to FIB forwarding helper

Gustavo A. R. Silva (2):
      bpf: sockmap, fix uninitialized variable
      bpf: sockmap, fix double-free

Jesper Dangaard Brouer (8):
      bpf: devmap introduce dev_map_enqueue
      bpf: devmap prepare xdp frames for bulking
      xdp: add tracepoint for devmap like cpumap have
      samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit
      xdp: introduce xdp_return_frame_rx_napi
      xdp: change ndo_xdp_xmit API to support bulking
      xdp/trace: extend tracepoint in devmap with an err
      samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmit

Jiong Wang (3):
      nfp: bpf: support logic indirect shifts (BPF_[L|R]SH | BPF_X)
      nfp: bpf: support arithmetic right shift by constant (BPF_ARSH | BPF_K)
      nfp: bpf: support arithmetic indirect right shift (BPF_ARSH | BPF_X)

John Fastabend (2):
      bpf: allow sk_msg programs to read sock fields
      bpf: add sk_msg prog sk access tests to test_verifier

Magnus Karlsson (1):
      xsk: proper queue id check at bind

Martin KaFai Lau (8):
      bpf: Expose check_uarg_tail_zero()
      bpf: btf: Change how section is supported in btf_header
      bpf: btf: Check array->index_type
      bpf: btf: Remove unused bits from uapi/linux/btf.h
      bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
      bpf: btf: Sync bpf.h and btf.h to tools
      bpf: btf: Add tests for the btf uapi changes
      bpf: btf: Avoid variable length array

Mathieu Xhonneux (6):
      ipv6: sr: make seg6.h includable without IPv6
      ipv6: sr: export function lookup_nexthop
      bpf: Add IPv6 Segment Routing helpers
      bpf: Split lwt inout verifier structures
      ipv6: sr: Add seg6local action End.BPF
      selftests/bpf: test for seg6local End.BPF action

Quentin Monnet (1):
      bpf: change eBPF helper doc parsing script to allow for smaller indent

Sandipan Das (10):
      bpf: support 64-bit offsets for bpf function calls
      bpf: powerpc64: pad function address loads with NOPs
      bpf: powerpc64: add JIT support for multi-function programs
      bpf: get kernel symbol addresses via syscall
      tools: bpf: sync bpf uapi header
      tools: bpftool: resolve calls without using imm field
      bpf: fix multi-function JITed dump obtained via syscall
      bpf: get JITed image lengths of functions via syscall
      tools: bpf: sync bpf uapi header
      tools: bpftool: add delimiters to multi-function JITed dumps

Sirio Balmelli (2):
      selftests/bpf: Makefile fix "missing" headers on build with -idirafter
      tools/lib/libbpf.c: fix string format to allow build on arm32

Yonghong Song (7):
      perf/core: add perf_get_event() to return perf_event given a struct file
      bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
      tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in libbpf
      tools/bpf: add ksym_get_addr() in trace_helpers
      samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
      tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
      tools/bpftool: add perf subcommand

 arch/powerpc/net/bpf_jit_comp64.c                 | 110 ++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       |  26 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h       |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  21 +-
 drivers/net/ethernet/netronome/nfp/bpf/jit.c      | 410 ++++++++++++++--
 drivers/net/ethernet/netronome/nfp/bpf/main.h     |  28 ++
 drivers/net/ethernet/netronome/nfp/bpf/offload.c  |   2 +
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |   8 +
 drivers/net/ethernet/netronome/nfp/nfp_asm.h      |  18 +-
 drivers/net/tun.c                                 |  37 +-
 drivers/net/virtio_net.c                          |  66 ++-
 include/linux/bpf.h                               |  24 +-
 include/linux/bpf_types.h                         |   5 +-
 include/linux/filter.h                            |   1 +
 include/linux/netdevice.h                         |  14 +-
 include/linux/perf_event.h                        |   5 +
 include/linux/trace_events.h                      |  17 +
 include/net/addrconf.h                            |   2 +
 include/net/ip6_fib.h                             |   6 +
 include/net/ip6_route.h                           |   3 +
 include/net/ip_fib.h                              |   2 +
 include/net/page_pool.h                           |   5 +-
 include/net/seg6.h                                |   7 +-
 include/net/seg6_local.h                          |  32 ++
 include/net/xdp.h                                 |   1 +
 include/net/xdp_sock.h                            |  13 +-
 include/trace/events/xdp.h                        |  50 +-
 include/uapi/linux/bpf.h                          | 143 +++++-
 include/uapi/linux/btf.h                          |  37 +-
 include/uapi/linux/if_xdp.h                       |  59 +--
 include/uapi/linux/seg6_local.h                   |  12 +
 kernel/bpf/arraymap.c                             |   2 +-
 kernel/bpf/btf.c                                  | 334 +++++++++----
 kernel/bpf/cpumap.c                               |   2 +-
 kernel/bpf/devmap.c                               | 131 ++++-
 kernel/bpf/sockmap.c                              |   4 +-
 kernel/bpf/syscall.c                              | 245 ++++++++-
 kernel/bpf/verifier.c                             |  23 +-
 kernel/bpf/xskmap.c                               |   9 -
 kernel/events/core.c                              |   8 +
 kernel/trace/bpf_trace.c                          |  48 ++
 kernel/trace/trace_kprobe.c                       |  29 ++
 kernel/trace/trace_uprobe.c                       |  22 +
 net/core/filter.c                                 | 572 +++++++++++++++++++---
 net/core/xdp.c                                    |  20 +-
 net/ipv4/route.c                                  |  31 ++
 net/ipv6/Kconfig                                  |   5 +
 net/ipv6/addrconf_core.c                          |   8 +
 net/ipv6/af_inet6.c                               |   1 +
 net/ipv6/route.c                                  |  48 ++
 net/ipv6/seg6_local.c                             | 190 ++++++-
 net/xdp/Makefile                                  |   1 -
 net/xdp/xdp_umem.c                                |  96 ++--
 net/xdp/xdp_umem.h                                |  18 +-
 net/xdp/xdp_umem_props.h                          |  13 +-
 net/xdp/xsk.c                                     | 150 +++---
 net/xdp/xsk_queue.c                               |  12 +-
 net/xdp/xsk_queue.h                               |  34 +-
 samples/bpf/Makefile                              |   4 +
 samples/bpf/task_fd_query_kern.c                  |  19 +
 samples/bpf/task_fd_query_user.c                  | 382 +++++++++++++++
 samples/bpf/xdp_monitor_kern.c                    |  49 ++
 samples/bpf/xdp_monitor_user.c                    |  69 ++-
 samples/bpf/xdpsock_user.c                        | 135 ++---
 scripts/bpf_helpers_doc.py                        |   8 +-
 tools/bpf/bpftool/Documentation/bpftool-perf.rst  |  81 +++
 tools/bpf/bpftool/Documentation/bpftool.rst       |   5 +-
 tools/bpf/bpftool/bash-completion/bpftool         |   9 +
 tools/bpf/bpftool/main.c                          |   3 +-
 tools/bpf/bpftool/main.h                          |   1 +
 tools/bpf/bpftool/perf.c                          | 246 ++++++++++
 tools/bpf/bpftool/prog.c                          |  97 +++-
 tools/bpf/bpftool/xlated_dumper.c                 |  14 +-
 tools/bpf/bpftool/xlated_dumper.h                 |   3 +
 tools/include/uapi/linux/bpf.h                    | 143 +++++-
 tools/include/uapi/linux/btf.h                    |  37 +-
 tools/lib/bpf/bpf.c                               |  27 +-
 tools/lib/bpf/bpf.h                               |   7 +-
 tools/lib/bpf/btf.c                               |   5 +-
 tools/lib/bpf/libbpf.c                            |  43 +-
 tools/lib/bpf/libbpf.h                            |   4 +-
 tools/testing/selftests/bpf/Makefile              |  16 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
 tools/testing/selftests/bpf/test_btf.c            | 521 ++++++++++++++++----
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 437 +++++++++++++++++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 ++++++
 tools/testing/selftests/bpf/test_progs.c          | 158 ++++++
 tools/testing/selftests/bpf/test_verifier.c       | 115 +++++
 tools/testing/selftests/bpf/trace_helpers.c       |  12 +
 tools/testing/selftests/bpf/trace_helpers.h       |   1 +
 90 files changed, 5213 insertions(+), 812 deletions(-)
 create mode 100644 include/net/seg6_local.h
 create mode 100644 samples/bpf/task_fd_query_kern.c
 create mode 100644 samples/bpf/task_fd_query_user.c
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
 create mode 100644 tools/bpf/bpftool/perf.c
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

* Re: [PATCH net] vhost: synchronize IOTLB message with dev cleanup
From: David Miller @ 2018-05-25  2:10 UTC (permalink / raw)
  To: jasowang; +Cc: mst, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1526990337-24892-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 22 May 2018 19:58:57 +0800

> DaeRyong Jeong reports a race between vhost_dev_cleanup() and
> vhost_process_iotlb_msg():
> 
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====						=====
> 						vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 						dev->iotlb = NULL;
> 
> The reason is that we don't synchronize between them. Fix this by
> protecting vhost_process_iotlb_msg() with the dev mutex.
> 
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied and queued up for -stable.
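
The check-then-use race in the interleaving above, and the mutex-based fix,
can be sketched in userspace with pthreads. All names here (struct toy_dev,
toy_process_iotlb_msg, toy_dev_cleanup) are hypothetical, and the sketch
assumes that cleanup runs under the same mutex; it illustrates the locking
pattern only and is not the vhost code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical device: iotlb is NULL once the device is cleaned up. */
struct toy_dev {
	pthread_mutex_t mutex;
	int *iotlb;
};

static void toy_dev_init(struct toy_dev *dev)
{
	pthread_mutex_init(&dev->mutex, NULL);
	dev->iotlb = calloc(1, sizeof(*dev->iotlb));
}

/* The NULL check and the use of iotlb happen under one lock, so cleanup
 * cannot free the table between the two steps.
 */
static int toy_process_iotlb_msg(struct toy_dev *dev, int val)
{
	int ret = 0;

	pthread_mutex_lock(&dev->mutex);
	if (!dev->iotlb)
		ret = -EFAULT;
	else
		*dev->iotlb = val;
	pthread_mutex_unlock(&dev->mutex);
	return ret;
}

/* Free and clear iotlb under the same lock (an assumption of this sketch). */
static void toy_dev_cleanup(struct toy_dev *dev)
{
	pthread_mutex_lock(&dev->mutex);
	free(dev->iotlb);
	dev->iotlb = NULL;
	pthread_mutex_unlock(&dev->mutex);
}
```

A message processed before cleanup succeeds; one processed after cleanup sees
iotlb == NULL and returns -EFAULT instead of touching freed memory.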

* Re: [PATCH net] net : sched: cls_api: deal with egdev path only if needed
From: David Miller @ 2018-05-25  2:13 UTC (permalink / raw)
  To: ogerlitz; +Cc: jiri, jakub.kicinski, paulb, netdev, ASAP_Direct_Dev
In-Reply-To: <1527092688-27496-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Wed, 23 May 2018 19:24:48 +0300

> When dealing with ingress rule on a netdev, if we did fine through the
> conventional path, there's no need to continue into the egdev route,
> and we can stop right there.
> 
> Not doing so may cause a 2nd rule to be added by the cls api layer
> with the ingress being the egdev.
> 
> For example, under sriov switchdev scheme, a user rule of VFR A --> VFR B
> will end up with two HW rules (1) VF A --> VF B and (2) uplink --> VF B
> 
> Fixes: 208c0f4b5237 ('net: sched: use tc_setup_cb_call to call per-block callbacks')
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

Applied and queued up for -stable.

> As I wrote in [1], we are asking this patch to go into net and 
> stable >= 4.15 but not carried into net-next.

Please send me a revert with a detailed commit message when this
gets merged into net-next.

Thanks.

* Re: [PATCH net-next 0/3] selftests: forwarding: Additions to mirror-to-gretap tests
From: David Miller @ 2018-05-25  2:14 UTC (permalink / raw)
  To: petrm; +Cc: netdev, linux-kselftest, shuah, idosch
In-Reply-To: <cover.1527093017.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Wed, 23 May 2018 18:34:49 +0200

> This patchset is for a handful of edge cases in mirror-to-gretap
> scenarios: removal of mirrored-to netdevice (#1), removal of underlay
> route for tunnel remote endpoint (#2) and cessation of mirroring upon
> removal of flower mirroring rule (#3).

Series applied, thank you.

* Re: [PATCH net] ipv4: remove warning in ip_recv_error
From: David Miller @ 2018-05-25  2:18 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20180523182952.77006-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 23 May 2018 14:29:52 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> A precondition check in ip_recv_error triggered on an otherwise benign
> race. Remove the warning.
> 
> The warning triggers when passing an ipv6 socket to this ipv4 error
> handling function. RaceFuzzer was able to trigger it due to a race
> in setsockopt IPV6_ADDRFORM.
 ...
> This socket option converts a v6 socket that is connected to a v4 peer
> to a v4 socket. It updates the socket on the fly, changing fields in
> sk as well as other structs. This is inherently non-atomic. It races
> with the lockless udp_recvmsg path.
> 
> No other code makes an assumption that these fields are updated
> atomically. It is benign here, too, as ip_recv_error cares only about
> the protocol of the skbs enqueued on the error queue, for which
> sk_family is not a precise predictor (thanks to another issue with
> IPV6_ADDRFORM).
> 
> Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr
> Fixes: ("7ce875e5ecb8 ipv4: warn once on passing AF_INET6 socket to ip_recv_error")
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied and queued up for -stable.

The SHA1_ID doesn't go inside the (" ") of the Fixes tag, I fixed
it up this time.

* Re: [PATCH net-next 0/8] ibmvnic: Failover hardening
From: David Miller @ 2018-05-25  2:19 UTC (permalink / raw)
  To: tlfalcon; +Cc: netdev, nfont, jallen, linuxppc-dev
In-Reply-To: <1527100682-23099-1-git-send-email-tlfalcon@linux.vnet.ibm.com>

From: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Date: Wed, 23 May 2018 13:37:54 -0500

> Introduce additional transport event hardening to handle
> events during device reset. In the driver's current state,
> if a transport event is received during device reset, it can
> cause the device to become unresponsive as invalid operations
> are processed as the backing device context changes. After
> a transport event, the device expects a request to begin the
> initialization process. If the driver is still processing
> a previously queued device reset in this state, it is likely
> to fail as firmware will reject any commands other than the
> one to initialize the client driver's Command-Response Queue.
> 
> Instead of failing and becoming dormant, the driver will make
> one more attempt to recover and continue operation. This is
> achieved by setting a state flag, which if true will direct
> the driver to clean up all allocated resources and perform
> a hard reset in an attempt to bring the driver back to an
> operational state.

Series applied.

* Re: pull-request: bpf-next 2018-05-24
From: David Miller @ 2018-05-25  2:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, kernel-team, netdev
In-Reply-To: <20180525020351.3634582-1-ast@kernel.org>

From: Alexei Starovoitov <ast@kernel.org>
Date: Thu, 24 May 2018 19:03:51 -0700

> The following pull-request contains BPF updates for your *net-next*
> tree.

Pulled, thanks Alexei.

* [RFC net-next 0/4] net: sched: support replay of filter offload when binding to block
From: Jakub Kicinski @ 2018-05-25  2:25 UTC (permalink / raw)
  To: netdev
  Cc: jiri, gerlitz.or, sridhar.samudrala, oss-drivers, john.hurley,
	Jakub Kicinski

Hi!

This series from John adds the ability to replay filter offload requests
when a new offload callback is being registered on a TC block.  This is most
likely to take place for shared blocks today, when a block which already
has rules is bound to another interface.  Prior to this patch set if any
of the rules were offloaded the block bind would fail.

A new tcf_proto_op is added to generate a filter-specific offload request.
The new 'offload' op supports extack from day 0, hence we need to
propagate extack to .ndo_setup_tc TC_BLOCK_BIND/TC_BLOCK_UNBIND and
through tcf_block_cb_register() to tcf_block_playback_offloads().

The immediate use of this patch set is to simplify life of drivers which
require duplicating rules when sharing blocks.  Switch drivers (mlxsw)
can bind ports to rule lists dynamically, NIC drivers generally don't
have that ability and need the rules to be duplicated for each ingress
they match on.  In code terms this means that switch drivers don't
register multiple callbacks for each port.  NIC drivers do, and get a
separate request and hence rule per port, as if the block was not shared.
The registration would fail, however, if some rules were already present.

As John notes in description of patch 2, drivers which register multiple
callbacks to shared blocks will likely need to flush the rules on block
unbind.  This set makes the core not only replay the offload add
requests but also offload remove requests when a callback is unregistered.
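
The replay behaviour described above can be sketched as a small userspace
model. All names are hypothetical (struct toy_block, toy_cb_register,
toy_cb_unregister, toy_port_cb) and do not correspond to the tcf_block API.
The unwind on a failed add is an assumption of this sketch, not something
stated in this cover letter.

```c
#include <assert.h>
#include <errno.h>

#define TOY_MAX_RULES 8

enum toy_cmd { TOY_RULE_ADD, TOY_RULE_DEL };

typedef int (*toy_cb_t)(enum toy_cmd cmd, int rule_id, void *priv);

/* Hypothetical block holding the rules that already exist. */
struct toy_block {
	int rules[TOY_MAX_RULES];
	int nrules;
};

/* Replay rules [0, upto) to one callback. If an add fails, the rules
 * already added are removed again and the error is returned.
 */
static int toy_playback(struct toy_block *b, toy_cb_t cb, void *priv,
			enum toy_cmd cmd, int upto)
{
	int i, err;

	for (i = 0; i < upto; i++) {
		err = cb(cmd, b->rules[i], priv);
		if (err && cmd == TOY_RULE_ADD) {
			toy_playback(b, cb, priv, TOY_RULE_DEL, i);
			return err;
		}
	}
	return 0;
}

/* Registering a callback replays every existing rule as an add. */
static int toy_cb_register(struct toy_block *b, toy_cb_t cb, void *priv)
{
	return toy_playback(b, cb, priv, TOY_RULE_ADD, b->nrules);
}

/* Unregistering replays every existing rule as a remove. */
static void toy_cb_unregister(struct toy_block *b, toy_cb_t cb, void *priv)
{
	toy_playback(b, cb, priv, TOY_RULE_DEL, b->nrules);
}

/* Example per-port callback with a fixed rule capacity. */
struct toy_port {
	int installed;
	int capacity;
};

static int toy_port_cb(enum toy_cmd cmd, int rule_id, void *priv)
{
	struct toy_port *p = priv;

	(void)rule_id;
	if (cmd == TOY_RULE_DEL) {
		p->installed--;
		return 0;
	}
	if (p->installed >= p->capacity)
		return -ENOSPC;
	p->installed++;
	return 0;
}
```

Binding a block that already holds three rules to a port with capacity four
installs all three on registration. Binding it to a port with capacity two
fails with -ENOSPC and leaves that port with no rules installed.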

John Hurley (4):
  net: sched: pass extack pointer to block binds and cb registration
  net: sched: add tcf_proto_op to offload a rule
  net: sched: cls_flower: implement offload tcf_proto_op
  net: sched: cls_matchall: implement offload tcf_proto_op

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 14 ++--
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +-
 drivers/net/netdevsim/netdev.c                |  2 +-
 include/net/act_api.h                         |  3 -
 include/net/pkt_cls.h                         | 12 ++-
 include/net/sch_generic.h                     |  6 ++
 net/dsa/slave.c                               |  2 +-
 net/sched/cls_api.c                           | 74 ++++++++++++++-----
 net/sched/cls_flower.c                        | 51 +++++++++++++
 net/sched/cls_matchall.c                      | 40 ++++++++++
 21 files changed, 184 insertions(+), 44 deletions(-)

-- 
2.17.0

* [RFC net-next 1/4] net: sched: pass extack pointer to block binds and cb registration
From: Jakub Kicinski @ 2018-05-25  2:25 UTC (permalink / raw)
  To: netdev; +Cc: jiri, gerlitz.or, sridhar.samudrala, oss-drivers, john.hurley
In-Reply-To: <20180525022539.6799-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Pass the extack struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of the binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of failures in the bind process.

Signed-off-by: John Hurley <john.hurley@netronome.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 10 ++++----
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +-
 drivers/net/netdevsim/netdev.c                |  2 +-
 include/net/pkt_cls.h                         |  6 +++--
 net/dsa/slave.c                               |  2 +-
 net/sched/cls_api.c                           | 24 ++++++++++++-------
 17 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index dfa0839f6656..a9187b0d97e3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7984,7 +7984,7 @@ static int bnxt_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, bnxt_setup_tc_block_cb,
-					     bp, bp);
+					     bp, bp, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, bnxt_setup_tc_block_cb, bp);
 		return 0;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 38f635cf8408..6cd3976f9920 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -173,7 +173,7 @@ static int bnxt_vf_rep_setup_tc_block(struct net_device *dev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     bnxt_vf_rep_setup_tc_block_cb,
-					     vf_rep, vf_rep);
+					     vf_rep, vf_rep, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					bnxt_vf_rep_setup_tc_block_cb, vf_rep);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 513e1d356384..191ac686c5fb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3015,7 +3015,7 @@ static int cxgb_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, cxgb_setup_tc_block_cb,
-					     pi, dev);
+					     pi, dev, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, cxgb_setup_tc_block_cb, pi);
 		return 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b5daa5c9c7de..07dc46bbe508 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7554,7 +7554,7 @@ static int i40e_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, i40e_setup_tc_block_cb,
-					     np, np);
+					     np, np, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, i40e_setup_tc_block_cb, np);
 		return 0;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index a7b87f935411..3f8bb0d61f63 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2926,7 +2926,7 @@ static int i40evf_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, i40evf_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, i40evf_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 78574c06635b..b1ba426d2b40 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2727,7 +2727,7 @@ static int igb_setup_tc_block(struct igb_adapter *adapter,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a52d92e182ee..81553a602f0f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9376,7 +9376,7 @@ static int ixgbe_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, ixgbe_setup_tc_block_cb,
-					     adapter, adapter);
+					     adapter, adapter, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, ixgbe_setup_tc_block_cb,
 					adapter);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b5a7580b12fe..2e0029173a64 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3178,7 +3178,7 @@ static int mlx5e_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, mlx5e_setup_tc_block_cb,
-					     priv, priv);
+					     priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, mlx5e_setup_tc_block_cb,
 					priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index c3034f58aa33..d9fe432f18f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -803,7 +803,7 @@ static int mlx5e_rep_setup_tc_block(struct net_device *dev,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, mlx5e_rep_setup_tc_cb,
-					     priv, priv);
+					     priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, mlx5e_rep_setup_tc_cb, priv);
 		return 0;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index bb252b36994d..6887c7faaaba 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1503,7 +1503,8 @@ static int mlxsw_sp_setup_tc_block_cb_flower(enum tc_setup_type type,
 
 static int
 mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
-				    struct tcf_block *block, bool ingress)
+				    struct tcf_block *block, bool ingress,
+				    struct netlink_ext_ack *extack)
 {
 	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
 	struct mlxsw_sp_acl_block *acl_block;
@@ -1518,7 +1519,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
 			return -ENOMEM;
 		block_cb = __tcf_block_cb_register(block,
 						   mlxsw_sp_setup_tc_block_cb_flower,
-						   mlxsw_sp, acl_block);
+						   mlxsw_sp, acl_block, extack);
 		if (IS_ERR(block_cb)) {
 			err = PTR_ERR(block_cb);
 			goto err_cb_register;
@@ -1596,11 +1597,12 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		err = tcf_block_cb_register(f->block, cb, mlxsw_sp_port,
-					    mlxsw_sp_port);
+					    mlxsw_sp_port, f->extack);
 		if (err)
 			return err;
 		err = mlxsw_sp_setup_tc_block_flower_bind(mlxsw_sp_port,
-							  f->block, ingress);
+							  f->block, ingress,
+							  f->extack);
 		if (err) {
 			tcf_block_cb_unregister(f->block, cb, mlxsw_sp_port);
 			return err;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index fcdfb8e7fdea..bf46f7bff912 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -206,7 +206,7 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     nfp_bpf_setup_tc_block_cb,
-					     nn, nn);
+					     nn, nn, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					nfp_bpf_setup_tc_block_cb,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index c42e64f32333..7abefed1efe9 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -627,7 +627,7 @@ static int nfp_flower_setup_tc_block(struct net_device *netdev,
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
 					     nfp_flower_setup_tc_block_cb,
-					     repr, repr);
+					     repr, repr, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block,
 					nfp_flower_setup_tc_block_cb,
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c32de53a00d3..f5bee8da084e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3759,7 +3759,7 @@ static int stmmac_setup_tc_block(struct stmmac_priv *priv,
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, stmmac_setup_tc_block_cb,
-				priv, priv);
+				priv, priv, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, stmmac_setup_tc_block_cb, priv);
 		return 0;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ec68f38213d9..c9dacc6fcd59 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -260,7 +260,7 @@ nsim_setup_tc_block(struct net_device *dev, struct tc_block_offload *f)
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block, nsim_setup_tc_block_cb,
-					     ns, ns);
+					     ns, ns, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, nsim_setup_tc_block_cb, ns);
 		return 0;
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0005f0b40fe9..b2b5cbe13086 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -73,10 +73,11 @@ void tcf_block_cb_incref(struct tcf_block_cb *block_cb);
 unsigned int tcf_block_cb_decref(struct tcf_block_cb *block_cb);
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
-					     void *cb_priv);
+					     void *cb_priv,
+					     struct netlink_ext_ack *extack);
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
-			  void *cb_priv);
+			  void *cb_priv, struct netlink_ext_ack *extack);
 void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb);
 void tcf_block_cb_unregister(struct tcf_block *block,
 			     tc_setup_cb_t *cb, void *cb_ident);
@@ -596,6 +597,7 @@ struct tc_block_offload {
 	enum tc_block_command command;
 	enum tcf_block_binder_type binder_type;
 	struct tcf_block *block;
+	struct netlink_ext_ack *extack;
 };
 
 struct tc_cls_common_offload {
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 1e3b6a6d8a40..71536c435132 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -900,7 +900,7 @@ static int dsa_slave_setup_tc_block(struct net_device *dev,
 
 	switch (f->command) {
 	case TC_BLOCK_BIND:
-		return tcf_block_cb_register(f->block, cb, dev, dev);
+		return tcf_block_cb_register(f->block, cb, dev, dev, f->extack);
 	case TC_BLOCK_UNBIND:
 		tcf_block_cb_unregister(f->block, cb, dev);
 		return 0;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 963e4bf0aab8..95a83aa798d1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -276,7 +276,8 @@ static bool tcf_block_offload_in_use(struct tcf_block *block)
 static int tcf_block_offload_cmd(struct tcf_block *block,
 				 struct net_device *dev,
 				 struct tcf_block_ext_info *ei,
-				 enum tc_block_command command)
+				 enum tc_block_command command,
+				 struct netlink_ext_ack *extack)
 {
 	struct tc_block_offload bo = {};
 
@@ -287,7 +288,8 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 }
 
 static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
-				  struct tcf_block_ext_info *ei)
+				  struct tcf_block_ext_info *ei,
+				  struct netlink_ext_ack *extack)
 {
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
@@ -298,10 +300,12 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	/* If tc offload feature is disabled and the block we try to bind
 	 * to already has some offloaded filters, forbid to bind.
 	 */
-	if (!tc_can_offload(dev) && tcf_block_offload_in_use(block))
+	if (!tc_can_offload(dev) && tcf_block_offload_in_use(block)) {
+		NL_SET_ERR_MSG(extack, "Bind to offloaded block failed as dev has offload disabled");
 		return -EOPNOTSUPP;
+	}
 
-	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND);
+	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND, extack);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_inc;
 	return err;
@@ -321,7 +325,7 @@ static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_dec;
-	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND);
+	err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND, NULL);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_dec;
 	return;
@@ -539,7 +543,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 	if (err)
 		goto err_chain_head_change_cb_add;
 
-	err = tcf_block_offload_bind(block, q, ei);
+	err = tcf_block_offload_bind(block, q, ei, extack);
 	if (err)
 		goto err_block_offload_bind;
 
@@ -675,7 +679,8 @@ EXPORT_SYMBOL(tcf_block_cb_decref);
 
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 					     tc_setup_cb_t *cb, void *cb_ident,
-					     void *cb_priv)
+					     void *cb_priv,
+					     struct netlink_ext_ack *extack)
 {
 	struct tcf_block_cb *block_cb;
 
@@ -699,11 +704,12 @@ EXPORT_SYMBOL(__tcf_block_cb_register);
 
 int tcf_block_cb_register(struct tcf_block *block,
 			  tc_setup_cb_t *cb, void *cb_ident,
-			  void *cb_priv)
+			  void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct tcf_block_cb *block_cb;
 
-	block_cb = __tcf_block_cb_register(block, cb, cb_ident, cb_priv);
+	block_cb = __tcf_block_cb_register(block, cb, cb_ident, cb_priv,
+					   extack);
 	return IS_ERR(block_cb) ? PTR_ERR(block_cb) : 0;
 }
 EXPORT_SYMBOL(tcf_block_cb_register);
-- 
2.17.0
