Netdev List


* Re: [PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: kbuild test robot @ 2018-06-27  2:18 UTC
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet
In-Reply-To: <20180627015222.3269067-1-brakmo@fb.com>

[-- Attachment #1: Type: text/plain, Size: 2947 bytes --]

Hi Lawrence,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/tcp-force-cwnd-at-least-2-in-tcp_cwnd_reduction/20180627-095533
config: x86_64-randconfig-x003-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:18:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from net/ipv4/tcp_input.c:67:
   net/ipv4/tcp_input.c: In function 'tcp_cwnd_reduction':
   include/linux/kernel.h:812:29: warning: comparison of distinct pointer types lacks a cast
      (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                ^
   include/linux/kernel.h:826:4: note: in expansion of macro '__typecheck'
      (__typecheck(x, y) && __no_side_effects(x, y))
       ^~~~~~~~~~~
   include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
     __builtin_choose_expr(__safe_cmp(x, y), \
                           ^~~~~~~~~~
   include/linux/kernel.h:852:19: note: in expansion of macro '__careful_cmp'
    #define max(x, y) __careful_cmp(x, y, >)
                      ^~~~~~~~~~~~~
>> net/ipv4/tcp_input.c:2480:17: note: in expansion of macro 'max'
     tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
                    ^~~

vim +/max +2480 net/ipv4/tcp_input.c

  2455	
  2456	void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
  2457	{
  2458		struct tcp_sock *tp = tcp_sk(sk);
  2459		int sndcnt = 0;
  2460		int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
  2461	
  2462		if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
  2463			return;
  2464	
  2465		tp->prr_delivered += newly_acked_sacked;
  2466		if (delta < 0) {
  2467			u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
  2468				       tp->prior_cwnd - 1;
  2469			sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
  2470		} else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
  2471			   !(flag & FLAG_LOST_RETRANS)) {
  2472			sndcnt = min_t(int, delta,
  2473				       max_t(int, tp->prr_delivered - tp->prr_out,
  2474					     newly_acked_sacked) + 1);
  2475		} else {
  2476			sndcnt = min(delta, newly_acked_sacked);
  2477		}
  2478		/* Force a fast retransmit upon entering fast recovery */
  2479		sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> 2480		tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
  2481	}
  2482	
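
The warning comes from the type check inside max(): tcp_packets_in_flight() returns unsigned int, so the first argument is unsigned while the literal 2 is int. A minimal user-space sketch of that check follows, assuming GNU C extensions (__typeof__ and statement expressions). The names checked_max, packets_in_flight and cwnd_with_floor are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's type-checked max() (hypothetical name).
 * Comparing the addresses of the two temporaries makes the compiler warn
 * ("comparison of distinct pointer types") when the argument types differ. */
#define checked_max(x, y) ({                    \
	__typeof__(x) _x = (x);                 \
	__typeof__(y) _y = (y);                 \
	(void)(&_x == &_y);                     \
	_x > _y ? _x : _y; })

/* Hypothetical stand-in for tcp_packets_in_flight(); returns unsigned int. */
static unsigned int packets_in_flight(void)
{
	return 0;
}

/* Casting to int first gives both arguments the same type, so the
 * pointer comparison inside the macro is well-typed and warning-free. */
static int cwnd_with_floor(int sndcnt)
{
	int cwnd = checked_max((int)packets_in_flight() + sndcnt, 2);

	assert(cwnd >= 2);
	return cwnd;
}
```

Dropping the (int) cast in cwnd_with_floor() should reproduce a warning of the same kind as the one reported above; max_t(int, ...) would be another way to force a common type.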

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32495 bytes --]


* [PATCH net-next] net: qmi_wwan: Add pass through mode
From: Subash Abhinov Kasiviswanathan @ 2018-06-27  2:30 UTC
  To: bjorn, dnlplm, dcbw, davem, netdev; +Cc: Subash Abhinov Kasiviswanathan

Pass-through mode allows packets in MAP format to be passed on to the
stack. The rmnet driver can then be used to process and demultiplex
these packets. Note that pass-through mode can be enabled only when the
device is in raw IP mode.
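
The enable/disable rule described above can be sketched in user space as follows. This is an illustration only, not the driver code: the flag values match the patch, while pass_through_enabled() and set_pass_through() are hypothetical helpers. Because the pass-through flag is bit 2, the masked value has to be normalised to 0 or 1 before it is compared with a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in the patch. */
enum qmi_wwan_flags {
	QMI_WWAN_FLAG_RAWIP = 1 << 0,
	QMI_WWAN_FLAG_MUX = 1 << 1,
	QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
};

/* Hypothetical helper: report the flag as a bool (0 or 1), not as 0 or 4. */
static bool pass_through_enabled(unsigned long flags)
{
	return (flags & QMI_WWAN_FLAG_PASS_THROUGH) != 0;
}

/* Hypothetical helper: returns 0 on success or no change,
 * -1 if the device is not in raw IP mode. */
static int set_pass_through(unsigned long *flags, bool enable)
{
	assert(flags != NULL);

	/* no change? */
	if (enable == pass_through_enabled(*flags))
		return 0;

	/* pass-through mode can be set for raw IP devices only */
	if (!(*flags & QMI_WWAN_FLAG_RAWIP))
		return -1;

	if (enable)
		*flags |= QMI_WWAN_FLAG_PASS_THROUGH;
	else
		*flags &= ~(unsigned long)QMI_WWAN_FLAG_PASS_THROUGH;
	return 0;
}
```

The sketch leaves out the rtnl locking, the running-device check and the netdevice notifiers that the real sysfs handler needs.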

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/usb/qmi_wwan.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 8fac8e1..6eeec92 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -59,6 +59,7 @@ struct qmi_wwan_state {
 enum qmi_wwan_flags {
 	QMI_WWAN_FLAG_RAWIP = 1 << 0,
 	QMI_WWAN_FLAG_MUX = 1 << 1,
+	QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
 };
 
 enum qmi_wwan_quirks {
@@ -425,14 +426,82 @@ static ssize_t del_mux_store(struct device *d,  struct device_attribute *attr, c
 	return ret;
 }
 
+static ssize_t pass_through_show(struct device *d,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+
+	info = (void *)&dev->data;
+	return sprintf(buf, "%c\n",
+		       info->flags & QMI_WWAN_FLAG_PASS_THROUGH ? 'Y' : 'N');
+}
+
+static ssize_t pass_through_store(struct device *d,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+	bool enable;
+	int ret;
+
+	if (strtobool(buf, &enable))
+		return -EINVAL;
+
+	info = (void *)&dev->data;
+
+	/* no change? */
+	if (enable == !!(info->flags & QMI_WWAN_FLAG_PASS_THROUGH))
+		return len;
+
+	/* pass through mode can be set for raw ip devices only */
+	if (!(info->flags & QMI_WWAN_FLAG_RAWIP)) {
+		netdev_err(dev->net, "Cannot set pass through mode on non ip device\n");
+		return -EINVAL;
+	}
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	/* we don't want to modify a running netdev */
+	if (netif_running(dev->net)) {
+		netdev_err(dev->net, "Cannot change a running device\n");
+		ret = -EBUSY;
+		goto err;
+	}
+
+	/* let other drivers deny the change */
+	ret = call_netdevice_notifiers(NETDEV_PRE_TYPE_CHANGE, dev->net);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		netdev_err(dev->net, "Type change was refused\n");
+		goto err;
+	}
+
+	if (enable)
+		info->flags |= QMI_WWAN_FLAG_PASS_THROUGH;
+	else
+		info->flags &= ~QMI_WWAN_FLAG_PASS_THROUGH;
+	qmi_wwan_netdev_setup(dev->net);
+	call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev->net);
+	ret = len;
+err:
+	rtnl_unlock();
+	return ret;
+}
+
 static DEVICE_ATTR_RW(raw_ip);
 static DEVICE_ATTR_RW(add_mux);
 static DEVICE_ATTR_RW(del_mux);
+static DEVICE_ATTR_RW(pass_through);
 
 static struct attribute *qmi_wwan_sysfs_attrs[] = {
 	&dev_attr_raw_ip.attr,
 	&dev_attr_add_mux.attr,
 	&dev_attr_del_mux.attr,
+	&dev_attr_pass_through.attr,
 	NULL,
 };
 
@@ -479,6 +548,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	if (info->flags & QMI_WWAN_FLAG_MUX)
 		return qmimux_rx_fixup(dev, skb);
 
+	if (rawip && (info->flags & QMI_WWAN_FLAG_PASS_THROUGH)) {
+		skb->protocol = htons(ETH_P_MAP);
+		return (netif_rx(skb) == NET_RX_SUCCESS);
+	}
+
 	switch (skb->data[0] & 0xf0) {
 	case 0x40:
 		proto = htons(ETH_P_IP);
-- 
1.9.1


* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-27  2:32 UTC
  To: Cong Wang, Flavio Leitner
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpVmbuVK+pfhad7HrQtaSxFbe7K=59_34f0BwxoP_=W0VQ@mail.gmail.com>



On 06/26/2018 05:29 PM, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
>>
>> It is still isolated, the sk carries the netns info and it is
>> orphaned when it re-enters the stack.
> 
> Then what difference does your patch make?
> 
> Before your patch:
> veth orphans skb in its xmit
> 
> After your patch:
> RX orphans it when re-entering stack (as you claimed, I don't know)
> 
> And for veth pair:
> xmit from one side is RX for the other side
> 
> So, where is the queueing? Where is the buffer bloat? GRO list??
> 

By re-entering the stack, Flavio probably meant storing this skb in
a socket receive queue, or anything that should already modify skb->destructor
(and thus call skb_orphan() before the modification)

If skb sits in some qdisc, like fq on ipvlan master device, we do not want skb->sk to be scrubbed,
just because ipvlan slave and master might be in different netns.


* [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Lawrence Brakmo @ 2018-06-27  2:34 UTC
  To: netdev; +Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Eric Dumazet

When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 doing 1MB back-to-back RPCs and
1 doing 10KB back-to-back RPCs). The following table shows the 99% and
99.9% latencies for both Cubic and dctcp:

           Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
 1MB RPCs    3.5ms       6.0ms         43ms          208ms
10KB RPCs    1.0ms       2.5ms         53ms          212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

             dctcp 99%       dctcp 99.9%
 1MB RPCs      3.8ms             4.4ms
10KB RPCs      168us             211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only open question is whether it is safe to always use 2 or whether
it is better to use min(2, snd_ssthresh) (which could still trigger the
problem).
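
To make the effect of the floor concrete, here is a simplified user-space model of the computation in tcp_cwnd_reduction(). It is a sketch under assumptions: struct prr_state and prr_cwnd() are hypothetical, the fields are plain ints rather than the kernel's u32 fields, and the flag test is reduced to a single retrans_acked argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plain-C model of the state used by tcp_cwnd_reduction(). */
struct prr_state {
	int snd_ssthresh;
	int prior_cwnd;
	int prr_delivered;
	int prr_out;
	int in_flight;      /* stands in for tcp_packets_in_flight(tp) */
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Returns the new cwnd, clamped to at least 2 as in the patch.
 * retrans_acked models (flag & FLAG_RETRANS_DATA_ACKED) &&
 * !(flag & FLAG_LOST_RETRANS). */
static int prr_cwnd(struct prr_state *s, int newly_acked_sacked,
		    int retrans_acked)
{
	int delta = s->snd_ssthresh - s->in_flight;
	int sndcnt;

	assert(newly_acked_sacked > 0 && s->prior_cwnd > 0);

	s->prr_delivered += newly_acked_sacked;
	if (delta < 0) {
		/* proportional reduction, rounded up */
		uint64_t dividend = (uint64_t)s->snd_ssthresh *
				    (uint64_t)s->prr_delivered +
				    (uint64_t)s->prior_cwnd - 1;
		sndcnt = (int)(dividend / (uint64_t)s->prior_cwnd) - s->prr_out;
	} else if (retrans_acked) {
		sndcnt = min_int(delta,
				 max_int(s->prr_delivered - s->prr_out,
					 newly_acked_sacked) + 1);
	} else {
		sndcnt = min_int(delta, newly_acked_sacked);
	}
	/* force a fast retransmit upon entering fast recovery */
	sndcnt = max_int(sndcnt, s->prr_out ? 0 : 1);

	return max_int(s->in_flight + sndcnt, 2);
}
```

With illustrative values in_flight = 0, snd_ssthresh = 2 and one newly acked packet, the unclamped sum is 0 + 1 = 1, which is the cwnd of 1 described above; the clamp raises it to 2.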

v2: fixed compiler warning in max function arguments

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..282bd85322b0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
 	}
 	/* Force a fast retransmit upon entering fast recovery */
 	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
 }

 static inline void tcp_end_cwnd_reduction(struct sock *sk)
-- 
2.17.1


* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-27  2:35 UTC
  To: Cong Wang, Eric Dumazet
  Cc: Flavio Leitner, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpUOvavz580tJHCXhAR6AMprmFvakqiaiVMYe7NLFvFJ0Q@mail.gmail.com>



On 06/26/2018 05:44 PM, Cong Wang wrote:
 
> With this, a netns could totally throttle a TCP socket in a different
> netns by holding the packets infinitely (e.g. putting them in a loop).
> This is where the isolation breaks.
> 

That is fine, really. Admin error -> Working as intended.

The current scrubbing is simply wrong, not documented, and added by someone
who had absolutely not intended all the side effects.


* [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Saeed Mahameed @ 2018-06-27  2:46 UTC
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Hello,

Although it is still far from complete, this series provides the means
and infrastructure for device drivers to share packet meta data and
accelerations with XDP programs, such as hash, stripped vlan, checksum,
flow mark, packet header types, etc.

The series is still a work in progress; I am sending it now as an RFC in
order to get early feedback and to discuss the design, structures and UAPI.

For now the general idea is to help XDP programs run faster by providing
them with meta data that is already available from the HW or device
driver, saving cpu cycles on packet header parsing, data processing and
decision making. Besides that, we want to avoid a fixed size meta data
structure that wastes a lot of buffer space on fields the xdp program
might not need, like the current "elephant" SKB fields, kidding :) ..

So my idea here is to provide a dynamic mechanism that allows XDP programs
to request, on xdp load, only the specific meta data they need. The
device driver or netdevice will then provide them in a predefined order
in the xdp_buff/xdp_md data meta section.

Here is how it is done and what I would like to discuss:

1. The first patch adds the infrastructure to request predefined meta data
flags on xdp load, indicating that the XDP program is going to need them.

1.1) In this patch I am using the current u32 IFLA_XDP_FLAGS field.
TODO: this needs to be improved in order to allow more meta data flags,
maybe by using a new dedicated flags field?

1.2) Device drivers that want to support xdp meta data should implement
the new XDP command XDP_QUERY_META_FLAGS and report the meta data flags
they can support.

1.3) The kernel will cross check the requested flags against the device's
supported flags and will fail the xdp load in case of a discrepancy.

Question: do we want this? Or is it better to return the actually
supported flags to the program and let it decide?
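
The two policies in the question above can be contrasted in a small user-space sketch. The flag values are the ones proposed in patch 1; meta_flags_supported() and meta_flags_usable() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as proposed in patch 1 of this series. */
#define XDP_FLAGS_META_HASH		(1U << 16)
#define XDP_FLAGS_META_MARK		(1U << 17)
#define XDP_FLAGS_META_VLAN		(1U << 18)
#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH | \
					 XDP_FLAGS_META_MARK | \
					 XDP_FLAGS_META_VLAN | \
					 XDP_FLAGS_META_CSUM_COMPLETE)

/* Strict policy (what the RFC does): the load succeeds only if every
 * requested meta data flag is supported by the device. */
static bool meta_flags_supported(uint32_t supported, uint32_t requested)
{
	requested &= XDP_FLAGS_META_ALL;
	if (!requested)
		return true;	/* no meta data requested: always fine */
	return (supported & requested) == requested;
}

/* Alternative policy from the question: report the usable subset and
 * let the program decide what to do with it. */
static uint32_t meta_flags_usable(uint32_t supported, uint32_t requested)
{
	uint32_t usable = supported & requested & XDP_FLAGS_META_ALL;

	assert((usable & ~requested) == 0);
	return usable;
}
```

Under the strict policy a program requesting HASH | VLAN on a device that only supports HASH fails to load; under the alternative it would learn that only HASH is available.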

2. This is the interesting part: the 2nd patch adds the meta data set/get
infrastructure, which allows device drivers to populate meta data into the
xdp_buff data meta area in a well defined and structured format, and lets
the xdp program read it according to that predefined format. The idea is
that the xdp program and the device driver share a static array of
offsets that defines where each meta data item sits inside the xdp_buff
data meta area. This layout is decided once, on xdp load.

Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are set only
for requested meta data.
Example: for an XDP program that requested the following flags:
flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

the offsets array will be:

xdp_md_info_arr mdi = {
        [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
        [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
        [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
        [XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
};

In this example the hash fields always appear first, followed by the vlan,
in every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.
The user space XDP program will use a similar xdp_md_info_arr to
statically know the offset of each meta data item.

* Future meta data types will be added to the end of the array, with
higher flag values.
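
The offset assignment described above can be modelled in user space as below. The enum, struct layouts and the 16-bit flag shift follow patch 2 of this series; md_info_build() and the index-based XDP_FLAG_META() macro are hypothetical variants written for this sketch, with stdint types replacing the kernel's __u16/__u32.

```c
#include <assert.h>
#include <stdint.h>

enum {
	XDP_DATA_META_HASH = 0,
	XDP_DATA_META_MARK,
	XDP_DATA_META_VLAN,
	XDP_DATA_META_CSUM_COMPLETE,
	XDP_DATA_META_MAX,
};

struct xdp_md_hash { uint32_t hash; uint8_t type; } __attribute__((aligned(4)));
struct xdp_md_mark { uint32_t mark; } __attribute__((aligned(4)));
struct xdp_md_vlan { uint16_t vlan; } __attribute__((aligned(4)));
struct xdp_md_csum { uint16_t csum; } __attribute__((aligned(4)));

struct xdp_md_info {
	uint16_t present:1;
	uint16_t offset:15;	/* offset from data_meta */
};
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

#define XDP_FLAGS_META_SHIFT	16
/* Hypothetical index-based form of the patch's XDP_FLAG_META() macro. */
#define XDP_FLAG_META(idx)	((1U << (idx)) << XDP_FLAGS_META_SHIFT)

static const uint16_t xdp_md_size[XDP_DATA_META_MAX] = {
	sizeof(struct xdp_md_hash),
	sizeof(struct xdp_md_mark),
	sizeof(struct xdp_md_vlan),
	sizeof(struct xdp_md_csum),
};

/* Assign ascending offsets to the requested items only.
 * Returns the total size of the meta data area in bytes. */
static uint16_t md_info_build(xdp_md_info_arr mdi, uint32_t flags)
{
	uint16_t offset = 0;
	int i;

	for (i = 0; i < XDP_DATA_META_MAX; i++) {
		if (!(flags & XDP_FLAG_META(i))) {
			mdi[i].present = 0;
			mdi[i].offset = 0;
			continue;
		}
		assert(offset < (1U << 15));	/* must fit the 15-bit field */
		mdi[i].present = 1;
		mdi[i].offset = offset;
		offset = (uint16_t)(offset + xdp_md_size[i]);
	}
	return offset;
}
```

For HASH | VLAN this yields offset 0 for the hash and sizeof(struct xdp_md_hash) for the vlan, matching the example array above.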

This patch also provides helper functions for device drivers to store
meta data into xdp_buff, and helper functions for XDP programs to fetch
specific xdp meta data from the xdp_md buffer.

Question: currently the XDP program is responsible for building the static
meta data offsets array "xdp_md_info_arr" itself, while the kernel builds it
for the netdevice. If we choose this direction, should we somehow share
the xdp_md_info_arr built by the kernel with the xdp program?

3. The last mlx5e patch is the actual showcase of how a device driver
adds support for xdp meta data; it is pretty simple and straightforward!
The first two mlx5e patches are just small refactoring to make
the xdp_md_info_arr and packet completion information available in the rx
xdp handlers data path.

4. The last patch adds a small example to demonstrate how an XDP program
can request meta data on xdp load and how it can read them on the critical
path. Of course more examples and some performance numbers are needed.
Exciting use cases include:
  - using flow mark to allow fast white/black listing lookups.
  - flow mark to accelerate DDoS prevention.
  - Generic Data path: Use the meta data from the xdp_buff to build SKBs! (Jesper's idea)
  - using hash type to know the packet headers and encapsulation without
    touching the packet data at all.
  - using packet hash to do RPS and XPS like cpu redirecting.
  - and many more.
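
As one illustration of the RPS-like use case in the list above, a program could map the hardware hash to a CPU index without parsing the packet. This is only a user-space sketch: struct md_hash mirrors the fields of the proposed xdp_md_hash, and pick_cpu() with its fallback argument is a hypothetical helper, not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the fields of the proposed struct xdp_md_hash. */
struct md_hash {
	uint32_t hash;
	uint8_t type;
};

/* Hypothetical helper: choose a CPU from the RX hash. When no hash
 * meta data is available (h is a null pointer), use the fallback CPU. */
static uint32_t pick_cpu(const struct md_hash *h, uint32_t nr_cpus,
			 uint32_t fallback)
{
	assert(nr_cpus > 0 && fallback < nr_cpus);

	if (!h)
		return fallback;
	return h->hash % nr_cpus;
}
```

A real program would more likely look the result up in a CPU map; the modulo is just the simplest way to spread flows while keeping each flow on one CPU.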

More ideas:

From Jesper:
=========================
Give XDP richer information about the hash.
By reducing the 'ht' value to the PKT_HASH_TYPE's we are losing information.

I happen to know your hardware well; CQE_RSS_HTYPE_IP tells us if this
is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tells us if this is TCP, UDP or
IPSEC. (And you have another bit telling if this is IPv6 with extension
headers).

If we don't want to invent our own xdp_hash_types, we can simply adopt
the RSS Hashing Types defined by Microsoft:
 https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

An interesting part of the RSS standard, is that the hash type can help
identify if this is a fragment. (XDP could use this info to avoid
touching payload and react, e.g. drop fragments, or redirect all
fragments to another CPU, or skip parsing in XDP and defer to network
stack via XDP_PASS).

By using the RSS standard, we do e.g. lose the ability to say this is
IPSEC traffic, even though your HW supports this.

I have tried to implement my own (non-dynamic) XDP RX-types UAPI here:
 https://marc.info/?l=linux-netdev&m=149512213531769
 https://marc.info/?l=linux-netdev&m=149512213631774
=========================

Thanks,
Saeed.

Saeed Mahameed (6):
  net: xdp: Add support for meta data flags requests
  net: xdp: RX meta data infrastructure
  net/mlx5e: Store xdp flags and meta data info
  net/mlx5e: Pass CQE to RX handlers
  net/mlx5e: Add XDP RX meta data support
  samples/bpf: Add meta data hash example to xdp_redirect_cpu

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
 include/linux/netdevice.h                     |  10 ++
 include/net/xdp.h                             |   6 +
 include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
 include/uapi/linux/if_link.h                  |  20 ++-
 net/core/dev.c                                |  41 +++++++
 samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
 samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
 10 files changed, 354 insertions(+), 35 deletions(-)

-- 
2.17.0


* [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests
From: Saeed Mahameed @ 2018-06-27  2:46 UTC
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

A user space application can request to enable a specific set of meta data
to be reported in every xdp buffer provided to the xdp program.

When meta data flags are requested, XDP devices must respond to the
XDP_QUERY_META_FLAGS command with all the meta data flags the device
actually supports. The kernel will cross check them against the
requested flags; in case of a discrepancy the xdp install will fail.

If the flags are supported, the device must guarantee to deliver all RX
packets with only the meta data requested in the meta data flags on the
xdp_install operation.

The following flags are added, and can be provided by the netlink xdp
flags field.

+#define XDP_FLAGS_META_HASH            (1U << 16)
+#define XDP_FLAGS_META_MARK            (1U << 17)
+#define XDP_FLAGS_META_VLAN            (1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE   (1U << 19)

The format, device delivery methods and XDP program access to such meta
data are discussed in a later patch.

TODO: use a different flags field for XDP meta data, to make sure we
have more free bits.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |  9 +++++++++
 include/uapi/linux/if_link.h | 16 +++++++++++++++-
 net/core/dev.c               | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..fc8b6ce48a0f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -812,6 +812,10 @@ enum bpf_netdev_command {
 	 * is equivalent to XDP_ATTACHED_DRV.
 	 */
 	XDP_QUERY_PROG,
+	/* Query Supported XDP_FLAGS_META_*, Called before new XDP program setup
+	 * and only if meta_flags were requested by the user to validate if the
+	 * device supports the requested flags, if not program setup will fail */
+	XDP_QUERY_META_FLAGS,
 	/* BPF program for offload callbacks, invoked at program load time. */
 	BPF_OFFLOAD_VERIFIER_PREP,
 	BPF_OFFLOAD_TRANSLATE,
@@ -842,6 +846,11 @@ struct netdev_bpf {
 			/* flags with which program was installed */
 			u32 prog_flags;
 		};
+		/* XDP_QUERY_META_FLAGS */
+		struct {
+			/* TODO u64 */
+			u32 meta_flags;
+		};
 		/* BPF_OFFLOAD_VERIFIER_PREP */
 		struct {
 			struct bpf_prog *prog;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cf01b6824244..dfb1e26bacef 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -911,8 +911,22 @@ enum {
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
 					 XDP_FLAGS_HW_MODE)
+
+/* TODO : add new netlink xdp u64 meta_flags
+ * for meta data only
+ */
+#define XDP_FLAGS_META_HASH		(1U << 16)
+#define XDP_FLAGS_META_MARK		(1U << 17)
+#define XDP_FLAGS_META_VLAN		(1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
+#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+					 XDP_FLAGS_META_MARK      | \
+					 XDP_FLAGS_META_VLAN      | \
+					 XDP_FLAGS_META_CSUM_COMPLETE)
+
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
-					 XDP_FLAGS_MODES)
+					 XDP_FLAGS_MODES             | \
+					 XDP_FLAGS_META_ALL)
 
 /* These are stored into IFLA_XDP_ATTACHED on dump. */
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..8a5cc2c731ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7295,6 +7295,21 @@ static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
 	return xdp.prog_attached;
 }
 
+static bool __dev_xdp_meta_supported(struct net_device *dev,
+				     bpf_op_t bpf_op, u32 meta_flags)
+{
+	struct netdev_bpf xdp = {};
+
+	 /* Backward compatible, all devices support no meta_flags */
+	if (!meta_flags)
+		return true;
+
+	xdp.command = XDP_QUERY_META_FLAGS;
+	bpf_op(dev, &xdp);
+
+	return ((xdp.meta_flags & meta_flags) == meta_flags);
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7362,12 +7377,17 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
+		u32 meta_flags = (flags & XDP_FLAGS_META_ALL);
+
 		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
 			return -EEXIST;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
 		    __dev_xdp_attached(dev, bpf_op))
 			return -EBUSY;
 
+		if (!__dev_xdp_meta_supported(dev, bpf_op, meta_flags))
+			return -EINVAL;
+
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
 					     bpf_op == ops->ndo_bpf);
 		if (IS_ERR(prog))
-- 
2.17.0


* [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Saeed Mahameed @ 2018-06-27  2:46 UTC
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

The idea of this patch is to define a well known structure for the XDP meta
data fields format and their offset placement inside the xdp data meta buffer.

For that, the driver needs some static information to know what to
provide and where. Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are set only
for requested meta data.
Example: flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

xdp_md_info_arr mdi = {
	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
	[XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
};

i.e. the hash fields always appear first, followed by the vlan, in every
xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.

This patch also provides helper functions for the device drivers to store
meta data into xdp_buff, and helper function for XDP programs to fetch
specific xdp meta data from xdp_md buffer.
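
A store/fetch round trip through such helpers can be modelled in user space as follows. This is a sketch under assumptions: the md_* names are hypothetical and reduced to two items, and the getter adds a bounds check against the start of packet data, in the spirit of the check XDP programs perform before reading data_meta.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MD_HASH = 0, MD_VLAN, MD_MAX };

struct md_info {
	uint16_t present:1;
	uint16_t offset:15;	/* offset from data_meta */
};

struct md_hash { uint32_t hash; uint8_t type; } __attribute__((aligned(4)));
struct md_vlan { uint16_t vlan; } __attribute__((aligned(4)));

/* Driver side: store the vlan TCI at its pre-computed offset. */
static void md_set_vlan(const struct md_info *mdi, void *data_meta,
			uint16_t tci)
{
	struct md_vlan v = { .vlan = tci };

	assert(mdi[MD_VLAN].present);
	memcpy((char *)data_meta + mdi[MD_VLAN].offset, &v, sizeof(v));
}

/* Program side: fetch the vlan TCI. Returns 0 on success, -1 if the
 * item was not requested or would extend past the start of packet data. */
static int md_get_vlan(const struct md_info *mdi, const void *data_meta,
		       const void *data, uint16_t *tci)
{
	const char *p;
	struct md_vlan v;

	if (!mdi[MD_VLAN].present)
		return -1;
	p = (const char *)data_meta + mdi[MD_VLAN].offset;
	if (p + sizeof(v) > (const char *)data)
		return -1;
	memcpy(&v, p, sizeof(v));
	*tci = v.vlan;
	return 0;
}
```

Both sides only agree on the layout because they use the same offsets array, which is why the cover letter asks whether the kernel-built array should be shared with the program.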

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |   1 +
 include/net/xdp.h            |   6 ++
 include/uapi/linux/bpf.h     | 114 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  16 +++--
 net/core/dev.c               |  21 +++++++
 5 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fc8b6ce48a0f..172363310ade 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -838,6 +838,7 @@ struct netdev_bpf {
 			u32 flags;
 			struct bpf_prog *prog;
 			struct netlink_ext_ack *extack;
+			xdp_md_info_arr md_info;
 		};
 		/* XDP_QUERY_PROG */
 		struct {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2deea7166a34..afe302613ae1 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
 	xdp->data_meta = xdp->data + 1;
 }
 
+static __always_inline void
+xdp_reset_data_meta(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data_hard_start;
+}
+
 static __always_inline bool
 xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..e320e7421565 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2353,6 +2353,120 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
+enum {
+	XDP_DATA_META_HASH = 0,
+	XDP_DATA_META_MARK,
+	XDP_DATA_META_VLAN,
+	XDP_DATA_META_CSUM_COMPLETE,
+
+	XDP_DATA_META_MAX,
+};
+
+struct xdp_md_hash {
+	__u32 hash;
+	__u8  type;
+} __attribute__((aligned(4)));
+
+struct xdp_md_mark {
+	__u32 mark;
+} __attribute__((aligned(4)));
+
+struct xdp_md_vlan {
+	__u16 vlan;
+} __attribute__((aligned(4)));
+
+struct xdp_md_csum {
+	__u16 csum;
+} __attribute__((aligned(4)));
+
+static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
+	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
+	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
+	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
+	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
+};
+
+struct xdp_md_info {
+	__u16 present:1;
+	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
+};
+
+/* XDP meta data offsets info array
+ * present bit describes if a meta data is or will be present in xdp_md buff
+ * offset describes where a meta data is or should be placed in xdp_md buff
+ *
+ * Kernel builds this array using xdp_md_info_build helper on demand.
+ * User space builds it statically in the xdp program.
+ */
+typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
+
+static __always_inline __u8
+xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
+{
+	return mdi[meta_data].present;
+}
+
+static __always_inline void
+xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
+{
+	struct xdp_md_hash *hash_md;
+
+	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+	hash_md->hash = hash;
+	hash_md->type = htype;
+}
+
+static __always_inline struct xdp_md_hash *
+xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_mark(xdp_md_info_arr mdi, void *data_meta, __u32 mark)
+{
+	struct xdp_md_mark *mark_md;
+
+	mark_md = (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+	mark_md->mark = mark;
+}
+
+static __always_inline struct xdp_md_mark *
+xdp_data_meta_get_mark(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_vlan(xdp_md_info_arr mdi, void *data_meta, __u16 vlan_tci)
+{
+	struct xdp_md_vlan *vlan_md;
+
+	vlan_md = (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+	vlan_md->vlan = vlan_tci;
+}
+
+static __always_inline struct xdp_md_vlan *
+xdp_data_meta_get_vlan(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_csum_complete(xdp_md_info_arr mdi, void *data_meta, __u16 csum)
+{
+	struct xdp_md_csum *csum_md;
+
+	csum_md = (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+	csum_md->csum = csum;
+}
+
+static __always_inline struct xdp_md_csum *
+xdp_data_meta_get_csum_complete(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+}
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index dfb1e26bacef..5bdc07bfad35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/netlink.h>
+#include <linux/bpf.h>
 
 /* This struct should be in sync with struct rtnl_link_stats64 */
 struct rtnl_link_stats {
@@ -913,13 +914,16 @@ enum {
 					 XDP_FLAGS_HW_MODE)
 
 /* TODO : add new netlink xdp u64 meta_flags
- * for meta data only
+ * for meta data only & build according XDP_DATA_META_* enums
  */
-#define XDP_FLAGS_META_HASH		(1U << 16)
-#define XDP_FLAGS_META_MARK		(1U << 17)
-#define XDP_FLAGS_META_VLAN		(1U << 18)
-#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
-#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+
+#define XDP_FLAGS_META_SHIFT            (16)
+#define XDP_FLAG_META(FLAG)             ((1U << XDP_DATA_META_##FLAG) << XDP_FLAGS_META_SHIFT)
+#define XDP_FLAGS_META_HASH             XDP_FLAG_META(HASH)
+#define XDP_FLAGS_META_MARK             XDP_FLAG_META(MARK)
+#define XDP_FLAGS_META_VLAN             XDP_FLAG_META(VLAN)
+#define XDP_FLAGS_META_CSUM_COMPLETE    XDP_FLAG_META(CSUM_COMPLETE)
+#define XDP_FLAGS_META_ALL              (XDP_FLAGS_META_HASH      | \
 					 XDP_FLAGS_META_MARK      | \
 					 XDP_FLAGS_META_VLAN      | \
 					 XDP_FLAGS_META_CSUM_COMPLETE)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8a5cc2c731ec..141dd6632b00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7310,6 +7310,25 @@ static bool __dev_xdp_meta_supported(struct net_device *dev,
 	return ((xdp.meta_flags & meta_flags) == meta_flags);
 }
 
+static void xdp_md_info_build(xdp_md_info_arr mdi, u32 xdp_flags_meta)
+{
+	u16 offset = 0;
+	int i;
+
+	for (i = 0; i < XDP_DATA_META_MAX; i++) {
+		u32 meta_data_flag = BIT(i) << XDP_FLAGS_META_SHIFT;
+
+		if (!(meta_data_flag & xdp_flags_meta)) {
+			mdi[i].present = 0;
+			continue;
+		}
+
+		mdi[i].present = 1;
+		mdi[i].offset = offset;
+		offset += xdp_md_size[i];
+	}
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7325,6 +7344,8 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 	xdp.flags = flags;
 	xdp.prog = prog;
 
+	xdp_md_info_build(xdp.md_info, flags & XDP_FLAGS_META_ALL);
+
 	return bpf_op(dev, &xdp);
 }
 
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Make xdp flags and meta data info available to the RQ data path,
to enable xdp meta data offloads in the next two patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 52 +++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  2 +-
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index eb9eb7aa953a..5893acfae307 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,12 @@ enum mlx5e_priv_flag {
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #endif
 
+struct mlx5e_xdp_info {
+	struct bpf_prog      *prog;
+	u32                   flags;
+	xdp_md_info_arr       md_info;
+};
+
 struct mlx5e_params {
 	u8  log_sq_size;
 	u8  rq_wq_type;
@@ -257,7 +263,7 @@ struct mlx5e_params {
 	bool tx_dim_enabled;
 	u32 lro_timeout;
 	u32 pflags;
-	struct bpf_prog *xdp_prog;
+	struct mlx5e_xdp_info xdp;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
@@ -566,7 +572,7 @@ struct mlx5e_rq {
 	struct net_dim         dim; /* Dynamic Interrupt Moderation */
 
 	/* XDP */
-	struct bpf_prog       *xdp_prog;
+	struct mlx5e_xdp_info  xdp;
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 56c1b6f5593e..8debae6b9cab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -96,7 +96,7 @@ bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 
 static u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
 {
-	if (!params->xdp_prog) {
+	if (!params->xdp.prog) {
 		u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 		u16 rq_headroom = MLX5_RX_HEADROOM + NET_IP_ALIGN;
 
@@ -170,7 +170,7 @@ static u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 static u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 				 struct mlx5e_params *params)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
+	u16 linear_rq_headroom = params->xdp.prog ?
 		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
 	bool is_linear_skb;
 
@@ -205,7 +205,7 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
 {
 	return mlx5e_check_fragmented_striding_rq_cap(mdev) &&
 		!MLX5_IPSEC_DEV(mdev) &&
-		!(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
+		!(params->xdp.prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
 }
 
 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
@@ -489,11 +489,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->stats   = &c->priv->channel_stats[c->ix].rq;
+	rq->xdp     = params->xdp;
 
-	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
-	if (IS_ERR(rq->xdp_prog)) {
-		err = PTR_ERR(rq->xdp_prog);
-		rq->xdp_prog = NULL;
+	rq->xdp.prog = params->xdp.prog ? bpf_prog_inc(params->xdp.prog) : NULL;
+	if (IS_ERR(rq->xdp.prog)) {
+		err = PTR_ERR(rq->xdp.prog);
+		rq->xdp.prog = NULL;
 		goto err_rq_wq_destroy;
 	}
 
@@ -501,7 +502,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (err < 0)
 		goto err_rq_wq_destroy;
 
-	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+	rq->buff.map_dir = rq->xdp.prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
 	pool_size = 1 << params->log_rq_mtu_frames;
 
@@ -679,8 +680,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	}
 
 err_rq_wq_destroy:
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
 		page_pool_destroy(rq->page_pool);
@@ -693,8 +694,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
@@ -1906,7 +1907,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->netdev   = priv->netdev;
 	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
 	c->num_tc   = params->num_tc;
-	c->xdp      = !!params->xdp_prog;
+	c->xdp      = !!params->xdp.prog;
 	c->stats    = &priv->channel_stats[ix].ch;
 
 	mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
@@ -4090,12 +4091,12 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	queue_work(priv->wq, &priv->tx_timeout_work);
 }
 
-static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+static int mlx5e_xdp_set(struct net_device *netdev, struct mlx5e_xdp_info *xdp)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
-	struct bpf_prog *old_prog;
-	int err = 0;
+	struct bpf_prog *old_prog, *prog = xdp->prog;
 	bool reset, was_opened;
+	int err = 0;
 	int i;
 
 	mutex_lock(&priv->state_lock);
@@ -4114,7 +4115,7 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 	/* no need for full reset when exchanging programs */
-	reset = (!priv->channels.params.xdp_prog || !prog);
+	reset = (!priv->channels.params.xdp.prog || !prog);
 
 	if (was_opened && reset)
 		mlx5e_close_locked(netdev);
@@ -4132,10 +4133,12 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	/* exchange programs, extra prog reference we got from caller
 	 * as long as we don't fail from this point onwards.
 	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	old_prog = xchg(&priv->channels.params.xdp.prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	priv->channels.params.xdp = *xdp;
+
 	if (reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
@@ -4155,7 +4158,8 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
-		old_prog = xchg(&c->rq.xdp_prog, prog);
+		old_prog = xchg(&c->rq.xdp.prog, prog);
+		c->rq.xdp = *xdp;
 
 		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
 		/* napi_schedule in case we have missed anything */
@@ -4177,7 +4181,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	u32 prog_id = 0;
 
 	mutex_lock(&priv->state_lock);
-	xdp_prog = priv->channels.params.xdp_prog;
+	xdp_prog = priv->channels.params.xdp.prog;
 	if (xdp_prog)
 		prog_id = xdp_prog->aux->id;
 	mutex_unlock(&priv->state_lock);
@@ -4187,9 +4191,15 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
+	struct mlx5e_xdp_info xdp_info;
+
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return mlx5e_xdp_set(dev, xdp->prog);
+		xdp_info.prog  = xdp->prog;
+		xdp_info.flags = xdp->flags;
+		memcpy(xdp_info.md_info, xdp->md_info, sizeof(xdp->md_info));
+
+		return mlx5e_xdp_set(dev, &xdp_info);
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d3a1dd20e41d..d12577c17011 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -925,7 +925,7 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
 				    void *va, u16 *rx_headroom, u32 *len)
 {
-	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

CQE has all the meta data information from HW.
Make it available to the driver xdp handlers.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5893acfae307..98bb315fc8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -505,7 +505,8 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+			       struct mlx5_cqe64 *cqe);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			 struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
@@ -901,10 +902,12 @@ void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			  struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d12577c17011..e37f9747a0e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -923,7 +923,8 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
-				    void *va, u16 *rx_headroom, u32 *len)
+				    void *va, u16 *rx_headroom,
+				    u32 *len, struct mlx5_cqe64 *cqe)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
@@ -1012,7 +1013,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	}
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, cqe);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -1155,7 +1156,8 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe)
 {
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1202,7 +1204,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1221,7 +1224,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	prefetch(data);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, cqe);
 	rcu_read_unlock();
 	if (consumed) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1268,7 +1271,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 
 	skb = rq->mpwqe.skb_from_cqe_mpwrq(rq, wi, cqe_bcnt, head_offset,
-					   page_idx);
+					   page_idx, cqe);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Implement XDP meta data hash and vlan support.

1. on xdp setup ndo: add support for XDP_QUERY_META_FLAGS and return the
two supported flags
2. use xdp_data_meta_set_hash and xdp_data_meta_set_vlan helpers to fill
in the meta data fields.
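
A minimal user-space sketch of step 2, assuming the packing rule from
xdp_md_info_build() in the previous patch (requested fields are laid out
back to back, in flag-bit order). The md_* names and the struct layouts
are hypothetical and exist only for this illustration; the real
definitions live in the uapi header touched earlier in the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Field order follows the XDP_FLAGS_META_* bit order used in this series. */
enum { MD_HASH, MD_MARK, MD_VLAN, MD_CSUM_COMPLETE, MD_MAX };

#define MD_FLAGS_SHIFT 16
#define MD_FLAG(i) ((1U << (i)) << MD_FLAGS_SHIFT)

/* Hypothetical field layouts, chosen only for this illustration. */
struct md_hash { uint32_t hash; uint32_t type; };
struct md_vlan { uint16_t vlan; };

struct md_info { uint16_t offset; uint16_t present; };

static const uint16_t md_size[MD_MAX] = {
	[MD_HASH]          = sizeof(struct md_hash),
	[MD_MARK]          = sizeof(uint32_t),
	[MD_VLAN]          = sizeof(struct md_vlan),
	[MD_CSUM_COMPLETE] = sizeof(uint16_t),
};

/* Assign an offset to every requested field; returns the total size used. */
static uint16_t md_info_build(struct md_info mdi[MD_MAX], uint32_t flags)
{
	uint16_t offset = 0;

	for (int i = 0; i < MD_MAX; i++) {
		mdi[i].present = !!(flags & MD_FLAG(i));
		mdi[i].offset = 0;
		if (!mdi[i].present)
			continue;
		mdi[i].offset = offset;
		offset += md_size[i];
	}
	return offset;
}

/* memcpy keeps the accesses safe even when a field lands unaligned. */
static void md_set_hash(const struct md_info *mdi, void *meta,
			uint32_t hash, uint32_t type)
{
	struct md_hash h = { .hash = hash, .type = type };

	assert(mdi[MD_HASH].present);
	memcpy((char *)meta + mdi[MD_HASH].offset, &h, sizeof(h));
}

static void md_set_vlan(const struct md_info *mdi, void *meta, uint16_t tci)
{
	struct md_vlan v = { .vlan = tci };

	assert(mdi[MD_VLAN].present);
	memcpy((char *)meta + mdi[MD_VLAN].offset, &v, sizeof(v));
}

static uint32_t md_get_hash(const struct md_info *mdi, const void *meta)
{
	struct md_hash h;

	assert(mdi[MD_HASH].present);
	memcpy(&h, (const char *)meta + mdi[MD_HASH].offset, sizeof(h));
	return h.hash;
}

static uint16_t md_get_vlan(const struct md_info *mdi, const void *meta)
{
	struct md_vlan v;

	assert(mdi[MD_VLAN].present);
	memcpy(&v, (const char *)meta + mdi[MD_VLAN].offset, sizeof(v));
	return v.vlan;
}
```

With only hash and vlan requested, the vlan field lands directly after
the hash field, which is the layout the sample program in patch 6/6
relies on.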

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 ++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 30 ++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8debae6b9cab..3d1066a953cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4189,6 +4189,9 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	return prog_id;
 }
 
+#define MLX5E_SUPPORTED_XDP_META_FLAGS  \
+             (XDP_FLAGS_META_HASH  | XDP_FLAGS_META_VLAN)
+
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct mlx5e_xdp_info xdp_info;
@@ -4204,6 +4207,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
 		return 0;
+	case XDP_QUERY_META_FLAGS:
+		xdp->meta_flags = MLX5E_SUPPORTED_XDP_META_FLAGS;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e37f9747a0e3..1f3e934d0dd8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -920,6 +920,29 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 	return true;
 }
 
+static void
+mlx5e_xdp_fill_data_meta(xdp_md_info_arr mdi,  void *data_meta, struct mlx5_cqe64 *cqe)
+{
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_HASH))
+	{
+		u8 cht = cqe->rss_hash_type;
+		int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
+			 (cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
+						    PKT_HASH_TYPE_NONE;
+		u32 hash = be32_to_cpu(cqe->rss_hash_result);
+
+		xdp_data_meta_set_hash(mdi, data_meta, hash, ht);
+	}
+
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_VLAN))
+	{
+		u16 vlan = (!!cqe_has_vlan(cqe) * VLAN_TAG_PRESENT) |
+			   be16_to_cpu(cqe->vlan_info);
+
+		xdp_data_meta_set_vlan(mdi, data_meta, vlan);
+	}
+}
+
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
@@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		return false;
 
 	xdp.data = va + *rx_headroom;
-	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
 	xdp.rxq = &rq->xdp_rxq;
 
+	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
+		xdp_reset_data_meta(&xdp);
+		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
+	} else
+		xdp_set_data_meta_invalid(&xdp);
+
 	act = bpf_prog_run_xdp(prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Add a new program (prog_num = 4) that will not parse packets and will
use the meta data hash to spread/redirect traffic into different cpus.

For the new program we set on bpf_set_link_xdp_fd:
	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

On mlx5 it will succeed since mlx5 already supports these flags.
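
As a rough illustration of why the request succeeds, the sketch below
rebuilds the flag values with the XDP_FLAG_META() macro from patch 2/6
and applies the same subset test as __dev_xdp_meta_supported(). The enum
order is an assumption, inferred from the old bit values (16..19) that
the macro replaces; meta_supported() is a hypothetical stand-in name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed enum order, consistent with the former (1U << 16..19) defines. */
enum {
	XDP_DATA_META_HASH,
	XDP_DATA_META_MARK,
	XDP_DATA_META_VLAN,
	XDP_DATA_META_CSUM_COMPLETE,
};

#define XDP_FLAGS_META_SHIFT 16
#define XDP_FLAG_META(FLAG) \
	((1U << XDP_DATA_META_##FLAG) << XDP_FLAGS_META_SHIFT)

/* What the mlx5e patch in this series reports as supported. */
#define MLX5E_SUPPORTED_XDP_META_FLAGS \
	(XDP_FLAG_META(HASH) | XDP_FLAG_META(VLAN))

/* A request is accepted only if every requested bit is supported. */
static bool meta_supported(uint32_t supported, uint32_t requested)
{
	return (supported & requested) == requested;
}
```

Requesting hash and vlan passes this check against the mlx5 mask, while
adding the mark flag to the request would make it fail.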

The new program will read the value of the hash from the data_meta
pointer from the xdp_md and will use it to compute the destination cpu.
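
The hash-to-cpu step can be sketched in plain C as below. The scaling
function mirrors the kernel's reciprocal_scale() helper, which maps a
32-bit value into [0, ep_ro) with a multiply and a shift instead of a
modulo. MAX_CPUS is given an assumed value here, and hash_to_cpu_idx()
is a hypothetical wrapper for illustration.

```c
#include <stdint.h>

#define MAX_CPUS 64 /* assumed value; the sample defines its own MAX_CPUS */

/* Map val from [0, 2^32) into [0, ep_ro) without a division. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Index into the cpus_available map, derived from the RSS hash. */
static uint32_t hash_to_cpu_idx(uint32_t hash)
{
	return reciprocal_scale(hash, MAX_CPUS);
}
```

Because the result depends on the upper bits of the hash, nearby hash
values tend to land on the same cpu index, and the index is always
below MAX_CPUS.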

Note: I have not tested that redirect actually works with the hash!
I only used this patch to verify that the hash and vlan values are set
correctly by the driver and can be seen by the xdp program.

* I had trouble reading the hash value through the helper functions
defined in the previous patches, but the same logic written out without
these functions worked. I will have to figure this out later.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_cpu_user.c |  7 +++
 2 files changed, 74 insertions(+)

diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
index 303e9e7161f3..d6b3f55f342a 100644
--- a/samples/bpf/xdp_redirect_cpu_kern.c
+++ b/samples/bpf/xdp_redirect_cpu_kern.c
@@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
 	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
 }
 
+#if 0
+xdp_md_info_arr mdi = {
+	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
+	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
+};
+#endif
+
+SEC("xdp_cpu_map4_hash_separate")
+int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
+{
+	void *data_meta = (void *)(long)ctx->data_meta;
+	void *data_end  = (void *)(long)ctx->data_end;
+	void *data      = (void *)(long)ctx->data;
+	struct xdp_md_hash *hash;
+	struct xdp_md_vlan *vlan;
+	struct datarec *rec;
+	u32 cpu_dest = 0;
+	u32 cpu_idx = 0;
+	u32 *cpu_lookup;
+	u32 key = 0;
+
+	/* Count RX packet in map */
+	rec = bpf_map_lookup_elem(&rx_cnt, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* for some reason this code fails to be verified */
+#if 0
+	hash = xdp_data_meta_get_hash(mdi, data_meta);
+	if (hash + 1 > data)
+		return XDP_ABORTED;
+
+	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
+	if (vlan + 1 > data)
+		return XDP_ABORTED;
+#endif
+
+	/* Work around for the above code */
+	hash = data_meta; /* since we know hash will appear first */
+	if (hash + 1 > data)
+		return XDP_ABORTED;
+
+#if 0
+	// Just for testing
+	/* We know that vlan will appear after the hash */
+	vlan = (void *)((char *)data_meta + sizeof(*hash));
+	if (vlan + 1 > data) {
+		return XDP_ABORTED;
+	}
+#endif
+
+	cpu_idx = reciprocal_scale(hash->hash, MAX_CPUS);
+
+	cpu_lookup = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+	if (!cpu_lookup)
+		return XDP_ABORTED;
+	cpu_dest = *cpu_lookup;
+
+	if (cpu_dest >= MAX_CPUS) {
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+}
+
 SEC("xdp_cpu_map4_ddos_filter_pktgen")
 int  xdp_prognum4_ddos_filter_pktgen(struct xdp_md *ctx)
 {
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index f6efaefd485b..3429215d5a7b 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -679,6 +679,13 @@ int main(int argc, char **argv)
 		return EXIT_FAIL_OPTION;
 	}
 
+	/*
+	 * prog_num 4 requires xdp meta data hash
+	 * Vlan is not required but added just for testing..
+	 */
+	if (prog_num == 4)
+		xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
+
 	/* Remove XDP program when program is interrupted */
 	signal(SIGINT, int_exit);
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH bpf-next] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
From: Jakub Kicinski @ 2018-06-27  2:48 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev, oss-drivers, Jiong Wang

From: Jiong Wang <jiong.wang@netronome.com>

Map read is already supported on NFP; this patch enables the memcpy
optimization for copies from map to packet.

This patch also fixes a latent bug that would cause copying from an
unexpected address once memcpy from a map pointer is enabled.  The fixed
code path was not exercised before.
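
A minimal sketch of the relaxed pointer-type check, assuming a cut-down
enum in place of the verifier's register types; memcpy_ptr_types_ok() is
a hypothetical name and covers only the type test, not the rest of
curr_pair_is_memcpy().

```c
#include <stdbool.h>

/* Hypothetical subset of the verifier's pointer types. */
enum ptr_type { PTR_TO_PACKET, PTR_TO_MAP_VALUE, PTR_TO_STACK };

/*
 * A load/store pair qualifies for the memcpy optimization only when the
 * source is packet or map-value memory and the destination is packet memory.
 */
static bool memcpy_ptr_types_ok(enum ptr_type ld, enum ptr_type st)
{
	if (ld != PTR_TO_PACKET && ld != PTR_TO_MAP_VALUE)
		return false;

	return st == PTR_TO_PACKET;
}
```

Only the source side is widened: copies into a map value are still
rejected by this check.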

Reported-by: Mary Pham <mary.pham@netronome.com>
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
Reposting separately from the mul/div patches.

 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 8a92088df0d7..33111739b210 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 	xfer_num = round_up(len, 4) / 4;
 
 	if (src_40bit_addr)
-		addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
+		addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
 			      &off);
 
 	/* Setup PREV_ALU fields to override memory read length. */
@@ -3299,7 +3299,8 @@ curr_pair_is_memcpy(struct nfp_insn_meta *ld_meta,
 	if (!is_mbpf_load(ld_meta) || !is_mbpf_store(st_meta))
 		return false;
 
-	if (ld_meta->ptr.type != PTR_TO_PACKET)
+	if (ld_meta->ptr.type != PTR_TO_PACKET &&
+	    ld_meta->ptr.type != PTR_TO_MAP_VALUE)
 		return false;
 
 	if (st_meta->ptr.type != PTR_TO_PACKET)
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net-next 3/4] net: check tunnel option type in tunnel flags
From: kbuild test robot @ 2018-06-27  3:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kbuild-all, davem, jbenc, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	daniel, oss-drivers, netdev, Pieter Jansen van Vuuren,
	Jakub Kicinski
In-Reply-To: <20180626185308.3605-4-jakub.kicinski@netronome.com>

Hi Pieter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jakub-Kicinski/net-Geneve-options-support-for-TC-act_tunnel_key/20180627-030036
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/openvswitch/flow_netlink.c:2532:38: sparse: incorrect type in assignment (different base types) @@    expected int [signed] [assigned] dst_opt_type @@    got restricted __be16 [usertype] <noident> @@
   net/openvswitch/flow_netlink.c:2532:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2532:38:    got restricted __be16 [usertype] <noident>
   net/openvswitch/flow_netlink.c:2535:38: sparse: incorrect type in assignment (different base types) @@    expected int [signed] [assigned] dst_opt_type @@    got restricted __be16 [usertype] <noident> @@
   net/openvswitch/flow_netlink.c:2535:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2535:38:    got restricted __be16 [usertype] <noident>
   net/openvswitch/flow_netlink.c:2538:38: sparse: incorrect type in assignment (different base types) @@    expected int [signed] [assigned] dst_opt_type @@    got restricted __be16 [usertype] <noident> @@
   net/openvswitch/flow_netlink.c:2538:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2538:38:    got restricted __be16 [usertype] <noident>
>> net/openvswitch/flow_netlink.c:2581:51: sparse: incorrect type in argument 4 (different base types) @@    expected restricted __be16 [usertype] flags @@    got int [signed] [assigned] dst_opt_type @@
   net/openvswitch/flow_netlink.c:2581:51:    expected restricted __be16 [usertype] flags
   net/openvswitch/flow_netlink.c:2581:51:    got int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:3064:39: sparse: expression using sizeof(void)

vim +2532 net/openvswitch/flow_netlink.c

  2508	
  2509	static int validate_and_copy_set_tun(const struct nlattr *attr,
  2510					     struct sw_flow_actions **sfa, bool log)
  2511	{
  2512		struct sw_flow_match match;
  2513		struct sw_flow_key key;
  2514		struct metadata_dst *tun_dst;
  2515		struct ip_tunnel_info *tun_info;
  2516		struct ovs_tunnel_info *ovs_tun;
  2517		struct nlattr *a;
  2518		int err = 0, start, opts_type, dst_opt_type;
  2519	
  2520		dst_opt_type = 0;
  2521		ovs_match_init(&match, &key, true, NULL);
  2522		opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
  2523		if (opts_type < 0)
  2524			return opts_type;
  2525	
  2526		if (key.tun_opts_len) {
  2527			switch (opts_type) {
  2528			case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
  2529				err = validate_geneve_opts(&key);
  2530				if (err < 0)
  2531					return err;
> 2532				dst_opt_type = TUNNEL_GENEVE_OPT;
  2533				break;
  2534			case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
  2535				dst_opt_type = TUNNEL_VXLAN_OPT;
  2536				break;
  2537			case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
  2538				dst_opt_type = TUNNEL_ERSPAN_OPT;
  2539				break;
  2540			}
  2541		}
  2542	
  2543		start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
  2544		if (start < 0)
  2545			return start;
  2546	
  2547		tun_dst = metadata_dst_alloc(key.tun_opts_len, METADATA_IP_TUNNEL,
  2548					     GFP_KERNEL);
  2549	
  2550		if (!tun_dst)
  2551			return -ENOMEM;
  2552	
  2553		err = dst_cache_init(&tun_dst->u.tun_info.dst_cache, GFP_KERNEL);
  2554		if (err) {
  2555			dst_release((struct dst_entry *)tun_dst);
  2556			return err;
  2557		}
  2558	
  2559		a = __add_action(sfa, OVS_KEY_ATTR_TUNNEL_INFO, NULL,
  2560				 sizeof(*ovs_tun), log);
  2561		if (IS_ERR(a)) {
  2562			dst_release((struct dst_entry *)tun_dst);
  2563			return PTR_ERR(a);
  2564		}
  2565	
  2566		ovs_tun = nla_data(a);
  2567		ovs_tun->tun_dst = tun_dst;
  2568	
  2569		tun_info = &tun_dst->u.tun_info;
  2570		tun_info->mode = IP_TUNNEL_INFO_TX;
  2571		if (key.tun_proto == AF_INET6)
  2572			tun_info->mode |= IP_TUNNEL_INFO_IPV6;
  2573		tun_info->key = key.tun_key;
  2574	
  2575		/* We need to store the options in the action itself since
  2576		 * everything else will go away after flow setup. We can append
  2577		 * it to tun_info and then point there.
  2578		 */
  2579		ip_tunnel_info_opts_set(tun_info,
  2580					TUN_METADATA_OPTS(&key, key.tun_opts_len),
> 2581					key.tun_opts_len, dst_opt_type);
  2582		add_nested_action_end(*sfa, start);
  2583	
  2584		return err;
  2585	}
  2586	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH net] bpfilter: include bpfilter_umh in assembly instead of using objcopy
From: Alexei Starovoitov @ 2018-06-27  3:13 UTC (permalink / raw)
  To: David S . Miller
  Cc: daniel, torvalds, mcroce, yamada.masahiro, linux, netdev,
	kernel-team

From: Masahiro Yamada <yamada.masahiro@socionext.com>

What we want here is to embed a user-space program into the kernel.
Instead of the complex ELF magic, let's simply wrap it in the assembly
with the '.incbin' directive.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
I think this patch should 'fix' the bpfilter build issue on all archs.
The cflags for a cross CC may still be incorrect, and the embedded blob
may fail to execute via fork_usermode_blob()
(e.g. with 'make ARCH=i386 net/bpfilter/' CC will build and link a 64-bit
binary that gets included into bpfilter.o or vmlinux, and that binary
will fail to run on a 32-bit kernel),
but that is a separate issue that will be addressed in the net-next time frame.
Long term, we've discussed switching to something like klibc and keeping it
as part of the kernel to avoid relying on glibc and cc-can-link.sh.

 net/bpfilter/Makefile            | 17 ++---------------
 net/bpfilter/bpfilter_kern.c     | 11 +++++------
 net/bpfilter/bpfilter_umh_blob.S |  7 +++++++
 3 files changed, 14 insertions(+), 21 deletions(-)
 create mode 100644 net/bpfilter/bpfilter_umh_blob.S

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 051dc18b8ccb..39c6980b5d99 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -15,20 +15,7 @@ ifeq ($(CONFIG_BPFILTER_UMH), y)
 HOSTLDFLAGS += -static
 endif
 
-# a bit of elf magic to convert bpfilter_umh binary into a binary blob
-# inside bpfilter_umh.o elf file referenced by
-# _binary_net_bpfilter_bpfilter_umh_start symbol
-# which bpfilter_kern.c passes further into umh blob loader at run-time
-quiet_cmd_copy_umh = GEN $@
-      cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
-      $(OBJCOPY) -I binary \
-          `LC_ALL=C $(OBJDUMP) -f net/bpfilter/bpfilter_umh \
-          |awk -F' |,' '/file format/{print "-O",$$NF} \
-          /^architecture:/{print "-B",$$2}'` \
-      --rename-section .data=.init.rodata $< $@
-
-$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
-	$(call cmd,copy_umh)
+$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh
 
 obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
-bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
+bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 09522573f611..f0fc182d3db7 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -10,11 +10,8 @@
 #include <linux/file.h>
 #include "msgfmt.h"
 
-#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
-#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
-
-extern char UMH_start;
-extern char UMH_end;
+extern char bpfilter_umh_start;
+extern char bpfilter_umh_end;
 
 static struct umh_info info;
 /* since ip_getsockopt() can run in parallel, serialize access to umh */
@@ -93,7 +90,9 @@ static int __init load_umh(void)
 	int err;
 
 	/* fork usermode process */
-	err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+	err = fork_usermode_blob(&bpfilter_umh_start,
+				 &bpfilter_umh_end - &bpfilter_umh_start,
+				 &info);
 	if (err)
 		return err;
 	pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S
new file mode 100644
index 000000000000..40311d10d2f2
--- /dev/null
+++ b/net/bpfilter/bpfilter_umh_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .init.rodata, "a"
+	.global bpfilter_umh_start
+bpfilter_umh_start:
+	.incbin "net/bpfilter/bpfilter_umh"
+	.global bpfilter_umh_end
+bpfilter_umh_end:
-- 
2.17.1

^ permalink raw reply related

* [PATCH 1/1] esp6: fix memleak on error path in esp6_input
From: Zhen Lei @ 2018-06-27  3:49 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, netdev, linux-kernel
  Cc: Zhen Lei, Hanjun Guo, Libin, YueHaibing

This appears to be an omission in e6194923237 ("esp: Fix memleaks on error
paths."). The memleak on the error path in esp6_input is similar to the one
in esp_input of esp4: tmp is not freed when skb_to_sgvec() fails.

Fixes: e6194923237 ("esp: Fix memleaks on error paths.")
Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always")

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 net/ipv6/esp6.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 97513f3..88a7579 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -669,8 +669,10 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)

 	sg_init_table(sg, nfrags);
 	ret = skb_to_sgvec(skb, sg, 0, skb->len);
-	if (unlikely(ret < 0))
+	if (unlikely(ret < 0)) {
+		kfree(tmp);
 		goto out;
+	}

 	skb->ip_summed = CHECKSUM_NONE;

--
1.8.3

^ permalink raw reply related

* Re: [PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: kbuild test robot @ 2018-06-27  4:22 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet
In-Reply-To: <20180627015222.3269067-1-brakmo@fb.com>

Hi Lawrence,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/tcp-force-cwnd-at-least-2-in-tcp_cwnd_reduction/20180627-095533
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv4/tcp_input.c:168:42: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:168:42: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:213:21: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:213:21: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:329:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:329:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:336:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:337:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:337:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:347:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:347:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:412:32: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:412:32: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:413:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:413:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:436:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:436:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:464:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:464:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:473:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:473:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:475:28: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:475:28: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:492:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:492:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:496:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:496:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:509:29: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:509:29: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:511:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:511:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:512:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:513:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:652:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:652:26: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:790:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:790:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:794:23: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:818:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:818:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:827:9: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:827:9: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:867:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:867:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:902:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:902:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1654:25: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1848:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1849:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1849:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1869:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1869:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1896:34: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1995:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1995:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2024:39: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2024:39: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2400:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2476:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2476:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2479:18: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2479:18: sparse: expression using sizeof(void)
>> net/ipv4/tcp_input.c:2480:24: sparse: incompatible types in comparison expression (different signedness)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2990:48: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:3195:46: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:3195:46: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:1206:16: sparse: expression using sizeof(void)
   include/net/tcp.h:1215:31: sparse: expression using sizeof(void)
   include/net/tcp.h:1215:31: sparse: too many warnings
   In file included from include/asm-generic/bug.h:18:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from net/ipv4/tcp_input.c:67:
   net/ipv4/tcp_input.c: In function 'tcp_cwnd_reduction':
   include/linux/kernel.h:812:29: warning: comparison of distinct pointer types lacks a cast
      (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                ^
   include/linux/kernel.h:826:4: note: in expansion of macro '__typecheck'
      (__typecheck(x, y) && __no_side_effects(x, y))
       ^~~~~~~~~~~
   include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
     __builtin_choose_expr(__safe_cmp(x, y), \
                           ^~~~~~~~~~
   include/linux/kernel.h:852:19: note: in expansion of macro '__careful_cmp'
    #define max(x, y) __careful_cmp(x, y, >)
                      ^~~~~~~~~~~~~
   net/ipv4/tcp_input.c:2480:17: note: in expansion of macro 'max'
     tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
                    ^~~

vim +2480 net/ipv4/tcp_input.c

  2455	
  2456	void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
  2457	{
  2458		struct tcp_sock *tp = tcp_sk(sk);
  2459		int sndcnt = 0;
  2460		int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
  2461	
  2462		if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
  2463			return;
  2464	
  2465		tp->prr_delivered += newly_acked_sacked;
  2466		if (delta < 0) {
  2467			u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
  2468				       tp->prior_cwnd - 1;
  2469			sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
  2470		} else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
  2471			   !(flag & FLAG_LOST_RETRANS)) {
> 2472			sndcnt = min_t(int, delta,
  2473				       max_t(int, tp->prr_delivered - tp->prr_out,
  2474					     newly_acked_sacked) + 1);
  2475		} else {
  2476			sndcnt = min(delta, newly_acked_sacked);
  2477		}
  2478		/* Force a fast retransmit upon entering fast recovery */
  2479		sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> 2480		tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
  2481	}
  2482	
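
The flagged statement at line 2480 mixes types: tcp_packets_in_flight()
returns unsigned int, so the sum with sndcnt is unsigned, while the literal 2
is a signed int, and the kernel's max() rejects operands of different types.
Below is a minimal userspace sketch of that check, assuming GCC or Clang
extensions. The names same_type_check, strict_max, max_as and clamp_cwnd are
hypothetical and are not kernel macros; max_as only mirrors the idea behind
max_t(), and casting is one possible way to silence the warning, not
necessarily the fix the patch author will choose.

```c
#include <assert.h>

/* Always evaluates to 1. As with the kernel's __typecheck(), the compiler
 * warns about the pointer comparison when x and y have different types.
 */
#define same_type_check(x, y) \
	(!!(sizeof((__typeof__(x) *)1 == (__typeof__(y) *)1)))

/* Hypothetical strict max: both operands must already share a type. */
#define strict_max(x, y) ({			\
	__typeof__(x) _a = (x);			\
	__typeof__(y) _b = (y);			\
	(void)same_type_check(_a, _b);		\
	_a > _b ? _a : _b; })

/* Hypothetical counterpart of max_t(): cast both operands first. */
#define max_as(type, x, y) strict_max((type)(x), (type)(y))

/* Models the flagged statement with the operand types made explicit. */
static unsigned int clamp_cwnd(unsigned int in_flight, int sndcnt)
{
	/* In the quoted function sndcnt has been clamped to >= 0 by now. */
	assert(sndcnt >= 0);

	/* in_flight + sndcnt is unsigned int; casting 2 makes types match. */
	return max_as(unsigned int, in_flight + sndcnt, 2);
}
```

Writing strict_max(in_flight + sndcnt, 2) instead would compile but trigger
the "comparison of distinct pointer types" warning quoted above.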

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH net-next v2 0/4] net: Geneve options support for TC act_tunnel_key
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Jakub Kicinski

Hi,

Simon & Pieter say:

This set adds Geneve Options support to the TC tunnel key action.
It provides the plumbing required to configure Geneve variable length
options.  The options can be configured in the form CLASS:TYPE:DATA,
where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.
Additionally multiple options may be listed using a comma delimiter.
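
To illustrate the CLASS:TYPE:DATA form, here is a minimal userspace sketch
that parses a single option token such as 0102:80:00800022. It is not the
iproute2 parser: struct geneve_opt_text, hex_val, read_hex and
parse_geneve_opt are hypothetical names, and the fixed widths (4 hex digits
for CLASS, 2 for TYPE) are an assumption taken from the example in patch 4.
The data length limits (4 to 128 bytes, in multiples of 4) follow the checks
in that patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct geneve_opt_text {
	uint16_t opt_class;	/* host byte order in this sketch */
	uint8_t type;
	uint8_t data[128];
	size_t data_len;	/* in bytes */
};

static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Read exactly ndigits hex digits from s into *val; 0 on success. */
static int read_hex(const char *s, size_t ndigits, unsigned int *val)
{
	unsigned int v = 0;

	for (size_t i = 0; i < ndigits; i++) {
		int d = hex_val(s[i]);

		if (d < 0)	/* also stops at the terminating NUL */
			return -1;
		v = v * 16 + (unsigned int)d;
	}
	*val = v;
	return 0;
}

/* Parse one "CLASS:TYPE:DATA" token. Returns 0 on success, -1 on error. */
static int parse_geneve_opt(const char *s, struct geneve_opt_text *out)
{
	unsigned int cls, type, byte;
	size_t hex_len, i;

	assert(s && out);

	/* Assumed layout: 4 hex digits, ':', 2 hex digits, ':', data. */
	if (read_hex(s, 4, &cls) || s[4] != ':' ||
	    read_hex(s + 5, 2, &type) || s[7] != ':')
		return -1;

	s += 8;
	hex_len = strlen(s);
	/* Data must be 4..128 bytes long and a multiple of 4 bytes. */
	if (hex_len < 8 || hex_len % 8 || hex_len > 2 * sizeof(out->data))
		return -1;

	for (i = 0; i < hex_len / 2; i++) {
		if (read_hex(s + 2 * i, 2, &byte))
			return -1;
		out->data[i] = (uint8_t)byte;
	}

	out->opt_class = (uint16_t)cls;
	out->type = (uint8_t)type;
	out->data_len = hex_len / 2;
	return 0;
}
```

A comma-delimited list would be split on ',' first and each token passed to
parse_geneve_opt() in turn.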

v2:
 - fix sparse warnings in patches 3 and 4 (first one reported by
   build bot).

Pieter Jansen van Vuuren (1):
  net: check tunnel option type in tunnel flags

Simon Horman (3):
  net/sched: act_tunnel_key: disambiguate metadata dst error cases
  net/sched: act_tunnel_key: add extended ack support
  net/sched: add tunnel option support to act_tunnel_key

 drivers/net/geneve.c                      |   6 +-
 drivers/net/vxlan.c                       |   3 +-
 include/net/ip_tunnels.h                  |   8 +-
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/core/filter.c                         |   2 +-
 net/ipv4/ip_gre.c                         |   2 +
 net/ipv6/ip6_gre.c                        |   2 +
 net/openvswitch/flow_netlink.c            |   7 +-
 net/sched/act_tunnel_key.c                | 246 +++++++++++++++++++++-
 9 files changed, 284 insertions(+), 18 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH net-next v2 1/4] net/sched: act_tunnel_key: disambiguate metadata dst error cases
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst

Disambiguate these cases by returning -EINVAL for the former and -ENOMEM
for the latter, rather than -EINVAL for both.

This is in preparation for using extended ack to provide more information
to users when parsing their input.
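
The distinction can be sketched in userspace as follows. This is only a model
of the control flow, not the kernel code: struct metadata_model and
make_metadata are hypothetical, have_addrs stands in for "the user supplied
source and destination addresses", and alloc_ok lets a caller simulate an
allocation failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct metadata_model {
	int placeholder;
};

/* Returns 0 and sets *out on success, or a negative errno on failure. */
static int make_metadata(bool have_addrs, bool alloc_ok,
			 struct metadata_model **out)
{
	struct metadata_model *md;

	assert(out);
	*out = NULL;

	if (!have_addrs)
		return -EINVAL;		/* missing user input */

	md = alloc_ok ? calloc(1, sizeof(*md)) : NULL;
	if (!md)
		return -ENOMEM;		/* allocation failed */

	*out = md;
	return 0;
}
```

With separate codes, a caller can attach a different message to each failure
instead of guessing which one occurred.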

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/act_tunnel_key.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 626dac81a48a..2edd389e7c92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -143,10 +143,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			metadata = __ipv6_tun_set_dst(&saddr, &daddr, 0, 0, dst_port,
 						      0, flags,
 						      key_id, 0);
+		} else {
+			ret = -EINVAL;
+			goto err_out;
 		}
 
 		if (!metadata) {
-			ret = -EINVAL;
+			ret = -ENOMEM;
 			goto err_out;
 		}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 2/4] net/sched: act_tunnel_key: add extended ack support
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman, Alexander Aring, Pieter Jansen van Vuuren
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.

Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
---
 net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 2edd389e7c92..20e98ed8d498 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -86,16 +86,22 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	int ret = 0;
 	int err;
 
-	if (!nla)
+	if (!nla) {
+		NL_SET_ERR_MSG(extack, "Tunnel requires attributes to be passed");
 		return -EINVAL;
+	}
 
 	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy,
-			       NULL);
-	if (err < 0)
+			       extack);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Failed to parse nested tunnel key attributes");
 		return err;
+	}
 
-	if (!tb[TCA_TUNNEL_KEY_PARMS])
+	if (!tb[TCA_TUNNEL_KEY_PARMS]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key parameters");
 		return -EINVAL;
+	}
 
 	parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
 	exists = tcf_idr_check(tn, parm->index, a, bind);
@@ -107,6 +113,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		break;
 	case TCA_TUNNEL_KEY_ACT_SET:
 		if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+			NL_SET_ERR_MSG(extack, "Missing tunnel key id");
 			ret = -EINVAL;
 			goto err_out;
 		}
@@ -144,11 +151,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 						      0, flags,
 						      key_id, 0);
 		} else {
+			NL_SET_ERR_MSG(extack, "Missing either ipv4 or ipv6 src and dst");
 			ret = -EINVAL;
 			goto err_out;
 		}
 
 		if (!metadata) {
+			NL_SET_ERR_MSG(extack, "Cannot allocate tunnel metadata dst");
 			ret = -ENOMEM;
 			goto err_out;
 		}
@@ -156,6 +165,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
+		NL_SET_ERR_MSG(extack, "Unknown tunnel key action");
 		ret = -EINVAL;
 		goto err_out;
 	}
@@ -163,14 +173,18 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_tunnel_key_ops, bind, true);
-		if (ret)
+		if (ret) {
+			NL_SET_ERR_MSG(extack, "Cannot create TC IDR");
 			return ret;
+		}
 
 		ret = ACT_P_CREATED;
 	} else {
 		tcf_idr_release(*a, bind);
-		if (!ovr)
+		if (!ovr) {
+			NL_SET_ERR_MSG(extack, "TC IDR already exists");
 			return -EEXIST;
+		}
 	}
 
 	t = to_tunnel_key(*a);
@@ -180,6 +194,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (unlikely(!params_new)) {
 		if (ret == ACT_P_CREATED)
 			tcf_idr_release(*a, bind);
+		NL_SET_ERR_MSG(extack, "Cannot allocate tunnel key parameters");
 		return -ENOMEM;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Pieter Jansen van Vuuren, Jakub Kicinski, Daniel Borkmann
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

Check the tunnel option type stored in tunnel flags when creating options
for tunnels. This ensures we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set the correct flags. For the
BPF helper we have to set all bits to keep backward compatibility.
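
The idea can be sketched in userspace as: record the option type when options
are stored, and check it before interpreting them. Everything below is a
simplified model. struct tun_info_model, opts_set, opts_get_typed and the
OPT_* bit values are hypothetical and do not match the kernel's
TUNNEL_*_OPT definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits only; not the kernel's TUNNEL_*_OPT values. */
enum {
	OPT_GENEVE = 1 << 0,
	OPT_VXLAN  = 1 << 1,
	OPT_ERSPAN = 1 << 2,
	OPT_ANY    = OPT_GENEVE | OPT_VXLAN | OPT_ERSPAN,
};

struct tun_info_model {
	unsigned int flags;
	size_t options_len;
	unsigned char options[255];
};

/* Store the options and remember which tunnel type they belong to. */
static void opts_set(struct tun_info_model *info, const void *from,
		     size_t len, unsigned int flags)
{
	assert(len <= sizeof(info->options));
	memcpy(info->options, from, len);
	info->options_len = len;
	info->flags |= flags;
}

/* Return the options only if they were stored for the given type. */
static const unsigned char *
opts_get_typed(const struct tun_info_model *info, unsigned int type)
{
	if (!info->options_len || !(info->flags & type))
		return NULL;
	return info->options;
}
```

A caller that cannot know the option type, like the BPF helper mentioned
above, would pass OPT_ANY in this model so that every consumer still accepts
the options.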

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
CC: Daniel Borkmann <daniel@iogearbox.net>

v2:
 - use __be16 for dst_opt_type in net/openvswitch/flow_netlink.c (build bot).
---
 drivers/net/geneve.c           | 6 ++++--
 drivers/net/vxlan.c            | 3 ++-
 include/net/ip_tunnels.h       | 8 ++++++--
 net/core/filter.c              | 2 +-
 net/ipv4/ip_gre.c              | 2 ++
 net/ipv6/ip6_gre.c             | 2 ++
 net/openvswitch/flow_netlink.c | 7 ++++++-
 7 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 3e94375b9b01..471edd76ff55 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -236,7 +236,8 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs,
 		}
 		/* Update tunnel dst according to Geneve options. */
 		ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
-					gnvh->options, gnvh->opt_len * 4);
+					gnvh->options, gnvh->opt_len * 4,
+					TUNNEL_GENEVE_OPT);
 	} else {
 		/* Drop packets w/ critical options,
 		 * since we don't support any...
@@ -675,7 +676,8 @@ static void geneve_build_header(struct genevehdr *geneveh,
 	geneveh->proto_type = htons(ETH_P_TEB);
 	geneveh->rsvd2 = 0;
 
-	ip_tunnel_info_opts_get(geneveh->options, info);
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT)
+		ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
 static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index cc14e0cd5647..7eb30d7c8bd7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2122,7 +2122,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		vni = tunnel_id_to_key32(info->key.tun_id);
 		ifindex = 0;
 		dst_cache = &info->dst_cache;
-		if (info->options_len)
+		if (info->options_len &&
+		    info->key.tun_flags & TUNNEL_VXLAN_OPT)
 			md = ip_tunnel_info_opts(info);
 		ttl = info->key.ttl;
 		tos = info->key.tos;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 90ff430f5e9d..b0d022ff6ea1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -466,10 +466,12 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	memcpy(ip_tunnel_info_opts(info), from, len);
 	info->options_len = len;
+	info->key.tun_flags |= flags;
 }
 
 static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
@@ -511,9 +513,11 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	info->options_len = 0;
+	info->key.tun_flags |= flags;
 }
 
 #endif /* CONFIG_INET */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..dade922678f6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3582,7 +3582,7 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
 	if (unlikely(size > IP_TUNNEL_OPTS_MAX))
 		return -ENOMEM;
 
-	ip_tunnel_info_opts_set(info, from, size);
+	ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT);
 
 	return 0;
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2d8efeecf619..c8ca5d8f0f75 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -587,6 +587,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto err_free_skb;
 
 	key = &tun_info->key;
+	if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+		goto err_free_rt;
 	md = ip_tunnel_info_opts(tun_info);
 	if (!md)
 		goto err_free_rt;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c8cf2fdbb13b..367177786e34 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -990,6 +990,8 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 		fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
 		dsfield = key->tos;
+		if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+			goto tx_err;
 		md = ip_tunnel_info_opts(tun_info);
 		if (!md)
 			goto tx_err;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 492ab0c36f7c..391c4073a6dc 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2516,7 +2516,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	struct ovs_tunnel_info *ovs_tun;
 	struct nlattr *a;
 	int err = 0, start, opts_type;
+	__be16 dst_opt_type;
 
+	dst_opt_type = 0;
 	ovs_match_init(&match, &key, true, NULL);
 	opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
 	if (opts_type < 0)
@@ -2528,10 +2530,13 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 			err = validate_geneve_opts(&key);
 			if (err < 0)
 				return err;
+			dst_opt_type = TUNNEL_GENEVE_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			dst_opt_type = TUNNEL_VXLAN_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+			dst_opt_type = TUNNEL_ERSPAN_OPT;
 			break;
 		}
 	}
@@ -2574,7 +2579,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	 */
 	ip_tunnel_info_opts_set(tun_info,
 				TUN_METADATA_OPTS(&key, key.tun_opts_len),
-				key.tun_opts_len);
+				key.tun_opts_len, dst_opt_type);
 	add_nested_action_end(*sfa, start);
 
 	return err;
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 4/4] net/sched: add tunnel option support to act_tunnel_key
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman, Pieter Jansen van Vuuren
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
v2:
 - use nla_get_be16()/nla_put_be16() for TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS
   as struct geneve_opt :: opt_class is __be16.
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/sched/act_tunnel_key.c                | 214 +++++++++++++++++++++-
 2 files changed, 236 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index 72bbefe5d1d1..e284fec8c467 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -36,9 +36,35 @@ enum {
 	TCA_TUNNEL_KEY_PAD,
 	TCA_TUNNEL_KEY_ENC_DST_PORT,	/* be16 */
 	TCA_TUNNEL_KEY_NO_CSUM,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPTS,	/* Nested TCA_TUNNEL_KEY_ENC_OPTS_
+					 * attributes
+					 */
 	__TCA_TUNNEL_KEY_MAX,
 };
 
 #define TCA_TUNNEL_KEY_MAX (__TCA_TUNNEL_KEY_MAX - 1)
 
+enum {
+	TCA_TUNNEL_KEY_ENC_OPTS_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPTS_GENEVE,		/* Nested
+						 * TCA_TUNNEL_KEY_ENC_OPTS_
+						 * attributes
+						 */
+	__TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPTS_MAX (__TCA_TUNNEL_KEY_ENC_OPTS_MAX - 1)
+
+enum {
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,		/* be16 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,		/* 4 to 128 bytes */
+
+	__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX \
+	(__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX - 1)
+
 #endif
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 20e98ed8d498..ea203e386a92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -13,6 +13,7 @@
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <net/geneve.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -57,6 +58,135 @@ static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
 	return action;
 }
 
+static const struct nla_policy
+enc_opts_policy[TCA_TUNNEL_KEY_ENC_OPTS_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPTS_GENEVE]	= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+geneve_opt_policy[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]	   = { .type = NLA_U16 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]	   = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]	   = { .type = NLA_BINARY,
+						       .len = 128 },
+};
+
+static int
+tunnel_key_copy_geneve_opt(const struct nlattr *nla, void *dst, int dst_len,
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
+	int err, data_len, opt_len;
+	u8 *data;
+
+	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+			       nla, geneve_opt_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key geneve option class, type or data");
+		return -EINVAL;
+	}
+
+	data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	if (data_len < 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is less than 4 bytes long");
+		return -ERANGE;
+	}
+	if (data_len % 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is not a multiple of 4 bytes long");
+		return -ERANGE;
+	}
+
+	opt_len = sizeof(struct geneve_opt) + data_len;
+	if (dst) {
+		struct geneve_opt *opt = dst;
+
+		WARN_ON(dst_len < opt_len);
+
+		opt->opt_class =
+			nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
+		opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
+		opt->length = data_len / 4; /* length is in units of 4 bytes */
+		opt->r1 = 0;
+		opt->r2 = 0;
+		opt->r3 = 0;
+
+		memcpy(opt + 1, data, data_len);
+	}
+
+	return opt_len;
+}
+
+static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
+				int dst_len, struct netlink_ext_ack *extack)
+{
+	int err, rem, opt_len, len = nla_len(nla), opts_len = 0;
+	const struct nlattr *attr, *head = nla_data(nla);
+
+	err = nla_validate(head, len, TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+			   enc_opts_policy, extack);
+	if (err)
+		return err;
+
+	nla_for_each_attr(attr, head, len, rem) {
+		switch (nla_type(attr)) {
+		case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+			opt_len = tunnel_key_copy_geneve_opt(attr, dst,
+							     dst_len, extack);
+			if (opt_len < 0)
+				return opt_len;
+			opts_len += opt_len;
+			if (dst) {
+				dst_len -= opt_len;
+				dst += opt_len;
+			}
+			break;
+		}
+	}
+
+	if (!opts_len) {
+		NL_SET_ERR_MSG(extack, "Empty list of tunnel options");
+		return -EINVAL;
+	}
+
+	if (rem > 0) {
+		NL_SET_ERR_MSG(extack, "Trailing data after parsing tunnel key options attributes");
+		return -EINVAL;
+	}
+
+	return opts_len;
+}
+
+static int tunnel_key_get_opts_len(struct nlattr *nla,
+				   struct netlink_ext_ack *extack)
+{
+	return tunnel_key_copy_opts(nla, NULL, 0, extack);
+}
+
+static int tunnel_key_opts_set(struct nlattr *nla, struct ip_tunnel_info *info,
+			       int opts_len, struct netlink_ext_ack *extack)
+{
+	info->options_len = opts_len;
+	switch (nla_type(nla_data(nla))) {
+	case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+#if IS_ENABLED(CONFIG_INET)
+		info->key.tun_flags |= TUNNEL_GENEVE_OPT;
+		return tunnel_key_copy_opts(nla, ip_tunnel_info_opts(info),
+					    opts_len, extack);
+#else
+		return -EAFNOSUPPORT;
+#endif
+	default:
+		NL_SET_ERR_MSG(extack, "Cannot set tunnel options for unknown tunnel type");
+		return -EINVAL;
+	}
+}
+
 static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_PARMS]	    = { .len = sizeof(struct tc_tunnel_key) },
 	[TCA_TUNNEL_KEY_ENC_IPV4_SRC] = { .type = NLA_U32 },
@@ -66,6 +196,7 @@ static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_ENC_KEY_ID]   = { .type = NLA_U32 },
 	[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
 	[TCA_TUNNEL_KEY_NO_CSUM]      = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPTS]     = { .type = NLA_NESTED },
 };
 
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,6 +212,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	struct tcf_tunnel_key *t;
 	bool exists = false;
 	__be16 dst_port = 0;
+	int opts_len = 0;
 	__be64 key_id;
 	__be16 flags;
 	int ret = 0;
@@ -128,6 +260,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
 			dst_port = nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
 
+		if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+			opts_len = tunnel_key_get_opts_len(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+							   extack);
+			if (opts_len < 0) {
+				ret = opts_len;
+				goto err_out;
+			}
+		}
+
 		if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
 		    tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
 			__be32 saddr;
@@ -138,7 +279,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 
 			metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
 						    dst_port, flags,
-						    key_id, 0);
+						    key_id, opts_len);
 		} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
 			   tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
 			struct in6_addr saddr;
@@ -162,6 +303,14 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			goto err_out;
 		}
 
+		if (opts_len) {
+			ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+						  &metadata->u.tun_info,
+						  opts_len, extack);
+			if (ret < 0)
+				goto err_out;
+		}
+
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
@@ -234,6 +383,61 @@ static void tunnel_key_release(struct tc_action *a)
 	}
 }
 
+static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
+				       const struct ip_tunnel_info *info)
+{
+	int len = info->options_len;
+	u8 *src = (u8 *)(info + 1);
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS_GENEVE);
+	if (!start)
+		return -EMSGSIZE;
+
+	while (len > 0) {
+		struct geneve_opt *opt = (struct geneve_opt *)src;
+
+		if (nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
+				 opt->opt_class) ||
+		    nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
+			       opt->type) ||
+		    nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
+			    opt->length * 4, opt + 1))
+			return -EMSGSIZE;
+
+		len -= sizeof(struct geneve_opt) + opt->length * 4;
+		src += sizeof(struct geneve_opt) + opt->length * 4;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
+static int tunnel_key_opts_dump(struct sk_buff *skb,
+				const struct ip_tunnel_info *info)
+{
+	struct nlattr *start;
+	int err;
+
+	if (!info->options_len)
+		return 0;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT) {
+		err = tunnel_key_geneve_opts_dump(skb, info);
+		if (err)
+			return err;
+	} else {
+		return -EINVAL;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
 static int tunnel_key_dump_addresses(struct sk_buff *skb,
 				     const struct ip_tunnel_info *info)
 {
@@ -284,8 +488,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 		goto nla_put_failure;
 
 	if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
-		struct ip_tunnel_key *key =
-			&params->tcft_enc_metadata->u.tun_info.key;
+		struct ip_tunnel_info *info =
+			&params->tcft_enc_metadata->u.tun_info;
+		struct ip_tunnel_key *key = &info->key;
 		__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
 		if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -293,7 +498,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 					      &params->tcft_enc_metadata->u.tun_info) ||
 		    nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
 		    nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
-			       !(key->tun_flags & TUNNEL_CSUM)))
+			       !(key->tun_flags & TUNNEL_CSUM)) ||
+		    tunnel_key_opts_dump(skb, info))
 			goto nla_put_failure;
 	}
 
-- 
2.17.1
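
The two-pass pattern in tunnel_key_copy_opts() above (one call with a NULL destination to compute the length, a second call to copy) can be sketched in user space. This is only an illustration under stated assumptions: demo_geneve_opt and demo_pack_geneve_opt are hypothetical names, the header uses a whole byte for the length where the kernel's struct geneve_opt uses bitfields, and the -ENOSPC return stands in for the patch's WARN_ON().

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte option header for this sketch only. */
struct demo_geneve_opt {
	uint16_t opt_class;	/* stored as passed in; caller supplies wire order */
	uint8_t type;
	uint8_t length;		/* data length in units of 4 bytes */
};

/*
 * Returns the number of bytes the option occupies, or a negative errno.
 * With dst == NULL nothing is written, so the same function can be used
 * to size a buffer before filling it.
 */
static int demo_pack_geneve_opt(uint8_t *dst, int dst_len,
				uint16_t opt_class, uint8_t type,
				const uint8_t *data, int data_len)
{
	int opt_len;

	/* Same validation as the patch: at least 4 bytes, multiple of 4. */
	if (data_len < 4 || data_len % 4)
		return -ERANGE;

	opt_len = (int)sizeof(struct demo_geneve_opt) + data_len;
	if (dst) {
		struct demo_geneve_opt hdr;

		/* Assumption: fail instead of warning when the buffer is short. */
		if (dst_len < opt_len)
			return -ENOSPC;

		hdr.opt_class = opt_class;
		hdr.type = type;
		hdr.length = (uint8_t)(data_len / 4);
		memcpy(dst, &hdr, sizeof(hdr));
		memcpy(dst + sizeof(hdr), data, (size_t)data_len);
	}

	return opt_len;
}
```

Calling it first with dst == NULL and then with a buffer of the returned size mirrors how tunnel_key_init() obtains opts_len before tunnel_key_opts_set() fills the metadata.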

* Re: [PATCH v2] fib_rules: match rules based on suppress_* properties too
From: Roopa Prabhu @ 2018-06-27  4:53 UTC (permalink / raw)
  To: David Miller; +Cc: Jason A. Donenfeld, netdev
In-Reply-To: <20180627.103417.614359212764375850.davem@davemloft.net>

On Tue, Jun 26, 2018 at 6:34 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Tue, 26 Jun 2018 01:39:32 +0200
>
>> Two rules with different values of suppress_prefix or suppress_ifgroup
>> are not the same. This fixes an -EEXIST when running:
>>
>>    $ ip -4 rule add table main suppress_prefixlength 0
>>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
>
> Roopa, thanks for doing all of that analysis.
>
> I think applying this patch makes the most sense at this point,
> so that is what I have done.


Thanks, will keep an eye out and add some more tests
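
The failure described in the quoted commit message can be illustrated with a small user-space sketch. It assumes a simplified rule with only three fields; demo_rule and demo_rule_same are hypothetical names, and -1 as the "not set" value is an assumption of this sketch rather than something stated in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a FIB rule. */
struct demo_rule {
	uint32_t table;
	int suppress_prefixlen;	/* -1: not set (assumption of this sketch) */
	int suppress_ifgroup;	/* -1: not set (assumption of this sketch) */
};

/*
 * Two rules count as the same only if the suppress_* properties match
 * as well. Leaving those two comparisons out makes a new rule that
 * differs only in suppress_prefixlen look like a duplicate.
 */
static bool demo_rule_same(const struct demo_rule *a,
			   const struct demo_rule *b)
{
	return a->table == b->table &&
	       a->suppress_prefixlen == b->suppress_prefixlen &&
	       a->suppress_ifgroup == b->suppress_ifgroup;
}
```

With this comparison, a rule for table main with suppress_prefixlen 0 is distinct from an existing rule for table main without it, so adding it would no longer be rejected as already existing.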

* Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
From: Leon Romanovsky @ 2018-06-27  5:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, RDMA mailing list, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev
In-Reply-To: <20180626203921.GF5381@ziepe.ca>

On Tue, Jun 26, 2018 at 02:39:21PM -0600, Jason Gunthorpe wrote:
> On Tue, Jun 26, 2018 at 07:21:26AM +0300, Leon Romanovsky wrote:
> > On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> > > On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > >
> > > > Hi,
> > > >
> > > > This is a bunch of patches triggered by running syzkaller internally.
> > > >
> > > > I'm sending them based on rdma-next mainly for two reasons:
> > > > 1. Most of the patches fix old issues, and it doesn't matter when
> > > > they hit Linus's tree: now, or in a couple of weeks
> > > > during the merge window.
> > > > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > > > feedback on the flow counters series.
> > > >
> > > > Thanks
> > > >
> > > > Leon Romanovsky (12):
> > > >   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> > > >   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
> > >
> > > I applied these two to for-rc
> > >
> > > >   RDMA/uverbs: Check existence of create_flow callback
> > > >   RDMA/verbs: Drop kernel variant of create_flow
> > > >   RDMA/verbs: Drop kernel variant of destroy_flow
> > > >   net/mlx5: Rate limit errors in command interface
> > > >   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> > > >   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> > > >   RDMA/uverbs: Remove redundant check
> > >
> > > These to for-next
> >
> > Jason,
> >
> > We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5:
> > Rate limit errors in command interface" in our mlx5-next. Is it possible
> > at this point to drop it from for-next, so I'll be able to take it into
> > mlx5-next?
>
> Okay, you got to this while it was still 'wip', so it is dropped. Add
> it to the mlx5-next branch and netdev or rdma can pull it next time
> there is some reason to pull the branch.

Thanks, I cherry-picked that patch from your wip branch, so it includes
your SOB too. Most probably, RDMA will pull it in with the dump_fill_mkey
series.

Thanks

>
> Jason

* Re: [PATCH mlx5-next 05/12] net/mlx5: Rate limit errors in command interface
From: Leon Romanovsky @ 2018-06-27  5:48 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: RDMA mailing list, Hadar Hen Zion, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev
In-Reply-To: <20180624082353.16138-6-leon@kernel.org>

On Sun, Jun 24, 2018 at 11:23:46AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> Any error status returned by FW will trigger an error message in dmesg
> similar to the following:
>
> [   55.884355] mlx5_core 0000:00:04.0: mlx5_cmd_check:712:(pid 555):
> ALLOC_UAR(0x802) op_mod(0x0) failed, status limits exceeded(0x8),
> syndrome (0x0)
>
> Those prints are extremely valuable for diagnosing issues on a running
> system, and it is important to keep them. However, a not-so-careful user
> can trigger an endless number of such prints by depleting HW resources,
> spamming dmesg.
>
> Rate limiting of such messages solves this issue.
>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c       | 11 ++++-------
>  drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h |  6 ++++++
>  2 files changed, 10 insertions(+), 7 deletions(-)
>

Thanks, applied to mlx5-next.
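
The rate-limiting idea from the quoted commit message can be sketched in user space as a fixed-window limiter: at most a given burst of messages per interval, with the rest counted and dropped. This is not the code from the patch; demo_ratelimit and demo_ratelimit_ok are hypothetical names, and the caller supplies time in whatever ticks it likes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: allow `burst` messages per `interval` ticks. */
struct demo_ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages let through in this window */
	unsigned int missed;	/* messages suppressed in this window */
};

/* Returns true if a message may be printed at time `now`. */
static bool demo_ratelimit_ok(struct demo_ratelimit *rl, uint64_t now)
{
	/* Open a fresh window once the current one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	/* Over budget: drop the message but remember that it happened. */
	rl->missed++;
	return false;
}
```

A caller would wrap each error print in `if (demo_ratelimit_ok(&rl, now))`, so a user depleting HW resources still produces the first few messages of each window but cannot flood the log.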
