Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 1/3] rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
From: Sowmini Varadhan @ 2017-12-22 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar
In-Reply-To: <cover.1513946398.git.sowmini.varadhan@oracle.com>

If the rds_sock is not added to the bind_hash_table, we must
reset rs_bound_addr so that rds_remove_bound will not trip on
this rds_sock.

rds_add_bound() does a rds_sock_put() in this failure path, so
failing to reset rs_bound_addr will result in a socket refcount
bug, and will trigger a WARN_ON with the stack shown below when
the application subsequently tries to close the PF_RDS socket.

     WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
		rds_sock_destruct+0x15/0x30 [rds]
       :
     __sk_destruct+0x21/0x190
     rds_remove_bound.part.13+0xb6/0x140 [rds]
     rds_release+0x71/0x120 [rds]
     sock_release+0x1a/0x70
     sock_close+0xe/0x20
     __fput+0xd5/0x210
     task_work_run+0x82/0xa0
     do_exit+0x2ce/0xb30
     ? syscall_trace_enter+0x1cc/0x2b0
     do_group_exit+0x39/0xa0
     SyS_exit_group+0x10/0x10
     do_syscall_64+0x61/0x1a0

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/bind.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 75d43dc..5aa3a64 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -114,6 +114,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 			  rs, &addr, (int)ntohs(*port));
 			break;
 		} else {
+			rs->rs_bound_addr = 0;
 			rds_sock_put(rs);
 			ret = -ENOMEM;
 			break;
-- 
1.7.1

^ permalink raw reply related

* [PATCH net 2/3] rds: tcp: initialize t_tcp_detached to false
From: Sowmini Varadhan @ 2017-12-22 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar
In-Reply-To: <cover.1513946398.git.sowmini.varadhan@oracle.com>

Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from
rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
but this needs to be initialized explicitly to false.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 39f502d..a61a498 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -290,6 +290,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 		tc->t_cpath = &conn->c_path[i];
 
 		spin_lock_irq(&rds_tcp_conn_lock);
+		tc->t_tcp_node_detached = false;
 		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
 		spin_unlock_irq(&rds_tcp_conn_lock);
 		rdsdebug("rds_conn_path [%d] tc %p\n", i,
-- 
1.7.1

^ permalink raw reply related

* [PATCH net 3/3] rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()
From: Sowmini Varadhan @ 2017-12-22 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar
In-Reply-To: <cover.1513946398.git.sowmini.varadhan@oracle.com>

If kmem_cache_alloc() fails in the middle of the for() loop,
cleanup anything that might have been allocated so far.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp.c |   46 ++++++++++++++++++++++++++--------------------
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index a61a498..2e554ef 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -270,16 +270,33 @@ static int rds_tcp_laddr_check(struct net *net, __be32 addr)
 	return -EADDRNOTAVAIL;
 }
 
+static void rds_tcp_conn_free(void *arg)
+{
+	struct rds_tcp_connection *tc = arg;
+	unsigned long flags;
+
+	rdsdebug("freeing tc %p\n", tc);
+
+	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
+	if (!tc->t_tcp_node_detached)
+		list_del(&tc->t_tcp_node);
+	spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
+
+	kmem_cache_free(rds_tcp_conn_slab, tc);
+}
+
 static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 {
 	struct rds_tcp_connection *tc;
-	int i;
+	int i, j;
+	int ret = 0;
 
 	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
 		tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp);
-		if (!tc)
-			return -ENOMEM;
-
+		if (!tc) {
+			ret = -ENOMEM;
+			break;
+		}
 		mutex_init(&tc->t_conn_path_lock);
 		tc->t_sock = NULL;
 		tc->t_tinc = NULL;
@@ -296,22 +313,11 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 		rdsdebug("rds_conn_path [%d] tc %p\n", i,
 			 conn->c_path[i].cp_transport_data);
 	}
-
-	return 0;
-}
-
-static void rds_tcp_conn_free(void *arg)
-{
-	struct rds_tcp_connection *tc = arg;
-	unsigned long flags;
-	rdsdebug("freeing tc %p\n", tc);
-
-	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
-	if (!tc->t_tcp_node_detached)
-		list_del(&tc->t_tcp_node);
-	spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
-
-	kmem_cache_free(rds_tcp_conn_slab, tc);
+	if (ret) {
+		for (j = 0; j < i; j++)
+			rds_tcp_conn_free(conn->c_path[j].cp_transport_data);
+	}
+	return ret;
 }
 
 static bool list_has_conn(struct list_head *list, struct rds_connection *conn)
-- 
1.7.1

^ permalink raw reply related

* Re: [QUESTION] Doubt about NAPI_GRO_CB(skb)->is_atomic in tcpv4 gro process
From: Alexander Duyck @ 2017-12-22 16:32 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: netdev@vger.kernel.org, davem@davemloft.net, linuxarm@huawei.com,
	yuxiaowu, wzhen.wang, Xuehuahu
In-Reply-To: <ccef7de5-c49e-e58b-e15f-d64c0205444c@huawei.com>

On Fri, Dec 22, 2017 at 12:49 AM, Yunsheng Lin <linyunsheng@huawei.com> wrote:
> Hi, Alexander
>
> On 2017/12/22 0:29, Alexander Duyck wrote:
>> On Thu, Dec 21, 2017 at 1:16 AM, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>> Hi, Alexander
>>>
>>> On 2017/12/21 0:24, Alexander Duyck wrote:
>>>> On Wed, Dec 20, 2017 at 1:09 AM, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>> Hi, all
>>>>>         I have some doubt about NAPI_GRO_CB(skb)->is_atomic when
>>>>> analyzing the tcpv4 gro process:
>>>>>
>>>>> Firstly we set NAPI_GRO_CB(skb)->is_atomic to 1 in dev_gro_receive:
>>>>> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/core/dev.c#L4838
>>>>>
>>>>> And then in inet_gro_receive, we check the NAPI_GRO_CB(skb)->is_atomic
>>>>> before setting NAPI_GRO_CB(skb)->is_atomic according to IP_DF bit in the ip header:
>>>>> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/ipv4/af_inet.c#L1319
>>>>>
>>>>> struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>>>>> {
>>>>> .....................
>>>>>         for (p = *head; p; p = p->next) {
>>>>> ........................
>>>>>
>>>>>                 /* If the previous IP ID value was based on an atomic
>>>>>                  * datagram we can overwrite the value and ignore it.
>>>>>                  */
>>>>>                 if (NAPI_GRO_CB(skb)->is_atomic)                      //we check it here
>>>>>                         NAPI_GRO_CB(p)->flush_id = flush_id;
>>>>>                 else
>>>>>                         NAPI_GRO_CB(p)->flush_id |= flush_id;
>>>>>         }
>>>>>
>>>>>         NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));  //we set it here
>>>>>         NAPI_GRO_CB(skb)->flush |= flush;
>>>>>         skb_set_network_header(skb, off);
>>>>> ................................
>>>>> }
>>>>>
>>>>> My question is whether we should check the NAPI_GRO_CB(skb)->is_atomic or NAPI_GRO_CB(p)->is_atomic?
>>>>> If we should check NAPI_GRO_CB(skb)->is_atomic, then maybe it is unnecessary because it is alway true.
>>>>> If we should check NAPI_GRO_CB(p)->is_atomic, maybe there is a bug here.
>>>>>
>>>>> So what is the logic here? I am just start analyzing the gro, maybe I miss something obvious here.
>>>>
>>>> The logic there is to address the multiple IP header case where there
>>>> are 2 or more IP headers due to things like VXLAN or GRE tunnels. So
>>>> what will happen is that an outer IP header will end up being sent
>>>> with DF not set and will clear the is_atomic value then we want to OR
>>>> in the next header that is applied. It defaults to assignment on
>>>> is_atomic because the first IP header will encounter flush_id with no
>>>> previous configuration occupying it.
>>>
>>> I see your point now.
>>>
>>> But for the same flow of tunnels packet, the outer and inner ip header must
>>> have the same fixed id or increment id?
>>>
>>> For example, if we have a flow of tunnels packet which has fixed id in outer
>>> header and increment id in inner header(the inner header does have DF flag set):
>
> Sorry, a typo error here. I meant the inner header does *not* have DF flag set here.
>
>>>
>>> 1. For the first packet, NAPI_GRO_CB(skb)->is_atomic will be set to zero when
>>> inet_gro_receive is processing the inner ip header.
>>>
>>> 2. For the second packet, when inet_gro_receive is processing the outer ip header
>>> which has a fixed id, NAPI_GRO_CB(p)->is_atomic is zero according to [1], so
>>> NAPI_GRO_CB(p)->flush_id will be set to 0xFFFF, then the second packet will not
>>> be merged to first packet in tcp_gro_receive.
>>
>> I'm not sure how valid your case here is. The is_atomic is only really
>> meant to apply to the inner-most header.
>
> For the new skb, NAPI_GRO_CB(skb)->is_atomic is indeed applied to the
> inner-most header.
>
> What about the NAPI_GRO_CB(p)->is_atomic, p is the same skb flow already
> merged by gro.
>
> Let me try if I understand it correctly:
> when there is only one skb merged in p, then NAPI_GRO_CB(p)->is_atomic
> is set according to the first skb' inner-most ip DF flags.
>
> When the second skb comes, and inet_gro_receive is processing the
> outer-most ip, for the below code, NAPI_GRO_CB(p)->is_atomic
> is for first skb's inner-most ip DF flags, and "iph->frag_off & htons(IP_DF)"
> is for second skb' outer-most ip DF flags.
> Why don't we use the first skb's outer-most ip DF flags here? I think it is
> more logical to use first skb's outer-most ip DF flags here. But the first
> skb's outer-most ip DF flags is lost when we get here, right?
>
>                 if (!NAPI_GRO_CB(p)->is_atomic ||
>                     !(iph->frag_off & htons(IP_DF))) {
>                         flush_id ^= NAPI_GRO_CB(p)->count;
>                         flush_id = flush_id ? 0xFFFF : 0;
>                 }

We already did in a way. If you look earlier up we will set flush if
the DF bit of this packet and the next don't match. So if the DF bit
changes we are flushing anyway.

>
>
>  In the case of TCP the
>> inner-most header should almost always have the DF bit set which means
>> the inner-most is almost always atomic.
>>
>>> I thought outer ip header could have a fixed id while inner ip header could
>>> have a increment id. Do I miss something here?
>>
>> You have it backwards. The innermost will have DF bit set so it can be
>> fixed, the outer-most will in many cases not since it is usually UDP
>> and as such it will likely need to increment.
>
> Actually I got it backwards because that is what intel do when TSOing.
>
> Below is quoted from 710 datasheet in page 1052:
> The Identification field (in the IP header in non tunneled packets or the inner IP header in
> tunneled packets) is taken from the TSO header in the first segment and then it is increased by
> one for each transmitted segment.
> The Identification field (in the external IP header) is taken from the TSO header in the first
> segment and then it is either increased by one for each transmitted segment or remains
> constant depending on the EIP_NOINC setting in the transmit context descriptor.
>
> If I understand it correctly, intel is doing oppositely as pointed out by you.
> Am I miss something here?
>
> Refer:
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf

The only form of IP ID mangling supported by i40e is to increment
everything always for outer and inner header. This code was meant more
to handle the setup generated by igb or ixgbe when we are doing
GSO_PARTIAL. We don't support fixed outer IDs in that driver if I
recall correctly. The hardware might support it but the stack doesn't.
Basically if DF is not set then we must increment and if DF is set we
could either increment or be fixed and we would be within spec for how
IP IDs are supposed to be handled.

>>
>>>>
>>>> The part I am not sure about is if we should be using assignment for
>>>> is_atomic or using an "&=" to clear the bit and leave it cleared.
>>>
>>> I am not sure I understood you here. is_atomic is a bit field, why do you
>>> want to use "&="?
>>
>> Actually that was my mind kind of wandering. It has been a while since
>> I looked at this code and the use of &= wouldn't be appropriate since
>> is_atomic should only apply to the innermost header.
>>
>> Basically the only acceptable combinations for is_atomic and flush_id
>> are false with 0, or true with 1. We can't have a fixed outer header
>> value if DF is not set.
>>
>> Hope that helps to clarify things.
>
> Yes, Your reqly has helped a lot. But I still have some doubt above,
> please help me understand it.
> I hope this does not take much of your time.
> Thanks very much.
>
> Yunsheng Lin
>

You are free to doubt, but I suggest you actually walk through the
protocol and thoroughly read the code. Arguing hypothetical bugs is
rather tedious when I can easily point out the flaws and why something
isn't a bug.

- Alex

^ permalink raw reply

* [PATCH net 0/3] rds bug fixes
From: Sowmini Varadhan @ 2017-12-22 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Ran into pre-existing bugs when working on the fix for
   https://www.spinics.net/lists/netdev/msg472849.html

The bugs fixed in this patchset are unrelated to the syzbot 
failure (which I'm still testing and trying to reproduce) but 
meanwhile, let's get these fixes out of the way.

Sowmini Varadhan (3):
  rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
  rds: tcp: initialized t_tcp_detached to false
  rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

 net/rds/bind.c |    1 +
 net/rds/tcp.c  |   47 +++++++++++++++++++++++++++--------------------
 2 files changed, 28 insertions(+), 20 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state
From: Sergei Shtylyov @ 2017-12-22 17:04 UTC (permalink / raw)
  To: Yafang Shao, davem; +Cc: netdev
In-Reply-To: <1513957033-24100-1-git-send-email-laoar.shao@gmail.com>

Hello!

On 12/22/2017 06:37 PM, Yafang Shao wrote:

> There's a space character missed in the printk messages.
> This error should be prevented with checkscript.pl, but it couldn't caught
                                                                      ^ be?

> by running with "checkscript.pl -f xxxx.patch", that's what I had run
> before.
> What a carelessness.

    You generally don't need to break up the messages violating 80-column 
limit, and checkpatch.pl should be aware of this...

> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint with
> inet_sock_set_state tracepoint")
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
[...]

MBR, Sergei

^ permalink raw reply

* [bpf-next V2 PATCH 00/14] xdp: new XDP rx-queue info concept
From: Jesper Dangaard Brouer @ 2017-12-22 17:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan

V2:
* Changed API exposed to drivers
  - Removed invocation of "init" in drivers, and only call "reg"
    (Suggested by Saeed)
  - Allow "reg" to fail and handle this in drivers
    (Suggested by David Ahern)
* Removed the SINKQ qtype, instead allow to register as "unused"
* Also fixed some drivers during testing on actual HW (noted in patches)

There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on.  For both the XDP bpf-prog and kernel side.

Instead of extending struct xdp_buff each time new info is needed,
this patchset takes a different approach.  Struct xdp_buff is only
extended with a pointer to a struct xdp_rxq_info (allowing for easier
extending this later).  This xdp_rxq_info contains information related
to how the driver have setup the individual RX-queue's.  This is
read-mostly information, and all xdp_buff frames (in drivers
napi_poll) point to the same xdp_rxq_info (per RX-queue).

We stress this data/cache-line is for read-mostly info.  This is NOT
for dynamic per packet info, use the data_meta for such use-cases.

This patchset start out small, and only expose ingress_ifindex and the
RX-queue index to the XDP/BPF program. Access to tangible info like
the ingress ifindex and RX queue index, is fairly easy to comprehent.
The other future use-cases could allow XDP frames to be recycled back
to the originating device driver, by providing info on RX device and
queue number.

As XDP doesn't have driver feature flags, and eBPF code due to
bpf-tail-calls cannot determine that XDP driver invoke it, this
patchset have to update every driver that support XDP.

For driver developers (review individual driver patches!):

The xdp_rxq_info is tied to the drivers RX-ring(s). Whenever a RX-ring
modification require (temporary) stopping RX frames, then the
xdp_rxq_info should (likely) also be unregistred and re-registered,
especially if reallocating the pages in the ring. Make sure ethtool
set_channels does the right thing. When replacing XDP prog, if and
only if RX-ring need to be changed, then also re-register the
xdp_rxq_info.

I'm Cc'ing the individual driver patches to the registered maintainers.

Testing:

I've only tested the NIC drivers I have hardware for.  The general
test procedure is to (DUT = Device Under Test):
 (1) run pktgen script pktgen_sample04_many_flows.sh       (against DUT)
 (2) run samples/bpf program xdp_rxq_info --dev $DEV       (on DUT)
 (3) runtime modify number of NIC queues via ethtool -L    (on DUT)
 (4) runtime modify number of NIC ring-size via ethtool -G (on DUT)

Patch based on git tree bpf-next (at commit c475ffad58a8a2):
 https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

---

Jesper Dangaard Brouer (14):
      xdp: base API for new XDP rx-queue info concept
      xdp/mlx5: setup xdp_rxq_info
      i40e: setup xdp_rxq_info
      ixgbe: setup xdp_rxq_info
      xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
      mlx4: setup xdp_rxq_info
      bnxt_en: setup xdp_rxq_info
      nfp: setup xdp_rxq_info
      thunderx: setup xdp_rxq_info
      tun: setup xdp_rxq_info
      virtio_net: setup xdp_rxq_info
      xdp: generic XDP handling of xdp_rxq_info
      bpf: finally expose xdp_rxq_info to XDP bpf-programs
      samples/bpf: program demonstrating access to xdp_rxq_info

 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   10 
 drivers/net/ethernet/broadcom/bnxt/bnxt.h          |    2 
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |    1 
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   11 
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |    4 
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |    2 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c     |    2 
 drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   18 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.h        |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe.h           |    2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |    4 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   10 
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |    3 
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   13 
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |    4 
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |    4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |    9 
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |    1 
 drivers/net/ethernet/netronome/nfp/nfp_net.h       |    5 
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   10 
 drivers/net/ethernet/qlogic/qede/qede.h            |    2 
 drivers/net/ethernet/qlogic/qede/qede_fp.c         |    1 
 drivers/net/ethernet/qlogic/qede/qede_main.c       |   10 
 drivers/net/tun.c                                  |   24 +
 drivers/net/virtio_net.c                           |   12 
 include/linux/filter.h                             |    2 
 include/linux/netdevice.h                          |    2 
 include/net/xdp.h                                  |   48 ++
 include/uapi/linux/bpf.h                           |    3 
 net/core/Makefile                                  |    2 
 net/core/dev.c                                     |   69 ++-
 net/core/filter.c                                  |   19 +
 net/core/xdp.c                                     |   74 +++
 samples/bpf/Makefile                               |    4 
 samples/bpf/xdp_rxq_info_kern.c                    |   96 ++++
 samples/bpf/xdp_rxq_info_user.c                    |  532 ++++++++++++++++++++
 36 files changed, 990 insertions(+), 28 deletions(-)
 create mode 100644 include/net/xdp.h
 create mode 100644 net/core/xdp.c
 create mode 100644 samples/bpf/xdp_rxq_info_kern.c
 create mode 100644 samples/bpf/xdp_rxq_info_user.c

^ permalink raw reply

* [bpf-next V2 PATCH 01/14] xdp: base API for new XDP rx-queue info concept
From: Jesper Dangaard Brouer @ 2017-12-22 17:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

This patch only introduce the core data structures and API functions.
All XDP enabled drivers must use the API before this info can used.

There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on.  For both the XDP bpf-prog and kernel side.

Instead of extending xdp_buff each time new info is needed, the patch
creates a separate read-mostly struct xdp_rxq_info, that contains this
info.  We stress this data/cache-line is for read-only info.  This is
NOT for dynamic per packet info, use the data_meta for such use-cases.

The performance advantage is this info can be setup at RX-ring init
time, instead of updating N-members in xdp_buff.  A possible (driver
level) micro optimization is that xdp_buff->rxq assignment could be
done once per XDP/NAPI loop.  The extra pointer deref only happens for
program needing access to this info (thus, no slowdown to existing
use-cases).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/filter.h |    2 +
 include/net/xdp.h      |   47 +++++++++++++++++++++++++++++++++
 net/core/Makefile      |    2 +
 net/core/xdp.c         |   68 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+), 1 deletion(-)
 create mode 100644 include/net/xdp.h
 create mode 100644 net/core/xdp.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2b0df2703671..425056c7f96c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,6 +20,7 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 
+#include <net/xdp.h>
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
@@ -503,6 +504,7 @@ struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	struct xdp_rxq_info *rxq;
 };
 
 /* Compute the linear packet data range [data, data_end) which
diff --git a/include/net/xdp.h b/include/net/xdp.h
new file mode 100644
index 000000000000..8c6e83293bba
--- /dev/null
+++ b/include/net/xdp.h
@@ -0,0 +1,47 @@
+/* include/net/xdp.h
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#ifndef __LINUX_NET_XDP_H__
+#define __LINUX_NET_XDP_H__
+
+/**
+ * DOC: XDP RX-queue information
+ *
+ * The XDP RX-queue info (xdp_rxq_info) is associated with the driver
+ * level RX-ring queues.  It is information that is specific to how
+ * the driver have configured a given RX-ring queue.
+ *
+ * Each xdp_buff frame received in the driver carry a (pointer)
+ * reference to this xdp_rxq_info structure.  This provides the XDP
+ * data-path read-access to RX-info for both kernel and bpf-side
+ * (limited subset).
+ *
+ * For now, direct access is only safe while running in NAPI/softirq
+ * context.
+ *
+ * The driver usage API is an init, register and unregister API.
+ *
+ * The struct is not directly tied to the XDP prog.  A new XDP prog
+ * can be attached as long as it doesn't change the underlying
+ * RX-ring.  If the RX-ring does change significantly, the NIC driver
+ * naturally need to stop the RX-ring before purging and reallocating
+ * memory.  In that process the driver MUST call unregistor (which
+ * also apply for driver shutdown and unload).  The init and register
+ * API is also mandatory during RX-ring setup.
+ */
+
+struct xdp_rxq_info {
+	struct net_device *dev;
+	u32 queue_index;
+	u32 reg_state;
+} ____cacheline_aligned; /* perf critical, avoid false-sharing */
+
+void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
+		     struct net_device *dev, u32 queue_index);
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+
+#endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 1fd0a9c88b1b..6dbbba8c57ae 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
-			fib_notifier.o
+			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
diff --git a/net/core/xdp.c b/net/core/xdp.c
new file mode 100644
index 000000000000..dde541b71aaa
--- /dev/null
+++ b/net/core/xdp.c
@@ -0,0 +1,68 @@
+/* net/core/xdp.c
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include <linux/types.h>
+#include <linux/mm.h>
+
+#include <net/xdp.h>
+
+#define REG_STATE_NEW		0x0
+#define REG_STATE_REGISTERED	0x1
+#define REG_STATE_UNREGISTERED	0x2
+#define REG_STATE_UNUSED	0x3
+
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
+{
+	/* Simplify driver cleanup code paths, allow unreg "unused" */
+	if (xdp_rxq->reg_state == REG_STATE_UNUSED)
+		return;
+
+	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
+
+	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
+	xdp_rxq->dev = NULL;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
+
+void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
+{
+	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
+
+/* Returns 0 on success, negative on failure */
+int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
+		     struct net_device *dev, u32 queue_index)
+{
+	if (xdp_rxq->reg_state == REG_STATE_UNUSED) {
+		WARN(1, "Driver promised not to register this");
+		return -EINVAL;
+	}
+
+	if (xdp_rxq->reg_state == REG_STATE_REGISTERED) {
+		WARN(1, "Missing unregister, handled but fix driver");
+		xdp_rxq_info_unreg(xdp_rxq);
+	}
+
+	if (!dev) {
+		WARN(1, "Missing net_device from driver");
+		return -ENODEV;
+	}
+
+	/* State either UNREGISTERED or NEW */
+	xdp_rxq_info_init(xdp_rxq);
+	xdp_rxq->dev = dev;
+	xdp_rxq->queue_index = queue_index;
+
+	xdp_rxq->reg_state = REG_STATE_REGISTERED;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
+
+void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
+{
+	xdp_rxq->reg_state = REG_STATE_UNUSED;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unused);

^ permalink raw reply related

* [bpf-next V2 PATCH 02/14] xdp/mlx5: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, dsahern, Matan Barak, Saeed Mahameed,
	Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan,
	Tariq Toukan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

The mlx5 driver have a special drop-RQ queue (one per interface) that
simply drops all incoming traffic. It helps driver keep other HW
objects (flow steering) alive upon down/up operations.  It is
temporarily pointed by flow steering objects during the interface
setup, and when interface is down. It lacks many fields that are set
in a regular RQ (for example its state is never switched to
MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explanation).

The XDP RX-queue info for this drop-RQ marked as unused, which
allow us to use the same takedown/free code path as other RX-queues.

Driver hook points for xdp_rxq_info:
 * reg   : mlx5e_alloc_rq()
 * unused: mlx5e_alloc_drop_rq()
 * unreg : mlx5e_free_rq()

Tested on actual hardware with samples/bpf program

Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Matan Barak <matanb@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    9 +++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |    1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c0872b3284cb..405b09db2a84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -46,6 +46,7 @@
 #include <linux/mlx5/transobj.h>
 #include <linux/rhashtable.h>
 #include <net/switchdev.h>
+#include <net/xdp.h>
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
@@ -568,6 +569,9 @@ struct mlx5e_rq {
 	u32                    rqn;
 	struct mlx5_core_dev  *mdev;
 	struct mlx5_core_mkey  umr_mkey;
+
+	/* XDP read-mostly */
+	struct xdp_rxq_info    xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_channel {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0f5c012de52e..128cb9a32549 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -589,6 +589,9 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		goto err_rq_wq_destroy;
 	}
 
+	if (xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix) < 0)
+		goto err_rq_wq_destroy;
+
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = params->rq_headroom;
 
@@ -695,6 +698,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 err_rq_wq_destroy:
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
+	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -707,6 +711,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 
+	xdp_rxq_info_unreg(&rq->xdp_rxq);
+
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
@@ -2768,6 +2774,9 @@ static int mlx5e_alloc_drop_rq(struct mlx5_core_dev *mdev,
 	if (err)
 		return err;
 
+	/* Mark as unused given "Drop-RQ" packets never reach XDP */
+	xdp_rxq_info_unused(&rq->xdp_rxq);
+
 	rq->mdev = mdev;
 
 	return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b499c7a698f..7b38480811d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -812,6 +812,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
+	xdp.rxq = &rq->xdp_rxq;
 
 	act = bpf_prog_run_xdp(prog, &xdp);
 	switch (act) {

^ permalink raw reply related

* [bpf-next V2 PATCH 03/14] i40e: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, intel-wired-lan, Jeff Kirsher,
	dsahern, gospo, bjorn.topel, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

The i40e driver have a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
a sideband channel for configuring/updating the flow director tables.
This (i40e_vsi_)type does not invoke XDP-ebpf code. Thus, mark this
type as a XDP RX-queue type RXQ_TYPE_SINK.

Driver hook points for xdp_rxq_info:
 * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
 * unreg: i40e_free_rx_resources    (via i40e_vsi_free_rx_resources)

Tested on actual hardware with samples/bpf program.

V2: Fixed bug in i40e_set_ringparam

Cc: intel-wired-lan@lists.osuosl.org
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    2 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    |   18 ++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |    3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5f6cf7212d4f..cfd788b4fd7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1585,6 +1585,8 @@ static int i40e_set_ringparam(struct net_device *netdev,
 			 */
 			rx_rings[i].desc = NULL;
 			rx_rings[i].rx_bi = NULL;
+			/* Clear cloned XDP RX-queue info before setup call */
+			memset(&rx_rings[i].xdp_rxq, 0, sizeof(rx_rings[i].xdp_rxq));
 			/* this is to allow wr32 to have something to write to
 			 * during early allocation of Rx buffers
 			 */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4566d66ffc7c..2a8a85e3ae8f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -27,6 +27,7 @@
 #include <linux/prefetch.h>
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -1236,6 +1237,8 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 {
 	i40e_clean_rx_ring(rx_ring);
+	if (rx_ring->vsi->type == I40E_VSI_MAIN)
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
 	rx_ring->xdp_prog = NULL;
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
@@ -1256,6 +1259,7 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 {
 	struct device *dev = rx_ring->dev;
+	int err = -ENOMEM;
 	int bi_size;
 
 	/* warn if we are about to overwrite the pointer */
@@ -1283,13 +1287,21 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
+	/* XDP RX-queue info only needed for RX rings exposed to XDP */
+	if (rx_ring->vsi->type == I40E_VSI_MAIN) {
+		err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
+				       rx_ring->queue_index);
+		if (err < 0)
+			goto err;
+	}
+
 	rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
 
 	return 0;
 err:
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
-	return -ENOMEM;
+	return err;
 }
 
 /**
@@ -2068,11 +2080,13 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	struct sk_buff *skb = rx_ring->skb;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
 	bool failure = false, xdp_xmit = false;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *rx_buffer;
 		union i40e_rx_desc *rx_desc;
-		struct xdp_buff xdp;
 		unsigned int size;
 		u16 vlan_tag;
 		u8 rx_ptype;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fbae1182e2ea..2d08760fc4ce 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -27,6 +27,8 @@
 #ifndef _I40E_TXRX_H_
 #define _I40E_TXRX_H_
 
+#include <net/xdp.h>
+
 /* Interrupt Throttling and Rate Limiting Goodies */
 
 #define I40E_MAX_ITR               0x0FF0  /* reg uses 2 usec resolution */
@@ -428,6 +430,7 @@ struct i40e_ring {
 					 */
 
 	struct i40e_channel *ch;
+	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)

^ permalink raw reply related

* [bpf-next V2 PATCH 04/14] ixgbe: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, dsahern, Alexander Duyck, intel-wired-lan, Jeff Kirsher,
	Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * reg  : ixgbe_setup_rx_resources()
 * unreg: ixgbe_free_rx_resources()

Tested on actual hardware.

V2: Fix ixgbe_set_ringparam, clear xdp_rxq_info in temp_ring

Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |    4 ++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |   10 +++++++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 468c3555a629..8611763d6129 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -53,6 +53,7 @@
 #include <linux/dca.h>
 #endif
 
+#include <net/xdp.h>
 #include <net/busy_poll.h>
 
 /* common prefix used by pr_<> macros */
@@ -371,6 +372,7 @@ struct ixgbe_ring {
 		struct ixgbe_tx_queue_stats tx_stats;
 		struct ixgbe_rx_queue_stats rx_stats;
 	};
+	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 0aad1c2a3667..0aaf70b3cfcd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1156,6 +1156,10 @@ static int ixgbe_set_ringparam(struct net_device *netdev,
 			memcpy(&temp_ring[i], adapter->rx_ring[i],
 			       sizeof(struct ixgbe_ring));
 
+			/* Clear copied XDP RX-queue info */
+			memset(&temp_ring[i].xdp_rxq, 0,
+			       sizeof(temp_ring[i].xdp_rxq));
+
 			temp_ring[i].count = new_rx_count;
 			err = ixgbe_setup_rx_resources(adapter, &temp_ring[i]);
 			if (err) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7737a05c717c..95aba975b391 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2318,12 +2318,14 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 #endif /* IXGBE_FCOE */
 	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
 	bool xdp_xmit = false;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
 		struct ixgbe_rx_buffer *rx_buffer;
 		struct sk_buff *skb;
-		struct xdp_buff xdp;
 		unsigned int size;
 
 		/* return some buffers to hardware, one at a time is too slow */
@@ -6444,6 +6446,11 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
+	/* XDP RX-queue info */
+	if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
+			     rx_ring->queue_index) < 0)
+		goto err;
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
@@ -6541,6 +6548,7 @@ void ixgbe_free_rx_resources(struct ixgbe_ring *rx_ring)
 	ixgbe_clean_rx_ring(rx_ring);
 
 	rx_ring->xdp_prog = NULL;
+	xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
 	vfree(rx_ring->rx_buffer_info);
 	rx_ring->rx_buffer_info = NULL;
 

^ permalink raw reply related

* [bpf-next V2 PATCH 05/14] xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: bjorn.topel, netdev, Jesper Dangaard Brouer, Ariel Elior, dsahern,
	gospo, everest-linux-l2, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

The driver code qede_free_fp_array() depend on kfree() can be called
with a NULL pointer. This stems from the qede_alloc_fp_array()
function which either (kz)alloc memory for fp->txq or fp->rxq.
This also simplifies error handling code in case of memory allocation
failures, but xdp_rxq_info_unreg need to know the difference.

Introduce xdp_rxq_info_is_reg() to handle if a memory allocation fails
and detect this is the failure path by seeing that xdp_rxq_info was
not registred yet, which first happens after successful alloaction in
qede_init_fp().

Driver hook points for xdp_rxq_info:
 * reg  : qede_init_fp
 * unreg: qede_free_fp_array

Tested on actual hardware with samples/bpf program.

V2: Driver have no proper error path for failed XDP RX-queue info reg, as
qede_init_fp() is a void function.

Cc: everest-linux-l2@cavium.com
Cc: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/qlogic/qede/qede.h      |    2 ++
 drivers/net/ethernet/qlogic/qede/qede_fp.c   |    1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |   10 ++++++++++
 include/net/xdp.h                            |    1 +
 net/core/xdp.c                               |    6 ++++++
 5 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index a3a70ade411f..62f47736511b 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -40,6 +40,7 @@
 #include <linux/kernel.h>
 #include <linux/mutex.h>
 #include <linux/bpf.h>
+#include <net/xdp.h>
 #include <linux/qed/qede_rdma.h>
 #include <linux/io.h>
 #ifdef CONFIG_RFS_ACCEL
@@ -345,6 +346,7 @@ struct qede_rx_queue {
 	u64 xdp_no_pass;
 
 	void *handle;
+	struct xdp_rxq_info xdp_rxq;
 };
 
 union db_prod {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index 48ec4c56cddf..dafc079ab6b9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1006,6 +1006,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 	xdp.data = xdp.data_hard_start + *data_offset;
 	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
+	xdp.rxq = &rxq->xdp_rxq;
 
 	/* Queues always have a full reset currently, so for the time
 	 * being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 57332b3e5e64..24160f1fd0e5 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -762,6 +762,12 @@ static void qede_free_fp_array(struct qede_dev *edev)
 			fp = &edev->fp_array[i];
 
 			kfree(fp->sb_info);
+			/* Handle mem alloc failure case where qede_init_fp
+			 * didn't register xdp_rxq_info yet.
+			 * Implicit only (fp->type & QEDE_FASTPATH_RX)
+			 */
+			if (fp->rxq && xdp_rxq_info_is_reg(&fp->rxq->xdp_rxq))
+				xdp_rxq_info_unreg(&fp->rxq->xdp_rxq);
 			kfree(fp->rxq);
 			kfree(fp->xdp_tx);
 			kfree(fp->txq);
@@ -1498,6 +1504,10 @@ static void qede_init_fp(struct qede_dev *edev)
 			else
 				fp->rxq->data_direction = DMA_FROM_DEVICE;
 			fp->rxq->dev = &edev->pdev->dev;
+
+			/* Driver have no error path from here */
+			WARN_ON(xdp_rxq_info_reg(&fp->rxq->xdp_rxq, edev->ndev,
+						 fp->rxq->rxq_id) < 0);
 		}
 
 		if (fp->type & QEDE_FASTPATH_TX) {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 8c6e83293bba..9b29e06e4687 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -43,5 +43,6 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index dde541b71aaa..5af17c44b92d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -66,3 +66,9 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
 	xdp_rxq->reg_state = REG_STATE_UNUSED;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unused);
+
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
+{
+	return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);

^ permalink raw reply related

* [bpf-next V2 PATCH 06/14] mlx4: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, dsahern, Jesper Dangaard Brouer, gospo, bjorn.topel,
	michael.chan, Tariq Toukan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * reg  : mlx4_en_create_rx_ring
 * unreg: mlx4_en_destroy_rx_ring

Tested on actual hardware.

Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |   13 ++++++++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    4 +++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 99051a294fa6..0cfcf3089ae4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2172,8 +2172,9 @@ static int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 
 		if (mlx4_en_create_rx_ring(priv, &priv->rx_ring[i],
 					   prof->rx_ring_size, priv->stride,
-					   node))
+					   node, i))
 			goto err;
+
 	}
 
 #ifdef CONFIG_RFS_ACCEL
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 85e28efcda33..48bbfc1ad391 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -262,7 +262,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
 
 int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_rx_ring **pring,
-			   u32 size, u16 stride, int node)
+			   u32 size, u16 stride, int node, int queue_index)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_rx_ring *ring;
@@ -286,6 +286,9 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 	ring->log_stride = ffs(ring->stride) - 1;
 	ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
 
+	if (xdp_rxq_info_reg(&ring->xdp_rxq, priv->dev, queue_index) < 0)
+		goto err_ring;
+
 	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
 					sizeof(struct mlx4_en_rx_alloc));
 	ring->rx_info = vzalloc_node(tmp, node);
@@ -293,7 +296,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 		ring->rx_info = vzalloc(tmp);
 		if (!ring->rx_info) {
 			err = -ENOMEM;
-			goto err_ring;
+			goto err_xdp_info;
 		}
 	}
 
@@ -317,6 +320,8 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 err_info:
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
+err_xdp_info:
+	xdp_rxq_info_unreg(&ring->xdp_rxq);
 err_ring:
 	kfree(ring);
 	*pring = NULL;
@@ -440,6 +445,7 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 					lockdep_is_held(&mdev->state_lock));
 	if (old_prog)
 		bpf_prog_put(old_prog);
+	xdp_rxq_info_unreg(&ring->xdp_rxq);
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
@@ -650,6 +656,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	int cq_ring = cq->ring;
 	bool doorbell_pending;
 	struct mlx4_cqe *cqe;
+	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -664,6 +671,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(ring->xdp_prog);
+	xdp.rxq = &ring->xdp_rxq;
 	doorbell_pending = 0;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
@@ -748,7 +756,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 * read bytes but not past the end of the frag.
 		 */
 		if (xdp_prog) {
-			struct xdp_buff xdp;
 			dma_addr_t dma;
 			void *orig_data;
 			u32 act;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 2b72677eccd4..39f3f1b1ab0a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -46,6 +46,7 @@
 #endif
 #include <linux/cpu_rmap.h>
 #include <linux/ptp_clock_kernel.h>
+#include <net/xdp.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/qp.h>
@@ -356,6 +357,7 @@ struct mlx4_en_rx_ring {
 	unsigned long dropped;
 	int hwtstamp_rx_filter;
 	cpumask_var_t affinity_mask;
+	struct xdp_rxq_info xdp_rxq;
 };
 
 struct mlx4_en_cq {
@@ -719,7 +721,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev);
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv);
 int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_rx_ring **pring,
-			   u32 size, u16 stride, int node);
+			   u32 size, u16 stride, int node, int queue_index);
 void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring **pring,
 			     u32 size, u16 stride);

^ permalink raw reply related

* [bpf-next V2 PATCH 07/14] bnxt_en: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, michael.chan,
	bjorn.topel, gospo, Andy Gospodarek
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * reg  : bnxt_alloc_rx_rings
 * unreg: bnxt_free_rx_rings

This driver should be updated to re-register when changing
allocation mode of RX rings.

Tested on actual hardware.

Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   10 ++++++++++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |    2 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |    1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1d865ae201db..e380f1a039ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2247,6 +2247,9 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
 		if (rxr->xdp_prog)
 			bpf_prog_put(rxr->xdp_prog);
 
+		if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
+			xdp_rxq_info_unreg(&rxr->xdp_rxq);
+
 		kfree(rxr->rx_tpa);
 		rxr->rx_tpa = NULL;
 
@@ -2280,6 +2283,10 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 
 		ring = &rxr->rx_ring_struct;
 
+		rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
+		if (rc < 0)
+			return rc;
+
 		rc = bnxt_alloc_ring(bp, ring);
 		if (rc)
 			return rc;
@@ -2834,6 +2841,9 @@ void bnxt_set_ring_params(struct bnxt *bp)
 	bp->cp_ring_mask = bp->cp_bit - 1;
 }
 
+/* Changing allocation mode of RX rings.
+ * TODO: Update when extending xdp_rxq_info to support allocation modes.
+ */
 int bnxt_set_rx_skb_mode(struct bnxt *bp, bool page_mode)
 {
 	if (page_mode) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 5359a1f0045f..2d268fc26f5e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -23,6 +23,7 @@
 #include <net/devlink.h>
 #include <net/dst_metadata.h>
 #include <net/switchdev.h>
+#include <net/xdp.h>
 
 struct tx_bd {
 	__le32 tx_bd_len_flags_type;
@@ -664,6 +665,7 @@ struct bnxt_rx_ring_info {
 
 	struct bnxt_ring_struct	rx_ring_struct;
 	struct bnxt_ring_struct	rx_agg_ring_struct;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 struct bnxt_cp_ring_info {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 261e5847557a..1389ab5e05df 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -96,6 +96,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 	xdp.data = *data_ptr;
 	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = *data_ptr + *len;
+	xdp.rxq = &rxr->xdp_rxq;
 	orig_data = xdp.data;
 	mapping = rx_buf->mapping - bp->rx_dma_offset;
 

^ permalink raw reply related

* [bpf-next V2 PATCH 08/14] nfp: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: dsahern, Jakub Kicinski, oss-drivers, Simon Horman,
	Jesper Dangaard Brouer, netdev, gospo, bjorn.topel, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * reg  : nfp_net_rx_ring_alloc
 * unreg: nfp_net_rx_ring_free

In struct nfp_net_rx_ring moved member @size into a hole on 64-bit.
Thus, the size remaines the same after adding member @xdp_rxq.

Cc: oss-drivers@netronome.com
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_net.h       |    5 ++++-
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   10 ++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 3801c52098d5..0e564cfabe7e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -47,6 +47,7 @@
 #include <linux/netdevice.h>
 #include <linux/pci.h>
 #include <linux/io-64-nonatomic-hi-lo.h>
+#include <net/xdp.h>
 
 #include "nfp_net_ctrl.h"
 
@@ -350,6 +351,7 @@ struct nfp_net_rx_buf {
  * @rxds:       Virtual address of FL/RX ring in host memory
  * @dma:        DMA address of the FL/RX ring
  * @size:       Size, in bytes, of the FL/RX ring (needed to free)
+ * @xdp_rxq:    RX-ring info avail for XDP
  */
 struct nfp_net_rx_ring {
 	struct nfp_net_r_vector *r_vec;
@@ -361,13 +363,14 @@ struct nfp_net_rx_ring {
 	u32 idx;
 
 	int fl_qcidx;
+	unsigned int size;
 	u8 __iomem *qcp_fl;
 
 	struct nfp_net_rx_buf *rxbufs;
 	struct nfp_net_rx_desc *rxds;
 
 	dma_addr_t dma;
-	unsigned int size;
+	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_aligned;
 
 /**
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0add4870ce2e..45b8cae937be 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1608,11 +1608,13 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 	unsigned int true_bufsz;
 	struct sk_buff *skb;
 	int pkts_polled = 0;
+	struct xdp_buff xdp;
 	int idx;
 
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(dp->xdp_prog);
 	true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
+	xdp.rxq = &rx_ring->xdp_rxq;
 	tx_ring = r_vec->xdp_ring;
 
 	while (pkts_polled < budget) {
@@ -1703,7 +1705,6 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 				  dp->bpf_offload_xdp) && !meta.portid) {
 			void *orig_data = rxbuf->frag + pkt_off;
 			unsigned int dma_off;
-			struct xdp_buff xdp;
 			int act;
 
 			xdp.data_hard_start = rxbuf->frag + NFP_NET_RX_BUF_HEADROOM;
@@ -2252,6 +2253,7 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
 	struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
 	struct nfp_net_dp *dp = &r_vec->nfp_net->dp;
 
+	xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
 	kfree(rx_ring->rxbufs);
 
 	if (rx_ring->rxds)
@@ -2275,7 +2277,11 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
 static int
 nfp_net_rx_ring_alloc(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring)
 {
-	int sz;
+	int sz, err;
+
+	err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, dp->netdev, rx_ring->idx);
+	if (err < 0)
+		return err;
 
 	rx_ring->cnt = dp->rxd_cnt;
 	rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt;

^ permalink raw reply related

* [bpf-next V2 PATCH 09/14] thunderx: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: bjorn.topel, Robert Richter, netdev, dsahern,
	Jesper Dangaard Brouer, gospo, Sunil Goutham, michael.chan,
	linux-arm-kernel
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

This driver uses a bool scheme for "enable"/"disable" when setting up
different resources.  Thus, the hook points for xdp_rxq_info is done
in the same function call nicvf_rcv_queue_config().  This is activated
through enable/disable via nicvf_config_data_transfer(), which is tied
into nicvf_stop()/nicvf_open().

Extending driver packet handler call-path nicvf_rcv_pkt_handler() with
a pointer to the given struct rcv_queue, in-order to access the
xdp_rxq_info data area (in nicvf_xdp_rx()).

V2: Driver have no proper error path for failed XDP RX-queue info reg,
as nicvf_rcv_queue_config is a void function.

Cc: linux-arm-kernel@lists.infradead.org
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Robert Richter <rric@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   11 +++++++----
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |    4 ++++
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |    2 ++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 52b3a6044f85..21618d0d694f 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -521,7 +521,7 @@ static void nicvf_unmap_page(struct nicvf *nic, struct page *page, u64 dma_addr)
 
 static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 				struct cqe_rx_t *cqe_rx, struct snd_queue *sq,
-				struct sk_buff **skb)
+				struct rcv_queue *rq, struct sk_buff **skb)
 {
 	struct xdp_buff xdp;
 	struct page *page;
@@ -545,6 +545,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	xdp.data = (void *)cpu_addr;
 	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + len;
+	xdp.rxq = &rq->xdp_rxq;
 	orig_data = xdp.data;
 
 	rcu_read_lock();
@@ -698,7 +699,8 @@ static inline void nicvf_set_rxhash(struct net_device *netdev,
 
 static void nicvf_rcv_pkt_handler(struct net_device *netdev,
 				  struct napi_struct *napi,
-				  struct cqe_rx_t *cqe_rx, struct snd_queue *sq)
+				  struct cqe_rx_t *cqe_rx,
+				  struct snd_queue *sq, struct rcv_queue *rq)
 {
 	struct sk_buff *skb = NULL;
 	struct nicvf *nic = netdev_priv(netdev);
@@ -724,7 +726,7 @@ static void nicvf_rcv_pkt_handler(struct net_device *netdev,
 	/* For XDP, ignore pkts spanning multiple pages */
 	if (nic->xdp_prog && (cqe_rx->rb_cnt == 1)) {
 		/* Packet consumed by XDP */
-		if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, &skb))
+		if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, rq, &skb))
 			return;
 	} else {
 		skb = nicvf_get_rcv_skb(snic, cqe_rx,
@@ -781,6 +783,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
 	struct cqe_rx_t *cq_desc;
 	struct netdev_queue *txq;
 	struct snd_queue *sq = &qs->sq[cq_idx];
+	struct rcv_queue *rq = &qs->rq[cq_idx];
 	unsigned int tx_pkts = 0, tx_bytes = 0, txq_idx;
 
 	spin_lock_bh(&cq->lock);
@@ -811,7 +814,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
 
 		switch (cq_desc->cqe_type) {
 		case CQE_TYPE_RX:
-			nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq);
+			nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq, rq);
 			work_done++;
 		break;
 		case CQE_TYPE_SEND:
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index f38ea349aa00..14e62c6ac342 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -760,6 +760,7 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
 
 	if (!rq->enable) {
 		nicvf_reclaim_rcv_queue(nic, qs, qidx);
+		xdp_rxq_info_unreg(&rq->xdp_rxq);
 		return;
 	}
 
@@ -772,6 +773,9 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
 	/* all writes of RBDR data to be loaded into L2 Cache as well*/
 	rq->caching = 1;
 
+	/* Driver have no proper error path for failed XDP RX-queue info reg */
+	WARN_ON(xdp_rxq_info_reg(&rq->xdp_rxq, nic->netdev, qidx) < 0);
+
 	/* Send a mailbox msg to PF to config RQ */
 	mbx.rq.msg = NIC_MBOX_MSG_RQ_CFG;
 	mbx.rq.qs_num = qs->vnic_id;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 178ab6e8e3c5..7d1e4e2aaad0 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -12,6 +12,7 @@
 #include <linux/netdevice.h>
 #include <linux/iommu.h>
 #include <linux/bpf.h>
+#include <net/xdp.h>
 #include "q_struct.h"
 
 #define MAX_QUEUE_SET			128
@@ -255,6 +256,7 @@ struct rcv_queue {
 	u8		start_qs_rbdr_idx; /* RBDR idx in the above QS */
 	u8		caching;
 	struct		rx_tx_queue_stats stats;
+	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
 struct cmp_queue {

^ permalink raw reply related

* [bpf-next V2 PATCH 10/14] tun: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, Jason Wang, dsahern,
	Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * reg  : tun_attach
 * unreg: __tun_detach

I've done some manual testing of this tun driver, but I would
appriciate good review and someone else running their use-case tests,
as I'm not 100% sure I understand the tfile->detached semantics.

V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.

Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/tun.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e367d6310353..e7c5f4b2a9a6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,6 +180,7 @@ struct tun_file {
 	struct list_head next;
 	struct tun_struct *detached;
 	struct skb_array tx_array;
+	struct xdp_rxq_info xdp_rxq;
 };
 
 struct tun_flow_entry {
@@ -687,8 +688,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			    tun->dev->reg_state == NETREG_REGISTERED)
 				unregister_netdevice(tun->dev);
 		}
-		if (tun)
+		if (tun) {
 			skb_array_cleanup(&tfile->tx_array);
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+		}
 		sock_put(&tfile->sk);
 	}
 }
@@ -728,11 +731,13 @@ static void tun_detach_all(struct net_device *dev)
 		tun_napi_del(tun, tfile);
 		/* Drop read queue */
 		tun_queue_purge(tfile);
+		xdp_rxq_info_unreg(&tfile->xdp_rxq);
 		sock_put(&tfile->sk);
 	}
 	list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
 		tun_enable_queue(tfile);
 		tun_queue_purge(tfile);
+		xdp_rxq_info_unreg(&tfile->xdp_rxq);
 		sock_put(&tfile->sk);
 	}
 	BUG_ON(tun->numdisabled != 0);
@@ -784,6 +789,22 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 
 	tfile->queue_index = tun->numqueues;
 	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
+
+	if (tfile->detached) {
+		/* Re-attach detached tfile, updating XDP queue_index */
+		WARN_ON(!xdp_rxq_info_is_reg(&tfile->xdp_rxq));
+
+		if (tfile->xdp_rxq.queue_index    != tfile->queue_index)
+			tfile->xdp_rxq.queue_index = tfile->queue_index;
+	} else {
+		/* Setup XDP RX-queue info, for new tfile getting attached */
+		err = xdp_rxq_info_reg(&tfile->xdp_rxq,
+				       tun->dev, tfile->queue_index);
+		if (err < 0)
+			goto out;
+		err = 0;
+	}
+
 	rcu_assign_pointer(tfile->tun, tun);
 	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
 	tun->numqueues++;
@@ -1508,6 +1529,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp.data = buf + pad;
 		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
+		xdp.rxq = &tfile->xdp_rxq;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 

^ permalink raw reply related

* [bpf-next V2 PATCH 11/14] virtio_net: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Michael S. Tsirkin, netdev, Jason Wang, dsahern, virtualization,
	Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet.  Thus, the xdp_rxq_info also remains
fairly static.  This allow us to simply add the reg/unreg to
net_device open/close functions.

Driver hook points for xdp_rxq_info:
 * reg  : virtnet_open
 * unreg: virtnet_close

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/virtio_net.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b658a6cc..6d45d687ec6a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,6 +31,7 @@
 #include <linux/average.h>
 #include <linux/filter.h>
 #include <net/route.h>
+#include <net/xdp.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -115,6 +116,8 @@ struct receive_queue {
 
 	/* Name of this receive queue: input.$index */
 	char name[40];
+
+	struct xdp_rxq_info xdp_rxq;
 };
 
 struct virtnet_info {
@@ -559,6 +562,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 		xdp.data = xdp.data_hard_start + xdp_headroom;
 		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
+		xdp.rxq = &rq->xdp_rxq;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -1225,13 +1229,18 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int i;
+	int i, err;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		if (i < vi->curr_queue_pairs)
 			/* Make sure we have some buffers: if oom use wq. */
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
+
+		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i);
+		if (err < 0)
+			return err;
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
@@ -1560,6 +1569,7 @@ static int virtnet_close(struct net_device *dev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
+		xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
 		napi_disable(&vi->rq[i].napi);
 		virtnet_napi_tx_disable(&vi->sq[i].napi);
 	}

^ permalink raw reply related

* [bpf-next V2 PATCH 12/14] xdp: generic XDP handling of xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Hook points for xdp_rxq_info:
 * reg  : netif_alloc_rx_queues
 * unreg: netif_free_rx_queues

The net_device have some members (num_rx_queues + real_num_rx_queues)
and data-area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).

For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   69 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc4ce7456e38..43595b037872 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@
 #include <net/dcbnl.h>
 #endif
 #include <net/netprio_cgroup.h>
+#include <net/xdp.h>
 
 #include <linux/netdev_features.h>
 #include <linux/neighbour.h>
@@ -686,6 +687,7 @@ struct netdev_rx_queue {
 #endif
 	struct kobject			kobj;
 	struct net_device		*dev;
+	struct xdp_rxq_info		xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
 /*
diff --git a/net/core/dev.c b/net/core/dev.c
index b0eee49a2489..24d7bbdf3758 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3889,9 +3889,33 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	return NET_RX_DROP;
 }
 
+static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_rx_queue *rxqueue;
+
+	rxqueue = dev->_rx;
+
+	if (skb_rx_queue_recorded(skb)) {
+		u16 index = skb_get_rx_queue(skb);
+
+		if (unlikely(index >= dev->real_num_rx_queues)) {
+			WARN_ONCE(dev->real_num_rx_queues > 1,
+				  "%s received packet on queue %u, but number "
+				  "of RX queues is %u\n",
+				  dev->name, index, dev->real_num_rx_queues);
+
+			return rxqueue; /* Return first rxqueue */
+		}
+		rxqueue += index;
+	}
+	return rxqueue;
+}
+
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 				     struct bpf_prog *xdp_prog)
 {
+	struct netdev_rx_queue *rxqueue;
 	u32 metalen, act = XDP_DROP;
 	struct xdp_buff xdp;
 	void *orig_data;
@@ -3935,6 +3959,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	xdp.data_hard_start = skb->data - skb_headroom(skb);
 	orig_data = xdp.data;
 
+	rxqueue = netif_get_rxqueue(skb);
+	xdp.rxq = &rxqueue->xdp_rxq;
+
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	off = xdp.data - orig_data;
@@ -7557,12 +7584,12 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
 	struct netdev_rx_queue *rx;
 	size_t sz = count * sizeof(*rx);
+	int err = 0;
 
 	BUG_ON(count < 1);
 
@@ -7572,11 +7599,39 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	dev->_rx = rx;
 
-	for (i = 0; i < count; i++)
+	for (i = 0; i < count; i++) {
 		rx[i].dev = dev;
+
+		/* XDP RX-queue setup */
+		err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i);
+		if (err < 0)
+			goto err_rxq_info;
+	}
 	return 0;
+
+err_rxq_info:
+	/* Rollback successful reg's and free other resources */
+	while (i--)
+		xdp_rxq_info_unreg(&rx[i].xdp_rxq);
+	kfree(dev->_rx);
+	dev->_rx = NULL;
+	return err;
+}
+
+static void netif_free_rx_queues(struct net_device *dev)
+{
+	unsigned int i, count = dev->num_rx_queues;
+	struct netdev_rx_queue *rx;
+
+	/* netif_alloc_rx_queues alloc failed, resources have been unreg'ed */
+	if (!dev->_rx)
+		return;
+
+	rx = dev->_rx;
+
+	for (i = 0; i < count; i++)
+		xdp_rxq_info_unreg(&rx[i].xdp_rxq);
 }
-#endif
 
 static void netdev_init_one_queue(struct net_device *dev,
 				  struct netdev_queue *queue, void *_unused)
@@ -8137,12 +8192,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
 	}
-#endif
 
 	alloc_size = sizeof(struct net_device);
 	if (sizeof_priv) {
@@ -8199,12 +8252,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
-#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
 	if (netif_alloc_rx_queues(dev))
 		goto free_all;
-#endif
 
 	strcpy(dev->name, name);
 	dev->name_assign_type = name_assign_type;
@@ -8243,9 +8294,7 @@ void free_netdev(struct net_device *dev)
 
 	might_sleep();
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_SYSFS
-	kvfree(dev->_rx);
-#endif
+	netif_free_rx_queues(dev);
 
 	kfree(rcu_dereference_protected(dev->ingress_queue, 1));
 

^ permalink raw reply related

* [bpf-next V2 PATCH 13/14] bpf: finally expose xdp_rxq_info to XDP bpf-programs
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

Now all XDP driver have been updated to setup xdp_rxq_info and assign
this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
of the xdp_rxq_info struct members.

This patch extend xdp_md and expose UAPI to userspace for
ingress_ifindex and rx_queue_index.  Access happens via bpf
instruction rewrite, that load data directly from struct xdp_rxq_info.

* ingress_ifindex map to xdp_rxq_info->dev->ifindex
* rx_queue_index  map to xdp_rxq_info->queue_index

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h |    3 +++
 net/core/filter.c        |   19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69eabfcb9bdb..a6000a95d40e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -899,6 +899,9 @@ struct xdp_md {
 	__u32 data;
 	__u32 data_end;
 	__u32 data_meta;
+	/* Below access go though struct xdp_rxq_info */
+	__u32 ingress_ifindex; /* rxq->dev->ifindex */
+	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
 enum sk_action {
diff --git a/net/core/filter.c b/net/core/filter.c
index 130b842c3a15..acdb94c0e97f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4304,6 +4304,25 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct xdp_buff, data_end));
 		break;
+	case offsetof(struct xdp_md, ingress_ifindex):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, rxq));
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, dev),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct xdp_rxq_info, dev));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      bpf_target_off(struct net_device,
+						     ifindex, 4, target_size));
+		break;
+	case offsetof(struct xdp_md, rx_queue_index):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, rxq));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      bpf_target_off(struct xdp_rxq_info,
+						queue_index, 4, target_size));
+		break;
 	}
 
 	return insn - insn_buf;

^ permalink raw reply related

* [bpf-next V2 PATCH 14/14] samples/bpf: program demonstrating access to xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-22 17:12 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151396262289.20006.1429172971820409456.stgit@firesoul>

This sample program can be used for monitoring and reporting how many
packets per sec (pps) are received per NIC RX queue index and which
CPU processed the packet. In itself it is a useful tool for quickly
identifying RSS imbalance issues, see below.

The default XDP action is XDP_PASS in-order to provide a monitor
mode. For benchmarking purposes it is possible to specify other XDP
actions on the cmdline --action.

Output below shows an imbalance RSS case where most RXQ's deliver to
CPU-0 while CPU-2 only get packets from a single RXQ.  Looking at
things from a CPU level the two CPUs are processing approx the same
amount, BUT looking at the rx_queue_index levels it is clear that
RXQ-2 receive much better service, than other RXQs which all share CPU-0.

Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       900,473     0
XDP-RX CPU      2       906,921     0
XDP-RX CPU      total   1,807,395

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   180,098     0
rx_queue_index    0:sum 180,098
rx_queue_index    1:0   180,098     0
rx_queue_index    1:sum 180,098
rx_queue_index    2:2   906,921     0
rx_queue_index    2:sum 906,921
rx_queue_index    3:0   180,098     0
rx_queue_index    3:sum 180,098
rx_queue_index    4:0   180,082     0
rx_queue_index    4:sum 180,082
rx_queue_index    5:0   180,093     0
rx_queue_index    5:sum 180,093

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile            |    4 
 samples/bpf/xdp_rxq_info_kern.c |   96 +++++++
 samples/bpf/xdp_rxq_info_user.c |  532 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 632 insertions(+)
 create mode 100644 samples/bpf/xdp_rxq_info_kern.c
 create mode 100644 samples/bpf/xdp_rxq_info_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4fb944a7ecf8..3ff7a05bea9a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
+hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
@@ -90,6 +91,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
+xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 
 # Tell kbuild to always build the programs
@@ -139,6 +141,7 @@ always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
 always += xdp_redirect_cpu_kern.o
 always += xdp_monitor_kern.o
+always += xdp_rxq_info_kern.o
 always += syscall_tp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -182,6 +185,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
+HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
diff --git a/samples/bpf/xdp_rxq_info_kern.c b/samples/bpf/xdp_rxq_info_kern.c
new file mode 100644
index 000000000000..19cbd662202c
--- /dev/null
+++ b/samples/bpf/xdp_rxq_info_kern.c
@@ -0,0 +1,96 @@
+/* Example howto extract XDP RX-queue info
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Config setup from with userspace
+ *
+ * User-side setup ifindex in config_map, to verify that
+ * ctx->ingress_ifindex is correct (against configured ifindex)
+ */
+struct config {
+	__u32 action;
+	int ifindex;
+};
+struct bpf_map_def SEC("maps") config_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(struct config),
+	.max_entries	= 1,
+};
+
+/* Common stats data record (shared with userspace) */
+struct datarec {
+	__u64 processed;
+	__u64 issue;
+};
+
+struct bpf_map_def SEC("maps") stats_global_map = {
+	.type		= BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size	= sizeof(u32),
+	.value_size	= sizeof(struct datarec),
+	.max_entries	= 1,
+};
+
+#define MAX_RXQs 64
+
+/* Stats per rx_queue_index (per CPU) */
+struct bpf_map_def SEC("maps") rx_queue_index_map = {
+	.type		= BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size	= sizeof(u32),
+	.value_size	= sizeof(struct datarec),
+	.max_entries	= MAX_RXQs + 1,
+};
+
+SEC("xdp_prog0")
+int  xdp_prognum0(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data     = (void *)(long)ctx->data;
+	struct datarec *rec, *rxq_rec;
+	int ingress_ifindex;
+	struct config *config;
+	u32 key = 0;
+
+	/* Global stats record */
+	rec = bpf_map_lookup_elem(&stats_global_map, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* Accessing ctx->ingress_ifindex, cause BPF to rewrite BPF
+	 * instructions inside kernel to access xdp_rxq->dev->ifindex
+	 */
+	ingress_ifindex = ctx->ingress_ifindex;
+
+	config = bpf_map_lookup_elem(&config_map, &key);
+	if (!config)
+		return XDP_ABORTED;
+
+	/* Simple test: check ctx provided ifindex is as expected */
+	if (ingress_ifindex != config->ifindex) {
+		/* count this error case */
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	/* Update stats per rx_queue_index. Handle if rx_queue_index
+	 * is larger than stats map can contain info for.
+	 */
+	key = ctx->rx_queue_index;
+	if (key >= MAX_RXQs)
+		key = MAX_RXQs;
+	rxq_rec = bpf_map_lookup_elem(&rx_queue_index_map, &key);
+	if (!rxq_rec)
+		return XDP_ABORTED;
+	rxq_rec->processed++;
+	if (key == MAX_RXQs)
+		rxq_rec->issue++;
+
+	return config->action;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
new file mode 100644
index 000000000000..d062b0e09924
--- /dev/null
+++ b/samples/bpf/xdp_rxq_info_user.c
@@ -0,0 +1,532 @@
+/*
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+static const char *__doc__ = " XDP RX-queue info extract example\n\n"
+	"Monitor how many packets per sec (pps) are received\n"
+	"per NIC RX queue index and which CPU processed the packet\n"
+	;
+
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+#include <locale.h>
+#include <sys/resource.h>
+#include <getopt.h>
+#include <net/if.h>
+#include <time.h>
+
+#include <arpa/inet.h>
+#include <linux/if_link.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "bpf_util.h"
+
+static int ifindex = -1;
+static char ifname_buf[IF_NAMESIZE];
+static char *ifname;
+
+static __u32 xdp_flags;
+
+/* Exit return codes */
+#define EXIT_OK		0
+#define EXIT_FAIL		1
+#define EXIT_FAIL_OPTION	2
+#define EXIT_FAIL_XDP		3
+#define EXIT_FAIL_BPF		4
+#define EXIT_FAIL_MEM		5
+
+static const struct option long_options[] = {
+	{"help",	no_argument,		NULL, 'h' },
+	{"dev",		required_argument,	NULL, 'd' },
+	{"skb-mode",	no_argument,		NULL, 'S' },
+	{"sec",		required_argument,	NULL, 's' },
+	{"no-separators", no_argument,		NULL, 'z' },
+	{"action",	required_argument,	NULL, 'a' },
+	{0, 0, NULL,  0 }
+};
+
+static void int_exit(int sig)
+{
+	fprintf(stderr,
+		"Interrupted: Removing XDP program on ifindex:%d device:%s\n",
+		ifindex, ifname);
+	if (ifindex > -1)
+		set_link_xdp_fd(ifindex, -1, xdp_flags);
+	exit(EXIT_OK);
+}
+
+struct config {
+	__u32 action;
+	int ifindex;
+};
+#define XDP_ACTION_MAX (XDP_TX + 1)
+#define XDP_ACTION_MAX_STRLEN 11
+static const char *xdp_action_names[XDP_ACTION_MAX] = {
+	[XDP_ABORTED]	= "XDP_ABORTED",
+	[XDP_DROP]	= "XDP_DROP",
+	[XDP_PASS]	= "XDP_PASS",
+	[XDP_TX]	= "XDP_TX",
+};
+
+static const char *action2str(int action)
+{
+	if (action < XDP_ACTION_MAX)
+		return xdp_action_names[action];
+	return NULL;
+}
+
+static int parse_xdp_action(char *action_str)
+{
+	size_t maxlen;
+	__u64 action = -1;
+	int i;
+
+	for (i = 0; i < XDP_ACTION_MAX; i++) {
+		maxlen = XDP_ACTION_MAX_STRLEN;
+		if (strncmp(xdp_action_names[i], action_str, maxlen) == 0) {
+			action = i;
+			break;
+		}
+	}
+	return action;
+}
+
+static void list_xdp_actions(void)
+{
+	int i;
+
+	printf("Available XDP --action <options>\n");
+	for (i = 0; i < XDP_ACTION_MAX; i++)
+		printf("\t%s\n", xdp_action_names[i]);
+	printf("\n");
+}
+
+static void usage(char *argv[])
+{
+	int i;
+
+	printf("\nDOCUMENTATION:\n%s\n", __doc__);
+	printf(" Usage: %s (options-see-below)\n", argv[0]);
+	printf(" Listing options:\n");
+	for (i = 0; long_options[i].name != 0; i++) {
+		printf(" --%-12s", long_options[i].name);
+		if (long_options[i].flag != NULL)
+			printf(" flag (internal value:%d)",
+				*long_options[i].flag);
+		else
+			printf(" short-option: -%c",
+				long_options[i].val);
+		printf("\n");
+	}
+	printf("\n");
+	list_xdp_actions();
+}
+
+#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
+static __u64 gettime(void)
+{
+	struct timespec t;
+	int res;
+
+	res = clock_gettime(CLOCK_MONOTONIC, &t);
+	if (res < 0) {
+		fprintf(stderr, "Error with gettimeofday! (%i)\n", res);
+		exit(EXIT_FAIL);
+	}
+	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
+}
+
+/* Common stats data record shared with _kern.c */
+struct datarec {
+	__u64 processed;
+	__u64 issue;
+};
+struct record {
+	__u64 timestamp;
+	struct datarec total;
+	struct datarec *cpu;
+};
+struct stats_record {
+	struct record stats;
+	struct record *rxq;
+};
+
+static struct datarec *alloc_record_per_cpu(void)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	struct datarec *array;
+	size_t size;
+
+	size = sizeof(struct datarec) * nr_cpus;
+	array = malloc(size);
+	memset(array, 0, size);
+	if (!array) {
+		fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus);
+		exit(EXIT_FAIL_MEM);
+	}
+	return array;
+}
+
+static struct record *alloc_record_per_rxq(void)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	struct record *array;
+	size_t size;
+
+	size = sizeof(struct record) * nr_rxqs;
+	array = malloc(size);
+	memset(array, 0, size);
+	if (!array) {
+		fprintf(stderr, "Mem alloc error (nr_rxqs:%u)\n", nr_rxqs);
+		exit(EXIT_FAIL_MEM);
+	}
+	return array;
+}
+
+static struct stats_record *alloc_stats_record(void)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	struct stats_record *rec;
+	int i;
+
+	rec = malloc(sizeof(*rec));
+	memset(rec, 0, sizeof(*rec));
+	if (!rec) {
+		fprintf(stderr, "Mem alloc error\n");
+		exit(EXIT_FAIL_MEM);
+	}
+	rec->rxq = alloc_record_per_rxq();
+	for (i = 0; i < nr_rxqs; i++)
+		rec->rxq[i].cpu = alloc_record_per_cpu();
+
+	rec->stats.cpu = alloc_record_per_cpu();
+	return rec;
+}
+
+static void free_stats_record(struct stats_record *r)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	int i;
+
+	for (i = 0; i < nr_rxqs; i++)
+		free(r->rxq[i].cpu);
+
+	free(r->rxq);
+	free(r->stats.cpu);
+	free(r);
+}
+
+static bool map_collect_percpu(int fd, __u32 key, struct record *rec)
+{
+	/* For percpu maps, userspace gets a value per possible CPU */
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	struct datarec values[nr_cpus];
+	__u64 sum_processed = 0;
+	__u64 sum_issue = 0;
+	int i;
+
+	if ((bpf_map_lookup_elem(fd, &key, values)) != 0) {
+		fprintf(stderr,
+			"ERR: bpf_map_lookup_elem failed key:0x%X\n", key);
+		return false;
+	}
+	/* Get time as close as possible to reading map contents */
+	rec->timestamp = gettime();
+
+	/* Record and sum values from each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		rec->cpu[i].processed = values[i].processed;
+		sum_processed        += values[i].processed;
+		rec->cpu[i].issue = values[i].issue;
+		sum_issue        += values[i].issue;
+	}
+	rec->total.processed = sum_processed;
+	rec->total.issue     = sum_issue;
+	return true;
+}
+
+static void stats_collect(struct stats_record *rec)
+{
+	int fd, i, max_rxqs;
+
+	fd = map_data[1].fd; /* map: stats_global_map */
+	map_collect_percpu(fd, 0, &rec->stats);
+
+	fd = map_data[2].fd; /* map: rx_queue_index_map */
+	max_rxqs = map_data[2].def.max_entries;
+	for (i = 0; i < max_rxqs; i++)
+		map_collect_percpu(fd, i, &rec->rxq[i]);
+}
+
+static double calc_period(struct record *r, struct record *p)
+{
+	double period_ = 0;
+	__u64 period = 0;
+
+	period = r->timestamp - p->timestamp;
+	if (period > 0)
+		period_ = ((double) period / NANOSEC_PER_SEC);
+
+	return period_;
+}
+
+static __u64 calc_pps(struct datarec *r, struct datarec *p, double period_)
+{
+	__u64 packets = 0;
+	__u64 pps = 0;
+
+	if (period_ > 0) {
+		packets = r->processed - p->processed;
+		pps = packets / period_;
+	}
+	return pps;
+}
+
+static __u64 calc_errs_pps(struct datarec *r,
+			    struct datarec *p, double period_)
+{
+	__u64 packets = 0;
+	__u64 pps = 0;
+
+	if (period_ > 0) {
+		packets = r->issue - p->issue;
+		pps = packets / period_;
+	}
+	return pps;
+}
+
+static void stats_print(struct stats_record *stats_rec,
+			struct stats_record *stats_prev,
+			int action)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	double pps = 0, err = 0;
+	struct record *rec, *prev;
+	double t;
+	int rxq;
+	int i;
+
+	/* Header */
+	printf("\nRunning XDP on dev:%s (ifindex:%d) action:%s\n",
+	       ifname, ifindex, action2str(action));
+
+	/* stats_global_map */
+	{
+		char *fmt_rx = "%-15s %-7d %'-11.0f %'-10.0f %s\n";
+		char *fm2_rx = "%-15s %-7s %'-11.0f\n";
+		char *errstr = "";
+
+		printf("%-15s %-7s %-11s %-11s\n",
+		       "XDP stats", "CPU", "pps", "issue-pps");
+
+		rec  =  &stats_rec->stats;
+		prev = &stats_prev->stats;
+		t = calc_period(rec, prev);
+		for (i = 0; i < nr_cpus; i++) {
+			struct datarec *r = &rec->cpu[i];
+			struct datarec *p = &prev->cpu[i];
+
+			pps = calc_pps     (r, p, t);
+			err = calc_errs_pps(r, p, t);
+			if (err > 0)
+				errstr = "invalid-ifindex";
+			if (pps > 0)
+				printf(fmt_rx, "XDP-RX CPU",
+					i, pps, err, errstr);
+		}
+		pps  = calc_pps     (&rec->total, &prev->total, t);
+		err  = calc_errs_pps(&rec->total, &prev->total, t);
+		printf(fm2_rx, "XDP-RX CPU", "total", pps, err);
+	}
+
+	/* rx_queue_index_map */
+	printf("\n%-15s %-7s %-11s %-11s\n",
+	       "RXQ stats", "RXQ:CPU", "pps", "issue-pps");
+
+	for (rxq = 0; rxq < nr_rxqs; rxq++) {
+		char *fmt_rx = "%-15s %3d:%-3d %'-11.0f %'-10.0f %s\n";
+		char *fm2_rx = "%-15s %3d:%-3s %'-11.0f\n";
+		char *errstr = "";
+		int rxq_ = rxq;
+
+		/* Last RXQ in map catch overflows */
+		if (rxq_ == nr_rxqs - 1)
+			rxq_ = -1;
+
+		rec  =  &stats_rec->rxq[rxq];
+		prev = &stats_prev->rxq[rxq];
+		t = calc_period(rec, prev);
+		for (i = 0; i < nr_cpus; i++) {
+			struct datarec *r = &rec->cpu[i];
+			struct datarec *p = &prev->cpu[i];
+
+			pps = calc_pps     (r, p, t);
+			err = calc_errs_pps(r, p, t);
+			if (err > 0) {
+				if (rxq_ == -1)
+					errstr = "map-overflow-RXQ";
+				else
+					errstr = "err";
+			}
+			if (pps > 0)
+				printf(fmt_rx, "rx_queue_index",
+				       rxq_, i, pps, err, errstr);
+		}
+		pps  = calc_pps     (&rec->total, &prev->total, t);
+		err  = calc_errs_pps(&rec->total, &prev->total, t);
+		if (pps || err)
+			printf(fm2_rx, "rx_queue_index", rxq_, "sum", pps, err);
+	}
+}
+
+
+/* Pointer swap trick */
+static inline void swap(struct stats_record **a, struct stats_record **b)
+{
+	struct stats_record *tmp;
+
+	tmp = *a;
+	*a = *b;
+	*b = tmp;
+}
+
+static void stats_poll(int interval, int action)
+{
+	struct stats_record *record, *prev;
+
+	record = alloc_stats_record();
+	prev   = alloc_stats_record();
+	stats_collect(record);
+
+	while (1) {
+		swap(&prev, &record);
+		stats_collect(record);
+		stats_print(record, prev, action);
+		sleep(interval);
+	}
+
+	free_stats_record(record);
+	free_stats_record(prev);
+}
+
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
+	bool use_separators = true;
+	struct config cfg = { 0 };
+	char filename[256];
+	int longindex = 0;
+	int interval = 2;
+	__u32 key = 0;
+	int opt, err;
+
+	char action_str_buf[XDP_ACTION_MAX_STRLEN + 1 /* for \0 */] = { 0 };
+	int action = XDP_PASS; /* Default action */
+	char *action_str = NULL;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return 1;
+	}
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "ERR in load_bpf_file(): %s", bpf_log_buf);
+		return EXIT_FAIL;
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERR: load_bpf_file: %s\n", strerror(errno));
+		return EXIT_FAIL;
+	}
+
+	/* Parse commands line args */
+	while ((opt = getopt_long(argc, argv, "hSd:",
+				  long_options, &longindex)) != -1) {
+		switch (opt) {
+		case 'd':
+			if (strlen(optarg) >= IF_NAMESIZE) {
+				fprintf(stderr, "ERR: --dev name too long\n");
+				goto error;
+			}
+			ifname = (char *)&ifname_buf;
+			strncpy(ifname, optarg, IF_NAMESIZE);
+			ifindex = if_nametoindex(ifname);
+			if (ifindex == 0) {
+				fprintf(stderr,
+					"ERR: --dev name unknown err(%d):%s\n",
+					errno, strerror(errno));
+				goto error;
+			}
+			break;
+		case 's':
+			interval = atoi(optarg);
+			break;
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'z':
+			use_separators = false;
+			break;
+		case 'a':
+			action_str = (char *)&action_str_buf;
+			strncpy(action_str, optarg, XDP_ACTION_MAX_STRLEN);
+			break;
+		case 'h':
+		error:
+		default:
+			usage(argv);
+			return EXIT_FAIL_OPTION;
+		}
+	}
+	/* Required option */
+	if (ifindex == -1) {
+		fprintf(stderr, "ERR: required option --dev missing\n");
+		usage(argv);
+		return EXIT_FAIL_OPTION;
+	}
+	cfg.ifindex = ifindex;
+
+	/* Parse action string */
+	if (action_str) {
+		action = parse_xdp_action(action_str);
+		if (action < 0) {
+			fprintf(stderr, "ERR: Invalid XDP --action: %s\n",
+				action_str);
+			list_xdp_actions();
+			return EXIT_FAIL_OPTION;
+		}
+	}
+	cfg.action = action;
+
+	/* Trick to pretty printf with thousands separators use %' */
+	if (use_separators)
+		setlocale(LC_NUMERIC, "en_US");
+
+	/* User-side setup ifindex in config_map */
+	err = bpf_map_update_elem(map_fd[0], &key, &cfg, 0);
+	if (err) {
+		fprintf(stderr, "Store config failed (err:%d)\n", err);
+		exit(EXIT_FAIL_BPF);
+	}
+
+	/* Remove XDP program when program is interrupted */
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
+		fprintf(stderr, "link set xdp fd failed\n");
+		return EXIT_FAIL_XDP;
+	}
+
+	stats_poll(interval, action);
+	return EXIT_OK;
+}

^ permalink raw reply related

* Re: [PATCH v3 1/4] security: Add support for SCTP security hooks
From: Casey Schaufler @ 2017-12-22 17:20 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, selinux, netdev, linux-sctp,
	linux-security-module
  Cc: paul, vyasevich, nhorman, sds, eparis, richard_c_haines
In-Reply-To: <80e6a4fb06ec6b8f81577abd11827d75dd6689ce.1513940757.git.marcelo.leitner@gmail.com>

On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote:
> From: Richard Haines <richard_c_haines@btinternet.com>
>
> The SCTP security hooks are explained in:
> Documentation/security/LSM-sctp.rst
>
> Signed-off-by: Richard Haines <richard_c_haines@btinternet.com>
> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  Documentation/security/LSM-sctp.rst | 194 ++++++++++++++++++++++++++++++++++++
>  include/linux/lsm_hooks.h           |  35 +++++++
>  include/linux/security.h            |  25 +++++
>  security/security.c                 |  22 ++++
>  4 files changed, 276 insertions(+)
>  create mode 100644 Documentation/security/LSM-sctp.rst
>
> diff --git a/Documentation/security/LSM-sctp.rst b/Documentation/security/LSM-sctp.rst
> new file mode 100644
> index 0000000000000000000000000000000000000000..61373672ce9f63bbd52d953500f44cdf3427c3f0
> --- /dev/null
> +++ b/Documentation/security/LSM-sctp.rst
> @@ -0,0 +1,194 @@
> +SCTP LSM Support
> +================
> +
> +For security module support, three sctp specific hooks have been implemented::
> +
> +    security_sctp_assoc_request()
> +    security_sctp_bind_connect()
> +    security_sctp_sk_clone()
> +
> +Also the following security hook has been utilised::
> +
> +    security_inet_conn_established()
> +
> +The usage of these hooks are described below with the SELinux implementation
> +described in ``Documentation/security/SELinux-sctp.rst``
> +
> +
> +security_sctp_assoc_request()
> +-----------------------------
> +This new hook passes the ``@ep`` and ``@chunk->skb`` (the association INIT
> +packet) to the security module. Returns 0 on success, error on failure.
> +::
> +
> +    @ep - pointer to sctp endpoint structure.
> +    @skb - pointer to skbuff of association packet.
> +
> +The security module performs the following operations:
> +     IF this is the first association on ``@ep->base.sk``, then set the peer
> +     sid to that in ``@skb``. This will ensure there is only one peer sid
> +     assigned to ``@ep->base.sk`` that may support multiple associations.
> +
> +     ELSE validate the ``@ep->base.sk peer_sid`` against the ``@skb peer sid``
> +     to determine whether the association should be allowed or denied.
> +
> +     Set the sctp ``@ep sid`` to socket's sid (from ``ep->base.sk``) with
> +     MLS portion taken from ``@skb peer sid``. This will be used by SCTP
> +     TCP style sockets and peeled off connections as they cause a new socket
> +     to be generated.
> +
> +     If IP security options are configured (CIPSO/CALIPSO), then the ip
> +     options are set on the socket.

Please! Basing the documentation for the infrastructure behavior
on a specific security module implementation makes it *really rough*
to adopt it to a different module. It makes it doubly difficult to
define how it will work with multiple modules. Take the SELinux specifics
out of the documentation for the hooks. Describe the general intention,
not how SELinux uses it.

> +
> +
> +security_sctp_bind_connect()
> +-----------------------------
> +This new hook passes one or more ipv4/ipv6 addresses to the security module
> +for validation based on the ``@optname`` that will result in either a bind or
> +connect service as shown in the permission check tables below.
> +Returns 0 on success, error on failure.
> +::
> +
> +    @sk      - Pointer to sock structure.
> +    @optname - Name of the option to validate.
> +    @address - One or more ipv4 / ipv6 addresses.
> +    @addrlen - The total length of address(s). This is calculated on each
> +               ipv4 or ipv6 address using sizeof(struct sockaddr_in) or
> +               sizeof(struct sockaddr_in6).
> +
> +  ------------------------------------------------------------------
> +  |                     BIND Type Checks                           |
> +  |       @optname             |         @address contains         |
> +  |----------------------------|-----------------------------------|
> +  | SCTP_SOCKOPT_BINDX_ADD     | One or more ipv4 / ipv6 addresses |
> +  | SCTP_PRIMARY_ADDR          | Single ipv4 or ipv6 address       |
> +  | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address       |
> +  ------------------------------------------------------------------
> +
> +  ------------------------------------------------------------------
> +  |                   CONNECT Type Checks                          |
> +  |       @optname             |         @address contains         |
> +  |----------------------------|-----------------------------------|
> +  | SCTP_SOCKOPT_CONNECTX      | One or more ipv4 / ipv6 addresses |
> +  | SCTP_PARAM_ADD_IP          | One or more ipv4 / ipv6 addresses |
> +  | SCTP_SENDMSG_CONNECT       | Single ipv4 or ipv6 address       |
> +  | SCTP_PARAM_SET_PRIMARY     | Single ipv4 or ipv6 address       |
> +  ------------------------------------------------------------------
> +
> +A summary of the ``@optname`` entries is as follows::
> +
> +    SCTP_SOCKOPT_BINDX_ADD - Allows additional bind addresses to be
> +                             associated after (optionally) calling
> +                             bind(3).
> +                             sctp_bindx(3) adds a set of bind
> +                             addresses on a socket.
> +
> +    SCTP_SOCKOPT_CONNECTX - Allows the allocation of multiple
> +                            addresses for reaching a peer
> +                            (multi-homed).
> +                            sctp_connectx(3) initiates a connection
> +                            on an SCTP socket using multiple
> +                            destination addresses.
> +
> +    SCTP_SENDMSG_CONNECT  - Initiate a connection that is generated by a
> +                            sendmsg(2) or sctp_sendmsg(3) on a new asociation.
> +
> +    SCTP_PRIMARY_ADDR     - Set local primary address.
> +
> +    SCTP_SET_PEER_PRIMARY_ADDR - Request peer sets address as
> +                                 association primary.
> +
> +    SCTP_PARAM_ADD_IP          - These are used when Dynamic Address
> +    SCTP_PARAM_SET_PRIMARY     - Reconfiguration is enabled as explained below.
> +
> +
> +To support Dynamic Address Reconfiguration the following parameters must be
> +enabled on both endpoints (or use the appropriate **setsockopt**\(2))::
> +
> +    /proc/sys/net/sctp/addip_enable
> +    /proc/sys/net/sctp/addip_noauth_enable
> +
> +then the following *_PARAM_*'s are sent to the peer in an
> +ASCONF chunk when the corresponding ``@optname``'s are present::
> +
> +          @optname                      ASCONF Parameter
> +         ----------                    ------------------
> +    SCTP_SOCKOPT_BINDX_ADD     ->   SCTP_PARAM_ADD_IP
> +    SCTP_SET_PEER_PRIMARY_ADDR ->   SCTP_PARAM_SET_PRIMARY
> +
> +
> +security_sctp_sk_clone()
> +-------------------------
> +This new hook is called whenever a new socket is created by **accept**\(2)
> +(i.e. a TCP style socket) or when a socket is 'peeled off' e.g userspace
> +calls **sctp_peeloff**\(3). ``security_sctp_sk_clone()`` will set the new
> +sockets sid and peer sid to that contained in the ``@ep sid`` and
> +``@ep peer sid`` respectively.
> +::
> +
> +    @ep - pointer to old sctp endpoint structure.
> +    @sk - pointer to old sock structure.
> +    @sk - pointer to new sock structure.

Again, SELinux may set the sids in this way, but another
module may treat the hook differently. Don't put SELinux
specifics into the infrastructure documentation.

> +
> +
> +security_inet_conn_established()
> +---------------------------------
> +This hook has been added to the receive COOKIE ACK processing where it sets
> +the connection's peer sid to that in ``@skb``::
> +
> +    @sk  - pointer to sock structure.
> +    @skb - pointer to skbuff of the COOKIE ACK packet.
> +
> +
> +Security Hooks used for Association Establishment
> +=================================================
> +The following diagram shows the use of ``security_sctp_connect_bind()``,
> +``security_sctp_assoc_request()``, ``security_inet_conn_established()`` when
> +establishing an association.
> +::
> +
> +      SCTP endpoint "A"                                SCTP endpoint "Z"
> +      =================                                =================
> +    sctp_sf_do_prm_asoc()
> + Association setup can be initiated
> + by a connect(2), sctp_connectx(3),
> + sendmsg(2) or sctp_sendmsg(3).
> + These will result in a call to
> + security_sctp_bind_connect() to
> + initiate an association to
> + SCTP peer endpoint "Z".
> +         INIT --------------------------------------------->
> +                                                   sctp_sf_do_5_1B_init()
> +                                                 Respond to an INIT chunk.
> +                                             SCTP peer endpoint "A" is
> +                                             asking for an association. Call
> +                                             security_sctp_assoc_request()
> +                                             to set the peer label if first
> +                                             association.
> +                                             If not first association, check
> +                                             whether allowed, IF so send:
> +          <----------------------------------------------- INIT ACK
> +          |                                  ELSE audit event and silently
> +          |                                       discard the packet.
> +          |
> +    COOKIE ECHO ------------------------------------------>
> +                                                          |
> +                                                          |
> +                                                          |
> +          <------------------------------------------- COOKIE ACK
> +          |                                               |
> +    sctp_sf_do_5_1E_ca                                    |
> + Call security_inet_conn_established()                    |
> + to set the correct peer sid.                             |
> +          |                                               |
> +          |                               If SCTP_SOCKET_TCP or peeled off
> +          |                               socket security_sctp_sk_clone() is
> +          |                               called to clone the new socket.
> +          |                                               |
> +      ESTABLISHED                                    ESTABLISHED
> +          |                                               |
> +    ------------------------------------------------------------------
> +    |                     Association Established                    |
> +    ------------------------------------------------------------------
> +
> +
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index c9258124e41757187cdb8b2f83c5901966345902..92ee9c6c604212ce38590bd2e5fcba55617b9c04 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -906,6 +906,32 @@
>   *	associated with the TUN device's security structure.
>   *	@security pointer to the TUN devices's security structure.
>   *
> + * Security hooks for SCTP
> + *
> + * @sctp_assoc_request:
> + *	If first association, then set the peer sid to that in @skb. If
> + *	@sctp_cid is from an INIT chunk, then set the sctp endpoint sid to
> + *	socket's sid (ep->base.sk) with MLS portion taken from peer sid.

Don't direct the module hook to do something. If the SCTP
code knows what every module should do, why not have the
SCTP code do it? Do not describe the hook in terms of the
SELinux behavior.

> + *	@ep pointer to sctp endpoint structure.
> + *	@skb pointer to skbuff of association packet.
> + *	Return 0 on success, error on failure.
> + * @sctp_bind_connect:
> + *	Validiate permissions required for each address associated with sock
> + *	@sk. Depending on @optname, the addresses will be treated as either
> + *	for a connect or bind service. The @addrlen is calculated on each
> + *	ipv4 and ipv6 address using sizeof(struct sockaddr_in) or
> + *	sizeof(struct sockaddr_in6).
> + *	@sk pointer to sock structure.
> + *	@optname name of the option to validate.
> + *	@address list containing one or more ipv4/ipv6 addresses.
> + *	@addrlen total length of address(s).
> + *	Return 0 on success, error on failure.
> + * @sctp_sk_clone:
> + *	Sets the new child socket's sid to the old endpoint sid.

... and again.

> + *	@ep pointer to old sctp endpoint structure.
> + *	@sk pointer to old sock structure.
> + *	@sk pointer to new sock structure.
> + *
>   * Security hooks for Infiniband
>   *
>   * @ib_pkey_access:
> @@ -1631,6 +1657,12 @@ union security_list_options {
>  	int (*tun_dev_attach_queue)(void *security);
>  	int (*tun_dev_attach)(struct sock *sk, void *security);
>  	int (*tun_dev_open)(void *security);
> +	int (*sctp_assoc_request)(struct sctp_endpoint *ep,
> +				  struct sk_buff *skb);
> +	int (*sctp_bind_connect)(struct sock *sk, int optname,
> +				 struct sockaddr *address, int addrlen);
> +	void (*sctp_sk_clone)(struct sctp_endpoint *ep, struct sock *sk,
> +			      struct sock *newsk);
>  #endif	/* CONFIG_SECURITY_NETWORK */
>  
>  #ifdef CONFIG_SECURITY_INFINIBAND
> @@ -1869,6 +1901,9 @@ struct security_hook_heads {
>  	struct list_head tun_dev_attach_queue;
>  	struct list_head tun_dev_attach;
>  	struct list_head tun_dev_open;
> +	struct list_head sctp_assoc_request;
> +	struct list_head sctp_bind_connect;
> +	struct list_head sctp_sk_clone;
>  #endif	/* CONFIG_SECURITY_NETWORK */
>  #ifdef CONFIG_SECURITY_INFINIBAND
>  	struct list_head ib_pkey_access;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index ce6265960d6c430a90e1ad3c3749d0a438ecaca9..51f6cc2417f278674dfbd434587af805cb0c03d3 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -115,6 +115,7 @@ struct xfrm_policy;
>  struct xfrm_state;
>  struct xfrm_user_sec_ctx;
>  struct seq_file;
> +struct sctp_endpoint;
>  
>  #ifdef CONFIG_MMU
>  extern unsigned long mmap_min_addr;
> @@ -1229,6 +1230,11 @@ int security_tun_dev_create(void);
>  int security_tun_dev_attach_queue(void *security);
>  int security_tun_dev_attach(struct sock *sk, void *security);
>  int security_tun_dev_open(void *security);
> +int security_sctp_assoc_request(struct sctp_endpoint *ep, struct sk_buff *skb);
> +int security_sctp_bind_connect(struct sock *sk, int optname,
> +			       struct sockaddr *address, int addrlen);
> +void security_sctp_sk_clone(struct sctp_endpoint *ep, struct sock *sk,
> +			    struct sock *newsk);
>  
>  #else	/* CONFIG_SECURITY_NETWORK */
>  static inline int security_unix_stream_connect(struct sock *sock,
> @@ -1421,6 +1427,25 @@ static inline int security_tun_dev_open(void *security)
>  {
>  	return 0;
>  }
> +
> +static inline int security_sctp_assoc_request(struct sctp_endpoint *ep,
> +					      struct sk_buff *skb)
> +{
> +	return 0;
> +}
> +
> +static inline int security_sctp_bind_connect(struct sock *sk, int optname,
> +					     struct sockaddr *address,
> +					     int addrlen)
> +{
> +	return 0;
> +}
> +
> +static inline void security_sctp_sk_clone(struct sctp_endpoint *ep,
> +					  struct sock *sk,
> +					  struct sock *newsk)
> +{
> +}
>  #endif	/* CONFIG_SECURITY_NETWORK */
>  
>  #ifdef CONFIG_SECURITY_INFINIBAND
> diff --git a/security/security.c b/security/security.c
> index 4bf0f571b4ef94df1d3c44b7fed6b7b651c1924f..1400678f6b72b36123f2fa2b909f35d257a62cd4 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1472,6 +1472,7 @@ void security_inet_conn_established(struct sock *sk,
>  {
>  	call_void_hook(inet_conn_established, sk, skb);
>  }
> +EXPORT_SYMBOL(security_inet_conn_established);
>  
>  int security_secmark_relabel_packet(u32 secid)
>  {
> @@ -1527,6 +1528,27 @@ int security_tun_dev_open(void *security)
>  }
>  EXPORT_SYMBOL(security_tun_dev_open);
>  
> +int security_sctp_assoc_request(struct sctp_endpoint *ep, struct sk_buff *skb)
> +{
> +	return call_int_hook(sctp_assoc_request, 0, ep, skb);
> +}
> +EXPORT_SYMBOL(security_sctp_assoc_request);
> +
> +int security_sctp_bind_connect(struct sock *sk, int optname,
> +			       struct sockaddr *address, int addrlen)
> +{
> +	return call_int_hook(sctp_bind_connect, 0, sk, optname,
> +			     address, addrlen);
> +}
> +EXPORT_SYMBOL(security_sctp_bind_connect);
> +
> +void security_sctp_sk_clone(struct sctp_endpoint *ep, struct sock *sk,
> +			    struct sock *newsk)
> +{
> +	call_void_hook(sctp_sk_clone, ep, sk, newsk);
> +}
> +EXPORT_SYMBOL(security_sctp_sk_clone);
> +
>  #endif	/* CONFIG_SECURITY_NETWORK */
>  
>  #ifdef CONFIG_SECURITY_INFINIBAND


^ permalink raw reply

* Re: [PATCH net 2/3] rds: tcp: initialize t_tcp_detached to false
From: Sowmini Varadhan @ 2017-12-22 17:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: rds-devel, santosh.shilimkar
In-Reply-To: <0500544d8e065679e6e31c4ad2793758d2160aba.1513946398.git.sowmini.varadhan@oracle.com>

On (12/22/17 08:14), Sowmini Varadhan wrote:
> Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from
> rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
> but this needs to be initialized explicitly to false.

I just realized that t_tcp_detached (and the above commit) has not 
made its way to net yet, so patch 2 should apply to net-next (not net).

Please ignore this patch series,  I'll submit a V2.
Apologies for the confusion.

--Sowmini

^ permalink raw reply

* Re: [PATCH v3 1/4] security: Add support for SCTP security hooks
From: Marcelo Ricardo Leitner @ 2017-12-22 17:45 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: selinux, netdev, linux-sctp, linux-security-module, paul,
	vyasevich, nhorman, sds, eparis, richard_c_haines
In-Reply-To: <50a04eaf-6130-74d0-43e4-b09bd4de2180@schaufler-ca.com>

On Fri, Dec 22, 2017 at 09:20:45AM -0800, Casey Schaufler wrote:
> On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote:
> > From: Richard Haines <richard_c_haines@btinternet.com>
> >
> > The SCTP security hooks are explained in:
> > Documentation/security/LSM-sctp.rst

Thanks Casey for your comments. However, I'm not that acquainted with
these area of codes and I cannot work on them. I'll just wait for
Richard then.

  Marcelo

^ permalink raw reply

* [PATCH V2 net-next 0/3] rds bug fixes
From: Sowmini Varadhan @ 2017-12-22 17:38 UTC (permalink / raw)
  To: netdev; +Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Ran into pre-existing bugs when working on the fix for
   https://www.spinics.net/lists/netdev/msg472849.html

The bugs fixed in this patchset are unrelated to the syzbot 
failure (which I'm still testing and trying to reproduce) but 
meanwhile, let's get these fixes out of the way.

V2: target net-next (rds:tcp patches have a dependancy on 
changes that are in net-next, but not yet in net)

Sowmini Varadhan (3):
  rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
  rds: tcp: initialized t_tcp_detached to false
  rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

 net/rds/bind.c |    1 +
 net/rds/tcp.c  |   47 +++++++++++++++++++++++++++--------------------
 2 files changed, 28 insertions(+), 20 deletions(-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox