Netdev List
* Re: [PATCH 1/2 net-next] net_failover: fix net_failover_compute_features()
From: David Miller @ 2018-06-04 13:31 UTC (permalink / raw)
  To: dan.carpenter; +Cc: sridhar.samudrala, netdev, kernel-janitors
In-Reply-To: <20180531120124.pc4txiifxnrslbei@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 31 May 2018 15:01:25 +0300

> @@ -380,7 +380,8 @@ static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
>  
>  static void net_failover_compute_features(struct net_device *dev)
>  {
> -	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
> +	netdev_features_t vlan_features = FAILOVER_VLAN_FEATURES |
> +					  NETIF_F_ALL_FOR_ALL;

The type does need to be corrected to netdev_features_t, but the
logical operation is correct.

It's a policy operation that was simply propagated by hand to all
the places where these kinds of calculations are performed.

So vlan_features intentionally starts with a value of 0.
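
The point about the starting value can be illustrated with a minimal
userspace sketch. Everything in it is an assumption made for illustration:
demo_features_t stands in for netdev_features_t, and the bit positions are
invented rather than taken from the kernel headers. It shows that ANDing two
disjoint masks starts the accumulator at 0, and that a 32-bit variable would
silently drop any feature bit above bit 31.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for netdev_features_t; all bit values below are hypothetical. */
typedef uint64_t demo_features_t;

#define DEMO_F_SG           (1ULL << 0)
#define DEMO_F_CSUM         (1ULL << 1)
#define DEMO_F_HIGH         (1ULL << 40)  /* a feature bit above bit 31 */
#define DEMO_F_ALL_FOR_ALL  (1ULL << 5)   /* disjoint from the set below */
#define DEMO_VLAN_FEATURES  (DEMO_F_SG | DEMO_F_CSUM | DEMO_F_HIGH)

/* Starting value with '&': the masks share no bits, so the result is 0. */
static demo_features_t demo_start_and(void)
{
	demo_features_t start = DEMO_VLAN_FEATURES & DEMO_F_ALL_FOR_ALL;

	assert(start == 0);
	return start;
}

/* Starting value with '|', as the quoted patch proposed. */
static demo_features_t demo_start_or(void)
{
	return DEMO_VLAN_FEATURES | DEMO_F_ALL_FOR_ALL;
}

/* What a u32 variable would keep of a 64-bit feature mask. */
static uint32_t demo_truncate(demo_features_t features)
{
	return (uint32_t)features;
}
```

With these invented values, demo_truncate(DEMO_VLAN_FEATURES) loses
DEMO_F_HIGH, which is the kind of loss the type correction guards against.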


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Eric Dumazet @ 2018-06-04 13:22 UTC (permalink / raw)
  To: Michal Hocko, Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim
In-Reply-To: <20180604131104.GS19202@dhcp22.suse.cz>



On 06/04/2018 06:11 AM, Michal Hocko wrote:
> On Thu 31-05-18 11:10:22, Michal Hocko wrote:

> Just in case you are interested
> ---
> From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 4 Jun 2018 15:07:37 +0200
> Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags
> 
> alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
> which is just a noop and a little bit confusing.
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to avoid invoking the OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant.
> 
> Drop the pointless __GFP_NORETRY because this function is used as
> copy&paste source for other places.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !


* Re: [bpf-next V2 PATCH 3/8] ixgbe: implement flush flag for ndo_xdp_xmit
From: Daniel Borkmann @ 2018-06-04 13:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: liu.song.a23, songliubraving, John Fastabend
In-Reply-To: <152775719796.24817.11035788244128769860.stgit@firesoul>

On 05/31/2018 10:59 AM, Jesper Dangaard Brouer wrote:
> When passed the XDP_XMIT_FLUSH flag ixgbe_xdp_xmit now performs the
> same kind of ring tail update as in ixgbe_xdp_flush.  The update tail
> code in ixgbe_xdp_flush is generalized and shared with ixgbe_xdp_xmit.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 87f088f4af52..4fd77c9067f2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -10022,6 +10022,15 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>  
> +static void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
> +{
> +	/* Force memory writes to complete before letting h/w know there
> +	 * are new descriptors to fetch.
> +	 */
> +	wmb();
> +	writel(ring->next_to_use, ring->tail);
> +}

Did you double check that this doesn't become a function call? Should this
get an __always_inline attribute?
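
A rough userspace sketch of the forced-inline variant being asked about. The
kernel's wmb() and writel() are not available outside the kernel, so a release
fence and a volatile store stand in for them here; the struct and function
names are hypothetical. The attribute spelling is the GCC/Clang one that the
kernel's __always_inline macro expands to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's ring structure. */
struct demo_ring {
	uint32_t next_to_use;
	volatile uint32_t *tail;	/* models the memory-mapped tail register */
};

/* Forced inline so the helper never turns into an out-of-line call. */
static inline __attribute__((always_inline))
void demo_ring_update_tail(struct demo_ring *ring)
{
	assert(ring && ring->tail);

	/* Stand-in for wmb(): order descriptor writes before the tail write. */
	__atomic_thread_fence(__ATOMIC_RELEASE);
	/* Stand-in for writel(): publish the new tail index. */
	*ring->tail = ring->next_to_use;
}

/* Small driver for the helper: returns the value the "register" ends up with. */
static uint32_t demo_flush(uint32_t next_to_use)
{
	uint32_t tail_reg = 0;
	struct demo_ring ring = { .next_to_use = next_to_use, .tail = &tail_reg };

	demo_ring_update_tail(&ring);
	return tail_reg;
}
```

Whether the compiler would inline the original helper without the attribute
can only be answered by looking at the generated object code for the driver.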

> +
>  static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  			  struct xdp_frame **frames, u32 flags)
>  {
> @@ -10033,7 +10042,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  	if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
>  		return -ENETDOWN;
>  
> -	if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>  		return -EINVAL;
>  
>  	/* During program transitions its possible adapter->xdp_prog is assigned
> @@ -10054,6 +10063,9 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  		}
>  	}
>  
> +	if (unlikely(flags & XDP_XMIT_FLUSH))
> +		ixgbe_xdp_ring_update_tail(ring);
> +
>  	return n - drops;
>  }
>  
> @@ -10072,11 +10084,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
>  	if (unlikely(!ring))
>  		return;
>  
> -	/* Force memory writes to complete before letting h/w know there
> -	 * are new descriptors to fetch.
> -	 */
> -	wmb();
> -	writel(ring->next_to_use, ring->tail);
> +	ixgbe_xdp_ring_update_tail(ring);
>  
>  	return;
>  }
> 


* Re: [PATCH] samples/bpf: Add xdp_sample_pkts example
From: Daniel Borkmann @ 2018-06-04 13:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Song Liu; +Cc: Networking
In-Reply-To: <87r2lm1z87.fsf@toke.dk>

On 06/04/2018 03:02 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 06/02/2018 06:22 AM, Daniel Borkmann wrote:
>>> On 05/31/2018 11:44 AM, Toke Høiland-Jørgensen wrote:
>>>> Song Liu <liu.song.a23@gmail.com> writes:
>>>>> On Wed, May 30, 2018 at 9:45 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>> This adds an example program showing how to sample packets from XDP using
>>>>>> the perf event buffer. The example userspace program just prints the
>>>>>> ethernet header for every packet sampled.
>>>>>>
>>>>>> Most of the userspace code is borrowed from other examples, most notably
>>>>>> trace_output.
>>>>>>
>>>>>> Note that the example only works when everything runs on CPU0; so
>>>>>> suitable smp_affinity needs to be set on the device. Some drivers seem
>>>>>> to reset smp_affinity when loading an XDP program, so it may be
>>>>>> necessary to change it after starting the example userspace program.
>>>>>
>>>>> Why does this only work when everything runs on CPU0? Is this
>>>>> something we can improve?
>>>>
>>>> Yeah, good question. Basically, the call from XDP to
>>>> bpf_perf_event_output() will fail with -EOPNOTSUPP. I tracked this down
>>>> to this if statement in __bpf_perf_event_output() in bpf_trace.c:
>>>>
>>>>> 	if (unlikely(event->oncpu != cpu))
>>>>> 		return -EOPNOTSUPP;
>>>>
>>>> I *think* that the way to fix this is for the userspace program to open
>>>> a perf file descriptor for each CPU in the system and poll all of them,
>>>> in which case the XDP program can pass the BPF_F_CURRENT_CPU flag to
>>>> access the right one.
>>> That is correct, you need one perf fd per cpu, and map them accordingly
>>> into the map slots when you use BPF_F_CURRENT_CPU.
>>
>> Given this is a sample that users are likely to copy from, I think it would
>> be great if you could fix this up so you can just pass in BPF_F_CURRENT_CPU
>> eventually. Thanks for working on this, Toke!
> 
> You're welcome! And yup, I was planning to. I'll need to add a new
> function to the trace helpers that can poll more than one fd; just
> haven't gotten around to it yet. :)

Ok, great, looking forward!

Cheers,
Daniel


* Re: [bpf-next V2 PATCH 2/8] i40e: implement flush flag for ndo_xdp_xmit
From: Daniel Borkmann @ 2018-06-04 13:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: liu.song.a23, songliubraving, John Fastabend
In-Reply-To: <152775719291.24817.3098409990616007642.stgit@firesoul>

On 05/31/2018 10:59 AM, Jesper Dangaard Brouer wrote:
> When passed the XDP_XMIT_FLUSH flag i40e_xdp_xmit now performs the
> same kind of ring tail update as in i40e_xdp_flush.  The advantage is
> that all the necessary checks have been performed and xdp_ring can be
> updated, instead of having to perform the exact same steps/checks in
> i40e_xdp_flush
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index c0451d6e0790..5f01e4ce9c92 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3676,6 +3676,7 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  	struct i40e_netdev_priv *np = netdev_priv(dev);
>  	unsigned int queue_index = smp_processor_id();
>  	struct i40e_vsi *vsi = np->vsi;
> +	struct i40e_ring *xdp_ring;
>  	int drops = 0;
>  	int i;
>  
> @@ -3685,20 +3686,25 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>  		return -ENXIO;
>  
> -	if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>  		return -EINVAL;
>  
> +	xdp_ring = vsi->xdp_rings[queue_index];
> +
>  	for (i = 0; i < n; i++) {
>  		struct xdp_frame *xdpf = frames[i];
>  		int err;
>  
> -		err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
> +		err = i40e_xmit_xdp_ring(xdpf, xdp_ring);
>  		if (err != I40E_XDP_TX) {
>  			xdp_return_frame_rx_napi(xdpf);
>  			drops++;
>  		}
>  	}
>  
> +	if (unlikely(flags & XDP_XMIT_FLUSH))
> +		i40e_xdp_ring_update_tail(xdp_ring);

In addition to Alexei's feedback, I'd remove the unlikely() on the flush from here and the
ixgbe one like you did on the rest of the drivers in the series, just let the CPU decide.

For the invalid flags case it's totally fine and in fact you could probably do this for all
three cases where you bail out in the beginning of i40e_xdp_xmit() and won't be able to
send anything anyway:

        if (test_bit(__I40E_VSI_DOWN, vsi->state))
                return -ENETDOWN;

        if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
                return -ENXIO;

        if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
                return -EINVAL;
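
A self-contained userspace sketch of that suggestion, assuming a made-up
device model: the three early bail-outs carry a branch hint, while the flush
check carries none. demo_unlikely() is a local stand-in for the kernel's
unlikely(), and every demo_/DEMO_ name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's unlikely() branch hint. */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

#define DEMO_XMIT_FLUSH      1U
#define DEMO_XMIT_FLAGS_MASK DEMO_XMIT_FLUSH

/* Hypothetical device state, reduced to what the checks need. */
struct demo_dev {
	bool down;
	bool xdp_enabled;
	unsigned int num_queues;
	unsigned int tail_updates;
};

static int demo_xmit(struct demo_dev *dev, unsigned int queue, int n,
		     uint32_t flags)
{
	assert(dev);

	/* Error paths: nothing can be sent, so hint them as cold. */
	if (demo_unlikely(dev->down))
		return -ENETDOWN;
	if (demo_unlikely(!dev->xdp_enabled || queue >= dev->num_queues))
		return -ENXIO;
	if (demo_unlikely(flags & ~DEMO_XMIT_FLAGS_MASK))
		return -EINVAL;

	/* Frames would be queued on the ring here. */

	/* No hint on the flush: leave it to the CPU's branch predictor. */
	if (flags & DEMO_XMIT_FLUSH)
		dev->tail_updates++;

	return n;
}
```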

Thanks,
Daniel


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Michal Hocko @ 2018-06-04 13:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim
In-Reply-To: <20180531091022.GL15278@dhcp22.suse.cz>

On Thu 31-05-18 11:10:22, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
> > > I merely copied/pasted from alloc_skb_with_frags() :/
> > 
> > I will have a look at it. Thanks!
> 
> OK, so this is an example of an incremental development ;).
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to avoid invoking the OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?

Just in case you are interested
---
From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 4 Jun 2018 15:07:37 +0200
Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags

alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to avoid invoking the OOM killer. Yet this was
not enough because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.

Drop the pointless __GFP_NORETRY because this function is used as
copy&paste source for other places.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/skbuff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 857e4e6f751a..c1f22adc30de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5239,8 +5239,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 			if (npages >= 1 << order) {
 				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
-						   __GFP_NOWARN |
-						   __GFP_NORETRY,
+						   __GFP_NOWARN,
 						   order);
 				if (page)
 					goto fill_page;
-- 
2.17.0

-- 
Michal Hocko
SUSE Labs
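
The redundancy argument in the commit message can be written down as a toy
model. The flag bits below are hypothetical placeholders, not the kernel's
values, and demo_noretry_has_effect() only encodes the rule stated in the
commit message (NORETRY is a no-op for non-sleeping allocations); it is not
how the page allocator is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real values live in the kernel's gfp headers. */
#define DEMO_GFP_DIRECT_RECLAIM (1u << 0)
#define DEMO_GFP_NORETRY        (1u << 1)
#define DEMO_GFP_NOWARN         (1u << 2)
#define DEMO_GFP_COMP           (1u << 3)

/* Mask for the opportunistic high-order attempt, as in the patched code. */
static unsigned int demo_high_order_mask(unsigned int gfp_mask)
{
	unsigned int mask = (gfp_mask & ~DEMO_GFP_DIRECT_RECLAIM) |
			    DEMO_GFP_COMP | DEMO_GFP_NOWARN;

	assert(!(mask & DEMO_GFP_DIRECT_RECLAIM));
	return mask;
}

/* Toy rule: NORETRY only matters when the allocation may direct-reclaim. */
static bool demo_noretry_has_effect(unsigned int mask)
{
	return (mask & DEMO_GFP_NORETRY) && (mask & DEMO_GFP_DIRECT_RECLAIM);
}
```

Since demo_high_order_mask() always clears the direct-reclaim bit, adding
DEMO_GFP_NORETRY to its result never changes the outcome of the toy rule.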


* [PATCH net-next] qed: use dma_zalloc_coherent instead of allocator/memset
From: YueHaibing @ 2018-06-04 13:10 UTC (permalink / raw)
  To: davem, Ariel.Elior; +Cc: netdev, linux-kernel, everest-linux-l2, YueHaibing

Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/ethernet/qlogic/qed/qed_cxt.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 820b226..1835f00 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -936,14 +936,13 @@ static int qed_cxt_src_t2_alloc(struct qed_hwfn *p_hwfn)
 		u32 size = min_t(u32, total_size, psz);
 		void **p_virt = &p_mngr->t2[i].p_virt;
 
-		*p_virt = dma_alloc_coherent(&p_hwfn->cdev->pdev->dev,
-					     size,
-					     &p_mngr->t2[i].p_phys, GFP_KERNEL);
+		*p_virt = dma_zalloc_coherent(&p_hwfn->cdev->pdev->dev,
+					      size, &p_mngr->t2[i].p_phys,
+					      GFP_KERNEL);
 		if (!p_mngr->t2[i].p_virt) {
 			rc = -ENOMEM;
 			goto t2_fail;
 		}
-		memset(*p_virt, 0, size);
 		p_mngr->t2[i].size = size;
 		total_size -= size;
 	}
-- 
2.7.0
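
A userspace analogy for this kind of cleanup, offered only as a sketch: the
DMA API cannot be exercised outside the kernel, so malloc()/memset() and
calloc() stand in for the allocate-then-clear and zeroing-allocator forms.
All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, check, then clear in a separate step. */
static void *demo_alloc_then_clear(size_t size)
{
	void *p = malloc(size);

	if (!p)
		return NULL;
	memset(p, 0, size);
	return p;
}

/* After: one call that returns already-zeroed memory. */
static void *demo_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Returns 1 if every byte of the buffer is zero. */
static int demo_is_zeroed(const void *p, size_t size)
{
	const unsigned char *bytes = p;
	size_t i;

	assert(p);
	for (i = 0; i < size; i++)
		if (bytes[i] != 0)
			return 0;
	return 1;
}
```

Both helpers hand back zeroed memory; the second form just removes the
separate memset() step and the chance of clearing with a mismatched size.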


* [PATCH net-next] wan/fsl_ucc_hdlc: use dma_zalloc_coherent instead of allocator/memset
From: YueHaibing @ 2018-06-04 13:07 UTC (permalink / raw)
  To: davem, qiang.zhao; +Cc: netdev, linux-kernel, linuxppc-dev, YueHaibing

Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/wan/fsl_ucc_hdlc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 33df764..4205dfd 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -270,10 +270,10 @@ static int uhdlc_init(struct ucc_hdlc_private *priv)
 	iowrite16be(DEFAULT_HDLC_ADDR, &priv->ucc_pram->haddr4);
 
 	/* Get BD buffer */
-	bd_buffer = dma_alloc_coherent(priv->dev,
-				       (RX_BD_RING_LEN + TX_BD_RING_LEN) *
-				       MAX_RX_BUF_LENGTH,
-				       &bd_dma_addr, GFP_KERNEL);
+	bd_buffer = dma_zalloc_coherent(priv->dev,
+					(RX_BD_RING_LEN + TX_BD_RING_LEN) *
+					MAX_RX_BUF_LENGTH,
+					&bd_dma_addr, GFP_KERNEL);
 
 	if (!bd_buffer) {
 		dev_err(priv->dev, "Could not allocate buffer descriptors\n");
@@ -281,9 +281,6 @@ static int uhdlc_init(struct ucc_hdlc_private *priv)
 		goto free_tiptr;
 	}
 
-	memset(bd_buffer, 0, (RX_BD_RING_LEN + TX_BD_RING_LEN)
-			* MAX_RX_BUF_LENGTH);
-
 	priv->rx_buffer = bd_buffer;
 	priv->tx_buffer = bd_buffer + RX_BD_RING_LEN * MAX_RX_BUF_LENGTH;
 
-- 
2.7.0


* Re: [PATCH] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-04 13:02 UTC (permalink / raw)
  To: Daniel Borkmann, Song Liu; +Cc: Networking
In-Reply-To: <672f2d99-f44d-7605-7c07-e9b6315f0bcd@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 06/02/2018 06:22 AM, Daniel Borkmann wrote:
>> On 05/31/2018 11:44 AM, Toke Høiland-Jørgensen wrote:
>>> Song Liu <liu.song.a23@gmail.com> writes:
>>>
>>>> On Wed, May 30, 2018 at 9:45 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>> This adds an example program showing how to sample packets from XDP using
>>>>> the perf event buffer. The example userspace program just prints the
>>>>> ethernet header for every packet sampled.
>>>>>
>>>>> Most of the userspace code is borrowed from other examples, most notably
>>>>> trace_output.
>>>>>
>>>>> Note that the example only works when everything runs on CPU0; so
>>>>> suitable smp_affinity needs to be set on the device. Some drivers seem
>>>>> to reset smp_affinity when loading an XDP program, so it may be
>>>>> necessary to change it after starting the example userspace program.
>>>>
>>>> Why does this only work when everything runs on CPU0? Is this
>>>> something we can improve?
>>>
>>> Yeah, good question. Basically, the call from XDP to
>>> bpf_perf_event_output() will fail with -EOPNOTSUPP. I tracked this down
>>> to this if statement in __bpf_perf_event_output() in bpf_trace.c:
>>>
>>>> 	if (unlikely(event->oncpu != cpu))
>>>> 		return -EOPNOTSUPP;
>>>
>>> I *think* that the way to fix this is for the userspace program to open
>>> a perf file descriptor for each CPU in the system and poll all of them,
>>> in which case the XDP program can pass the BPF_F_CURRENT_CPU flag to
>>> access the right one.
>> That is correct, you need one perf fd per cpu, and map them accordingly
>> into the map slots when you use BPF_F_CURRENT_CPU.
>
> Given this is a sample that users are likely to copy from, I think it would
> be great if you could fix this up so you can just pass in BPF_F_CURRENT_CPU
> eventually. Thanks for working on this, Toke!

You're welcome! And yup, I was planning to. I'll need to add a new
function to the trace helpers that can poll more than one fd; just
haven't gotten around to it yet. :)
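
For illustration, a helper that polls more than one fd could look roughly
like the sketch below. It is not the trace helper that was eventually written;
the function names, the fixed DEMO_MAX_FDS limit and the return conventions
are all assumptions. It uses only poll(2) and read(2).

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define DEMO_MAX_FDS 64	/* arbitrary limit for this sketch */

/* Returns the index of the first readable fd, or -1 on timeout or error. */
static int demo_poll_fds(const int *fds, size_t nfds, int timeout_ms)
{
	struct pollfd pfds[DEMO_MAX_FDS];
	size_t i;

	assert(fds && nfds > 0 && nfds <= DEMO_MAX_FDS);

	for (i = 0; i < nfds; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
		return -1;

	for (i = 0; i < nfds; i++)
		if (pfds[i].revents & POLLIN)
			return (int)i;
	return -1;
}

/* Polls all fds, then reads from the first ready one. Returns bytes read or -1. */
static ssize_t demo_read_ready(const int *fds, size_t nfds, void *buf,
			       size_t len, int timeout_ms)
{
	int idx = demo_poll_fds(fds, nfds, timeout_ms);

	if (idx < 0)
		return -1;
	return read(fds[idx], buf, len);
}
```

With one perf fd per CPU, the fds array would hold one entry per CPU, and a
real consumer would walk the perf ring buffer of the ready fd instead of
calling read().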

-Toke


* Re: 答复: ANNOUNCE: Enhanced IP v1.4
From: Eric Dumazet @ 2018-06-04 13:02 UTC (permalink / raw)
  To: PKU.孙斌, 'Willy Tarreau',
	'Eric Dumazet'
  Cc: 'Linux Kernel Network Developers'
In-Reply-To: <042801d3fbc9$02818fc0$0784af40$@pku.edu.cn>



On 06/03/2018 10:58 PM, PKU.孙斌 wrote:
> On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
>>
>>
>> On 06/03/2018 01:37 PM, Tom Herbert wrote:
>>
>>> This is not an inconsequential mechanism that is being proposed. It's
>>> a modification to IP protocol that is intended to work on the
>>> Internet, but it looks like the draft hasn't been updated for two
>>> years and it is not adopted by any IETF working group. I don't see how
>>> this can go anywhere without IETF support. Also, I suggest that you
>>> look at the IPv10 proposal since that was very similar in intent. One
>>> of the reasons that IPv10 was shot down is that protocol transition
>>> mechanisms were more interesting ten years ago than today. IPv6 has
>>> good traction now. In fact, it's probably the case that it's now
>>> easier to bring up IPv6 than to try to make IPv4 options work over the
>>> Internet.
>>
>> +1
>>
>> Many hosts do not use IPv4 anymore.
>>
>> We even have a project to make IPv4 support in Linux optional.
> 
> I guess then Linux kernel wouldn't be able to boot itself without IPv4 built in, e.g., when we only have old L2 links (without the IPv6 frame type)...



*Optional* means that a CONFIG_IPV4 option would exist, and some people could build a kernel with CONFIG_IPV4=n,
just as IPv6 is optional today.

Of course, most distros will select CONFIG_IPV4=y (as they probably select CONFIG_IPV6=y today).

Do not worry, IPv4 is not dead, but I doubt Enhanced IP v1.4 has any chance;
it is at least 10 years too late.


* Re: [PATCH RFC ipsec-next 0/3] Virtual xfrm interfaces
From: David Miller @ 2018-06-04 12:58 UTC (permalink / raw)
  To: steffen.klassert
  Cc: netdev, eyal.birger, antony, benedictwong, lorenzo,
	shannon.nelson
In-Reply-To: <20180604060910.13896-1-steffen.klassert@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Mon, 4 Jun 2018 08:09:07 +0200

> This patchset introduces new virtual xfrm interfaces.
> The design of virtual xfrm interfaces was
> discussed at the Linux IPsec workshop 2018. This patchset
> implements these interfaces as the IPsec userspace and
> kernel developers agreed. The purpose of these interfaces
> is to overcome the design limitations that the existing
> VTI devices have.
> 
> We had two presentations about xfrm interfaces at
> the workshop. Slides with further information
> can be found at the workshop homepage:
> 
> https://workshop.linux-ipsec.org/2018/

First off, you will have to describe in detail what the VTI
limitations are and how these new devices overcome them in this commit
message.

You can't just say "we discussed this over there, go take a look".

The place people "take a look" is your text here.

Second, since you didn't explain things, I have to ask.  Why is a new
special ID even necessary?  It makes the flowi bigger, and adds all of
this new logic.

All netdevs have an ifindex and you should be able to find a way to
use the ifindex of these new devices in the key somehow.

Thanks.


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Vlastimil Babka @ 2018-06-04 12:40 UTC (permalink / raw)
  To: Michal Hocko, Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim,
	santosh.shilimkar@oracle.com
In-Reply-To: <20180604062737.GA19202@dhcp22.suse.cz>

On 06/04/2018 08:27 AM, Michal Hocko wrote:
> On Fri 01-06-18 15:05:26, Qing Huang wrote:
>>
>>
>> On 6/1/2018 12:31 AM, Michal Hocko wrote:
>>> On Thu 31-05-18 19:04:46, Qing Huang wrote:
>>>>
>>>> On 5/31/2018 2:10 AM, Michal Hocko wrote:
>>>>> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>>>>>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>>>>> [...]
>>>>>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>>>>> I will have a look at it. Thanks!
>>>>> OK, so this is an example of an incremental development ;).
>>>>>
>>>>> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
>>>>> high order allocations") to avoid invoking the OOM killer. Yet this was
>>>>> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
>>>>> allocation") didn't want an excessive reclaim for non-costly orders
>>>>> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
>>>>> place which is now redundant. Should I send a patch?
>>>>>
>>>> Just curious, how about the GFP_ATOMIC flag? Would it work in a similar fashion?
>>>> We experimented with it a bit in the past but it seemed to cause other issues
>>>> in our tests. :-)
>>> GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
>>> memory reserves. So the risk is that you deplete those reserves and
>>> cause issues to other subsystems which need them as well.
>>>
>>>> By the way, we didn't encounter any OOM killer events. It seemed that the
>>>> mlx4_alloc_icm() triggered slowpath.
>>>> We still had about 2GB free memory while it was highly fragmented.
>>> The compaction was able to make a reasonable forward progress for you.
>>> But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
>>> then the OOM killer is clearly possible as long as the order is lower
>>> than 4.
>>
>> The allocation was 256KB so the order was much higher than 4. The compaction
>> seemed to be the root cause of our problem. It took too long to finish its
>> work while putting mlx4_alloc_icm to sleep in a heavily fragmented memory
>> situation. Will the NORETRY flag avoid the compaction ops and fail the 256KB
>> allocation immediately so mlx4_alloc_icm can enter the adjustable lower-order
>> allocation code path quickly?
> 
> Costly orders should only perform a light compaction attempt unless
> __GFP_RETRY_MAYFAIL is used IIRC. CCing Vlastimil. So __GFP_NORETRY
> shouldn't make any difference.

It's a bit more complicated. Costly allocations will try the light
compaction attempt first, even before reclaim. This is followed by
reclaim and a more costly compaction attempt. With __GFP_NORETRY, the
second compaction attempt is also only the light one, so the flag does
make a difference here.


* [PATCH] ath10k: use dma_zalloc_coherent instead of allocator/memset
From: YueHaibing @ 2018-06-04 12:35 UTC (permalink / raw)
  To: davem, kvalo; +Cc: netdev, linux-kernel, linux-wireless, YueHaibing

Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/wireless/ath/ath10k/wmi.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index f97ab79..72db3bd 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -5018,13 +5018,11 @@ static int ath10k_wmi_alloc_chunk(struct ath10k *ar, u32 req_id,
 	void *vaddr;
 
 	pool_size = num_units * round_up(unit_len, 4);
-	vaddr = dma_alloc_coherent(ar->dev, pool_size, &paddr, GFP_KERNEL);
+	vaddr = dma_zalloc_coherent(ar->dev, pool_size, &paddr, GFP_KERNEL);
 
 	if (!vaddr)
 		return -ENOMEM;
 
-	memset(vaddr, 0, pool_size);
-
 	ar->wmi.mem_chunks[idx].vaddr = vaddr;
 	ar->wmi.mem_chunks[idx].paddr = paddr;
 	ar->wmi.mem_chunks[idx].len = pool_size;
-- 
2.7.0


* Re: [net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Dmitry Vyukov via Virtualization @ 2018-06-04 12:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization, Guenter Roeck
In-Reply-To: <20180530055704-mutt-send-email-mst@kernel.org>

On Wed, May 30, 2018 at 5:01 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, May 29, 2018 at 03:19:08PM -0700, Guenter Roeck wrote:
>> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
>> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
>> > so it should be allocated with kzalloc() to ensure all structure padding
>> > is zeroed.
>> >
>> > Signed-off-by: Kevin Easton <kevin@guarana.org>
>> > Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
>>
>> Is this patch going anywhere ?
>>
>> The patch fixes CVE-2018-1118. It would be useful to understand if and when
>> this problem is going to be fixed.
>>
>> Thanks,
>> Guenter
>> > ---
>> >  drivers/vhost/vhost.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> > index f3bd8e9..1b84dcff 100644
>> > --- a/drivers/vhost/vhost.c
>> > +++ b/drivers/vhost/vhost.c
>> > @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>> >  /* Create a new message. */
>> >  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>> >  {
>> > -   struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
>> > +   struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
>> >     if (!node)
>> >             return NULL;
>> >     node->vq = vq;
>
> As I pointed out, we don't need to init the whole structure. The proper
> fix is thus (I think) below.
>
> Could you use your testing infrastructure to confirm this fixes the issue?

Hi Michael,

syzbot is self-service, see:

https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches


> Thanks!
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e941224..58d9aec90afb 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2342,6 +2342,9 @@ struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>         struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
>         if (!node)
>                 return NULL;
> +
> +       /* Make sure all padding within the structure is initialized. */
> +       memset(&node->msg, 0, sizeof node->msg);
>         node->vq = vq;
>         node->msg.type = type;
>         return node;
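
The padding concern behind both proposed fixes can be sketched in userspace.
The struct below is a hypothetical stand-in for a message that gets copied
out wholesale; it is not struct vhost_msg. The sketch clears the message
before filling in its fields, so the bytes between the members do not carry
stale data. It assumes, as GCC and Clang do in practice, that a later member
store leaves the padding bytes untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout: 'type' is typically followed by padding. */
struct demo_msg {
	uint8_t  type;
	uint64_t value;
};

/* Clear the whole message first, then set the fields that are used. */
static void demo_init_msg(struct demo_msg *msg, uint8_t type, uint64_t value)
{
	assert(msg);
	memset(msg, 0, sizeof(*msg));
	msg->type = type;
	msg->value = value;
}

/* Returns 1 if every byte between 'type' and 'value' is zero. */
static int demo_padding_is_zero(const struct demo_msg *msg)
{
	unsigned char raw[sizeof(*msg)];
	size_t i;

	/* Look at the object the way a raw copy to another context would. */
	memcpy(raw, msg, sizeof(raw));
	for (i = offsetof(struct demo_msg, type) + sizeof(msg->type);
	     i < offsetof(struct demo_msg, value); i++)
		if (raw[i] != 0)
			return 0;
	return 1;
}
```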
>


* Re: [PATCH] samples/bpf: Add xdp_sample_pkts example
From: Daniel Borkmann @ 2018-06-04 12:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Song Liu; +Cc: Networking
In-Reply-To: <abd6bb15-175d-baf4-6ac9-d04d1baa0ebe@iogearbox.net>

On 06/02/2018 06:22 AM, Daniel Borkmann wrote:
> On 05/31/2018 11:44 AM, Toke Høiland-Jørgensen wrote:
>> Song Liu <liu.song.a23@gmail.com> writes:
>>
>>> On Wed, May 30, 2018 at 9:45 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>> This adds an example program showing how to sample packets from XDP using
>>>> the perf event buffer. The example userspace program just prints the
>>>> ethernet header for every packet sampled.
>>>>
>>>> Most of the userspace code is borrowed from other examples, most notably
>>>> trace_output.
>>>>
>>>> Note that the example only works when everything runs on CPU0; so
>>>> suitable smp_affinity needs to be set on the device. Some drivers seem
>>>> to reset smp_affinity when loading an XDP program, so it may be
>>>> necessary to change it after starting the example userspace program.
>>>
>>> Why does this only work when everything runs on CPU0? Is this
>>> something we can improve?
>>
>> Yeah, good question. Basically, the call from XDP to
>> bpf_perf_event_output() will fail with -EOPNOTSUPP. I tracked this down
>> to this if statement in __bpf_perf_event_output() in bpf_trace.c:
>>
>>> 	if (unlikely(event->oncpu != cpu))
>>> 		return -EOPNOTSUPP;
>>
>> I *think* that the way to fix this is for the userspace program to open
>> a perf file descriptor for each CPU in the system and poll all of them,
>> in which case the XDP program can pass the BPF_F_CURRENT_CPU flag to
>> access the right one.
> That is correct, you need one perf fd per cpu, and map them accordingly
> into the map slots when you use BPF_F_CURRENT_CPU.

Given this is a sample that users are likely to copy from, I think it would
be great if you could fix this up so you can just pass in BPF_F_CURRENT_CPU
eventually. Thanks for working on this, Toke!
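
As a rough sketch of the "one perf fd per cpu" part, the userspace side could
open one software event per CPU along these lines. The attribute values
mirror what the existing trace_output sample uses, plus a sample period and
wakeup count that are an assumption of this sketch; the demo_ names are
hypothetical. Storing each fd in the perf event array slot for its CPU is
left out.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill in the attributes for a BPF output event. */
static void demo_fill_attr(struct perf_event_attr *attr)
{
	assert(attr);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_BPF_OUTPUT;
	attr->sample_type = PERF_SAMPLE_RAW;
	attr->sample_period = 1;	/* assumption: report every event */
	attr->wakeup_events = 1;	/* assumption: wake the poller per event */
}

/*
 * Open one event per CPU (pid -1, cpu N, no group, no flags).
 * fds[cpu] is -1 where the open failed. Returns the number opened.
 */
int demo_open_per_cpu(int *fds, int ncpus)
{
	struct perf_event_attr attr;
	int cpu, opened = 0;

	assert(fds && ncpus > 0);
	demo_fill_attr(&attr);

	for (cpu = 0; cpu < ncpus; cpu++) {
		fds[cpu] = (int)syscall(__NR_perf_event_open, &attr,
					-1, cpu, -1, 0);
		if (fds[cpu] >= 0)
			opened++;
	}
	return opened;
}
```

Opening the events normally needs sufficient privileges, so a caller should
expect some or all of the fds to come back as -1 in a restricted environment.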


* Re: bpf_redirect_map not working after tail call
From: Daniel Borkmann via iovisor-dev @ 2018-06-04 12:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, iovisor-dev, Daniel Borkmann
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20180604130435.27d29431-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 06/04/2018 01:04 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> On Fri, 1 Jun 2018 14:15:58 +0200
> Sebastiano Miano via iovisor-dev <iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
> 
>> Dear all,
>>
>> We have noticed that the bpf_redirect_map returns an error when it is
>> called after a tail call.
>> The xdp_redirect_map program (under samples/bpf) works fine, but if we
>> modify it as shown in the following diff, it doesn't work anymore.
>> I have debugged it with the xdp_monitor application and the error
>> returned is EFAULT.
>> Is this a known issue? Am I doing something wrong?
> 
> Argh, this is likely an issue/bug due to the check xdp_map_invalid(),
> that was introduced in commit 7c3001313396 ("bpf: fix ri->map_owner
> pointer on bpf_prog_realloc").
> 
> To Daniel, I don't know how to solve this, could you give some advice?
> 
> 
> 
>  static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
> 				   unsigned long aux)
>  {
> 	return (unsigned long)xdp_prog->aux != aux;
>  }
> 
>  static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
> 			       struct bpf_prog *xdp_prog)
>  {
> 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> 	unsigned long map_owner = ri->map_owner;
> 	struct bpf_map *map = ri->map;
> 	u32 index = ri->ifindex;
> 	void *fwd = NULL;
> 	int err;
> 
> 	[...]
> 	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
> 		err = -EFAULT;
> 		map = NULL;
> 		goto err;
> 	}
> 	[...]

Argh, I see the issue. Working on a fix after checking the syzkaller reports.

Thanks for the report!
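
One plausible reading of the failure, not confirmed in this thread, is that the redirect helper records the aux pointer of the program that called it, while the driver later validates against the program attached to the device. The model below is a sketch under that assumption; every `rdemo_*` name is hypothetical and only the comparison itself is taken from the quoted xdp_map_invalid().

```c
#include <errno.h>
#include <stdbool.h>

struct rdemo_aux {
	int unused;	/* only the address of this object matters */
};

struct rdemo_prog {
	struct rdemo_aux *aux;
};

struct rdemo_redirect_info {
	unsigned long map_owner;
};

/* Same comparison as the quoted xdp_map_invalid(). */
static bool rdemo_map_invalid(const struct rdemo_prog *prog,
			      unsigned long aux)
{
	return (unsigned long)prog->aux != aux;
}

/* Helper side (assumption): remember who asked for the redirect. */
static void rdemo_redirect_map(struct rdemo_redirect_info *ri,
			       const struct rdemo_prog *caller)
{
	ri->map_owner = (unsigned long)caller->aux;
}

/* Driver side (assumption): validate against the attached program. */
static int rdemo_do_redirect(const struct rdemo_redirect_info *ri,
			     const struct rdemo_prog *attached)
{
	if (rdemo_map_invalid(attached, ri->map_owner))
		return -EFAULT;
	return 0;
}
```

If the entry program calls the helper itself, the owner matches and the redirect proceeds. If a tail-called program calls it, the recorded owner differs from the attached program and the model returns -EFAULT, the error seen with xdp_monitor.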

^ permalink raw reply

* Re: [PATCH net] sctp: not allow to set rto_min with a value below 200 msecs
From: Xin Long @ 2018-06-04 12:15 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Marcelo Ricardo Leitner, Neal Cardwell, Michael Tuexen,
	Neil Horman, Netdev, linux-sctp, David Miller, David Ahern,
	Eric Dumazet, syzkaller
In-Reply-To: <CACT4Y+bVvSsQm0hywC-_UqnJKhBeQryZZBjYaiWszvQGURS=vA@mail.gmail.com>

On Mon, Jun 4, 2018 at 4:34 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Tue, May 29, 2018 at 7:45 PM, Xin Long <lucien.xin@gmail.com> wrote:
>> On Wed, May 30, 2018 at 1:06 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>>> On Tue, May 29, 2018 at 12:03:46PM -0400, Neal Cardwell wrote:
>>>> On Tue, May 29, 2018 at 11:45 AM Marcelo Ricardo Leitner <
>>>> marcelo.leitner@gmail.com> wrote:
>>>> > - patch2 - fix rtx attack vector
>>>> >    - Add the floor value to rto_min to HZ/20 (which fits the values
>>>> >      that Michael shared on the other email)
>>>>
>>>> I would encourage allowing minimum RTO values down to 5ms, if the ACK
>>>> policy in the receiver makes this feasible. Our experience is that in
>>>> datacenter environments it can be advantageous to allow timer-based loss
>>>> recoveries using timeout values as low as 5ms, e.g.:
>>>
>>> Thanks Neal. On Xin's tests, the heartbeat timer becomes an issue at
>>> ~25ms already. Xin, can you share more details on the hw, which CPU
>>> was used?
>
> Hi,
>
> Did we reach any decision on this? This continues to produce bug
> reports on syzbot.
I will post a patch later today for the suggestion:
- patch1 - fix issue at hand
  - Use the max_t above
to fix this.


As for patch2 and patch3:
- patch2 - fix rtx attack vector
  - Add the floor value to rto_min to HZ/20 (which fits the values
    that Michael shared on the other email)
- patch3 - speed up initial HB again
  - change sctp_cmd_hb_timers_start() so hb timers are kickstarted
    when the association is established. AFAICT RFC doesn't specify
    when these initial ones should be sent, and I see no issues with
    speeding them up.

These are more like improvements; we will do them in the future after
getting more information.
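
The floor proposed for patch2 can be sketched as a simple clamp. This is a minimal illustration, not the SCTP code: `RTODEMO_HZ` is assumed to be 1000 so that HZ/20 equals 50 ticks (50 ms), and the `rtodemo_*` helpers are hypothetical names.

```c
/* Assumption for this sketch: 1000 timer ticks per second. */
#define RTODEMO_HZ		1000UL
/* Floor suggested in the thread for rto_min: HZ/20. */
#define RTODEMO_RTO_MIN_FLOOR	(RTODEMO_HZ / 20)

/* Convert a millisecond value from userspace into timer ticks. */
static unsigned long rtodemo_msecs_to_ticks(unsigned long ms)
{
	return ms * RTODEMO_HZ / 1000UL;
}

/* Return the rto_min that would actually be used: the requested value,
 * raised to the floor when the request is smaller than the floor.
 */
static unsigned long rtodemo_apply_rto_min(unsigned long requested_ms)
{
	unsigned long ticks = rtodemo_msecs_to_ticks(requested_ms);

	return ticks < RTODEMO_RTO_MIN_FLOOR ? RTODEMO_RTO_MIN_FLOOR : ticks;
}
```

With these assumptions a request of 1 ms or 5 ms is raised to 50 ticks, while a request of 200 ms is kept as is.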


>
> I am not sure whom you are asking, because Xin is you unless I am
> missing something :)
> But if you mean syzbot hardware, then it's GCE VMs with modern Intel
> CPUs but an important aspect is a heavy-debug config (which you can
> take from here https://syzkaller.appspot.com/bug?extid=3dcd59a1f907245f891f)
> and systematic bug reporting. So if it's at all flaky in your testing, it
> will produce dozens of bug emails on syzbot.
>
>
>> It was on a KVM guest,  "-smp 2,cores=1,threads=1,sockets=2"
>> # lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                2
>> On-line CPU(s) list:   0,1
>> Thread(s) per core:    1
>> Core(s) per socket:    1
>> Socket(s):             2
>> NUMA node(s):          1
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 13
>> Model name:            QEMU Virtual CPU version 1.5.3
>> Stepping:              3
>> CPU MHz:               2397.222
>> BogoMIPS:              4794.44
>> Hypervisor vendor:     KVM
>> Virtualization type:   full
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              4096K
>> NUMA node0 CPU(s):     0,1
>> Flags:                 fpu de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good
>> nopl cpuid pni cx16 hypervisor lahf_lm abm pti
>>
>> If we're counting on max_t to fix this stuck CPU, it should not matter
>> that much if min rto < the value causing it to get stuck.
>>
>>>
>>> Anyway, how about we add a floor to rto_max too, so that RTO can
>>> actually grow into something bigger that doesn't hog the CPU? Like:
>>> rto_min floor = 5ms
>>> rto_max floor = 50ms
>>>
>>>   Marcelo

^ permalink raw reply

* [PATCH iproute2 1/2] ip: display netns name instead of nsid
From: Nicolas Dichtel @ 2018-06-04 12:12 UTC (permalink / raw)
  To: stephen; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <20180604121253.2140-1-nicolas.dichtel@6wind.com>

When iproute2 has a name for the nsid, let's display it. It's more
user friendly than a number.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 ip/ip_common.h |  1 +
 ip/ipaddress.c | 23 ++++++++++++++++++-----
 ip/ipnetns.c   | 10 ++++++++++
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 49eb7d7bed40..794478c546cd 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -60,6 +60,7 @@ void netns_map_init(void);
 void netns_nsid_socket_init(void);
 int print_nsid(const struct sockaddr_nl *who,
 	       struct nlmsghdr *n, void *arg);
+char *get_name_from_nsid(int nsid);
 int do_ipaddr(int argc, char **argv);
 int do_ipaddrlabel(int argc, char **argv);
 int do_iproute(int argc, char **argv);
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index c7c7e7df4e81..aee09c7ff6df 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -819,6 +819,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	unsigned int m_flag = 0;
 	SPRINT_BUF(b1);
 
+	netns_nsid_socket_init();
+	netns_map_init();
+
 	if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
 		return 0;
 
@@ -955,10 +958,16 @@ int print_linkinfo(const struct sockaddr_nl *who,
 		if (is_json_context()) {
 			print_int(PRINT_JSON, "link_netnsid", NULL, id);
 		} else {
-			if (id >= 0)
-				print_int(PRINT_FP, NULL,
-					  " link-netnsid %d", id);
-			else
+			if (id >= 0) {
+				char *name = get_name_from_nsid(id);
+
+				if (name)
+					print_string(PRINT_FP, NULL,
+						     " link-netns %s", name);
+				else
+					print_int(PRINT_FP, NULL,
+						  " link-netnsid %d", id);
+			} else
 				print_string(PRINT_FP, NULL,
 					     " link-netnsid %s", "unknown");
 		}
@@ -966,8 +975,12 @@ int print_linkinfo(const struct sockaddr_nl *who,
 
 	if (tb[IFLA_NEW_NETNSID]) {
 		int id = rta_getattr_u32(tb[IFLA_NEW_NETNSID]);
+		char *name = get_name_from_nsid(id);
 
-		print_int(PRINT_FP, NULL, " new-nsid %d", id);
+		if (name)
+			print_string(PRINT_FP, NULL, " new-netns %s", name);
+		else
+			print_int(PRINT_FP, NULL, " new-netnsid %d", id);
 	}
 	if (tb[IFLA_NEW_IFINDEX]) {
 		int id = rta_getattr_u32(tb[IFLA_NEW_IFINDEX]);
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index e06100f4ad2d..a4f5b02427e7 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -169,6 +169,16 @@ static struct nsid_cache *netns_map_get_by_nsid(int nsid)
 	return NULL;
 }
 
+char *get_name_from_nsid(int nsid)
+{
+	struct nsid_cache *c = netns_map_get_by_nsid(nsid);
+
+	if (c)
+		return c->name;
+
+	return NULL;
+}
+
 static int netns_map_add(int nsid, const char *name)
 {
 	struct nsid_cache *c;
-- 
2.15.1
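
The display logic added to print_linkinfo() can be summarised with a standalone model: print the name when the nsid is known to the cache, fall back to the number otherwise, and print "unknown" for a negative id. The `nsfmt_*` names and the flat lookup table are hypothetical stand-ins for iproute2's nsid cache, so treat this as a sketch only.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy replacement for the nsid cache: a flat table of id/name pairs. */
struct nsfmt_entry {
	int nsid;
	const char *name;
};

/* Return the name registered for nsid, or NULL when there is none. */
static const char *nsfmt_name_from_nsid(const struct nsfmt_entry *tab,
					size_t n, int nsid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].nsid == nsid)
			return tab[i].name;
	return NULL;
}

/* Format the link netns field the way the patch prints it. */
static int nsfmt_link_netns(char *buf, size_t len,
			    const struct nsfmt_entry *tab, size_t n, int id)
{
	const char *name;

	if (id < 0)
		return snprintf(buf, len, " link-netnsid %s", "unknown");

	name = nsfmt_name_from_nsid(tab, n, id);
	if (name)
		return snprintf(buf, len, " link-netns %s", name);

	return snprintf(buf, len, " link-netnsid %d", id);
}
```

Keeping the numeric fallback means that an nsid without an iproute2 name is still shown to the user.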

^ permalink raw reply related

* [PATCH iproute2 0/2] display netns name instead of nsid
From: Nicolas Dichtel @ 2018-06-04 12:12 UTC (permalink / raw)
  To: stephen; +Cc: netdev
In-Reply-To: <20180531114615.3f10766f@shemminger-XPS-13-9360>


After these patches, the iproute2 name of netns is displayed instead of
the nsid. It's easier to read/understand.

 ip/ip_common.h |  3 +++
 ip/ipaddress.c | 23 ++++++++++++++++++-----
 ip/iplink.c    | 18 ++++++++++++++++--
 ip/ipnetns.c   | 18 ++++++++++++++++--
 4 files changed, 53 insertions(+), 9 deletions(-)

Comments are welcome,
Regards,
Nicolas

^ permalink raw reply

* [PATCH iproute2 2/2] iplink: enable to specify a name for the link-netns
From: Nicolas Dichtel @ 2018-06-04 12:12 UTC (permalink / raw)
  To: stephen; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <20180604121253.2140-1-nicolas.dichtel@6wind.com>

The 'link-netnsid' argument needs a number. Add 'link-netns' when the user
wants to use the iproute2 netns name instead of the nsid.

Example:
ip link add ipip1 link-netns foo type ipip remote 10.16.0.121 local 10.16.0.249

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 ip/ip_common.h |  2 ++
 ip/iplink.c    | 18 ++++++++++++++++--
 ip/ipnetns.c   |  8 ++++++--
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 794478c546cd..4d3227cbc389 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -61,6 +61,8 @@ void netns_nsid_socket_init(void);
 int print_nsid(const struct sockaddr_nl *who,
 	       struct nlmsghdr *n, void *arg);
 char *get_name_from_nsid(int nsid);
+int get_netnsid_from_name(const char *name);
+int set_netnsid_from_name(const char *name, int nsid);
 int do_ipaddr(int argc, char **argv);
 int do_ipaddrlabel(int argc, char **argv);
 int do_iproute(int argc, char **argv);
diff --git a/ip/iplink.c b/ip/iplink.c
index 9ff5f692a1d4..e4d4da96aedb 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -85,7 +85,7 @@ void iplink_usage(void)
 		"	                  [ broadcast LLADDR ]\n"
 		"	                  [ mtu MTU ]\n"
 		"	                  [ netns { PID | NAME } ]\n"
-		"	                  [ link-netnsid ID ]\n"
+		"	                  [ link-netns NAME | link-netnsid ID ]\n"
 		"			  [ alias NAME ]\n"
 		"	                  [ vf NUM [ mac LLADDR ]\n"
 		"				   [ vlan VLANID [ qos VLAN-QOS ] [ proto VLAN-PROTO ] ]\n"
@@ -865,10 +865,24 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 				 IFLA_INET6_ADDR_GEN_MODE, mode);
 			addattr_nest_end(&req->n, afs6);
 			addattr_nest_end(&req->n, afs);
+		} else if (matches(*argv, "link-netns") == 0) {
+			NEXT_ARG();
+			if (link_netnsid != -1)
+				duparg("link-netns/link-netnsid", *argv);
+			link_netnsid = get_netnsid_from_name(*argv);
+			/* No nsid? Try to assign one. */
+			if (link_netnsid < 0)
+				set_netnsid_from_name(*argv, -1);
+			link_netnsid = get_netnsid_from_name(*argv);
+			if (link_netnsid < 0)
+				invarg("Invalid \"link-netns\" value\n",
+				       *argv);
+			addattr32(&req->n, sizeof(*req), IFLA_LINK_NETNSID,
+				  link_netnsid);
 		} else if (matches(*argv, "link-netnsid") == 0) {
 			NEXT_ARG();
 			if (link_netnsid != -1)
-				duparg("link-netnsid", *argv);
+				duparg("link-netns/link-netnsid", *argv);
 			if (get_integer(&link_netnsid, *argv, 0))
 				invarg("Invalid \"link-netnsid\" value\n",
 				       *argv);
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index a4f5b02427e7..fb1daade366b 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -91,7 +91,7 @@ static int ipnetns_have_nsid(void)
 	return have_rtnl_getnsid;
 }
 
-static int get_netnsid_from_name(const char *name)
+int get_netnsid_from_name(const char *name)
 {
 	struct {
 		struct nlmsghdr n;
@@ -108,6 +108,8 @@ static int get_netnsid_from_name(const char *name)
 	struct rtgenmsg *rthdr;
 	int len, fd;
 
+	netns_nsid_socket_init();
+
 	fd = netns_get_fd(name);
 	if (fd < 0)
 		return fd;
@@ -701,7 +703,7 @@ out_delete:
 	return -1;
 }
 
-static int set_netnsid_from_name(const char *name, int nsid)
+int set_netnsid_from_name(const char *name, int nsid)
 {
 	struct {
 		struct nlmsghdr n;
@@ -715,6 +717,8 @@ static int set_netnsid_from_name(const char *name, int nsid)
 	};
 	int fd, err = 0;
 
+	netns_nsid_socket_init();
+
 	fd = netns_get_fd(name);
 	if (fd < 0)
 		return fd;
-- 
2.15.1
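
The "look up, assign if missing, look up again" sequence used for link-netns can be modelled in userspace as follows. All `nsres_*` names are hypothetical, and the rule of picking the smallest unused id is an assumption of this sketch; in the patch the assignment is requested by passing -1 to set_netnsid_from_name().

```c
#include <stddef.h>
#include <string.h>

struct nsres_entry {
	const char *name;
	int nsid;	/* -1 means no nsid has been assigned yet */
};

static struct nsres_entry *nsres_find(struct nsres_entry *tab, size_t n,
				      const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return &tab[i];
	return NULL;
}

/* Return the nsid for name, or -1 when unknown or unassigned. */
static int nsres_get(struct nsres_entry *tab, size_t n, const char *name)
{
	struct nsres_entry *e = nsres_find(tab, n, name);

	return e ? e->nsid : -1;
}

/* Assign the smallest id not used by any entry (sketch assumption). */
static int nsres_assign(struct nsres_entry *tab, size_t n, const char *name)
{
	struct nsres_entry *e = nsres_find(tab, n, name);
	int candidate = 0;
	size_t i = 0;

	if (!e)
		return -1;
	if (e->nsid >= 0)
		return 0;
	while (i < n) {
		if (tab[i].nsid == candidate) {
			candidate++;
			i = 0;	/* rescan with the new candidate */
		} else {
			i++;
		}
	}
	e->nsid = candidate;
	return 0;
}

/* Mirrors the parsing branch: get, try to assign, get again. */
static int nsres_resolve(struct nsres_entry *tab, size_t n, const char *name)
{
	int id = nsres_get(tab, n, name);

	if (id < 0) {
		nsres_assign(tab, n, name);
		id = nsres_get(tab, n, name);
	}
	return id;
}
```

A negative result after the second lookup corresponds to the invarg() error path in the patch.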

^ permalink raw reply related

* [PATCH bpf-next 11/11] samples/bpf: xdpsock: use skb Tx path for XDP_SKB
From: Björn Töpel @ 2018-06-04 12:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Make sure that XDP_SKB also uses the skb Tx path.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 samples/bpf/xdpsock_user.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7494f60fbff8..d69c8d78d3fd 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -75,6 +75,7 @@ static int opt_queue;
 static int opt_poll;
 static int opt_shared_packet_buffer;
 static int opt_interval = 1;
+static u32 opt_xdp_bind_flags;
 
 struct xdp_umem_uqueue {
 	u32 cached_prod;
@@ -541,9 +542,12 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
 	sxdp.sxdp_queue_id = opt_queue;
+
 	if (shared) {
 		sxdp.sxdp_flags = XDP_SHARED_UMEM;
 		sxdp.sxdp_shared_umem_fd = umem->fd;
+	} else {
+		sxdp.sxdp_flags = opt_xdp_bind_flags;
 	}
 
 	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
@@ -699,6 +703,7 @@ static void parse_command_line(int argc, char **argv)
 			break;
 		case 'S':
 			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			opt_xdp_bind_flags |= XDP_COPY;
 			break;
 		case 'N':
 			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
-- 
2.14.1
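
The flag selection in xsk_configure() after this change can be read as a small pure function. The sketch below uses hypothetical `BINDDEMO_*` constants whose bit values are assumptions, not the definitions from the kernel headers; only the selection logic follows the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XDP_SHARED_UMEM and XDP_COPY. */
#define BINDDEMO_SHARED_UMEM	(1u << 0)
#define BINDDEMO_COPY		(1u << 1)

/* skb_mode corresponds to the -S option: it requests copy mode so that
 * the socket uses the skb Tx path. A shared-UMEM socket ignores the
 * option-derived flags and only sets the shared flag, as in the diff.
 */
static uint16_t binddemo_flags(bool shared, bool skb_mode)
{
	uint16_t opt_bind_flags = 0;

	if (skb_mode)
		opt_bind_flags |= BINDDEMO_COPY;

	return shared ? BINDDEMO_SHARED_UMEM : opt_bind_flags;
}
```

Without -S and without sharing, the flags stay 0 and the choice between copy and zero-copy is left to the kernel side.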

^ permalink raw reply related

* [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
From: Björn Töpel @ 2018-06-04 12:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
XDP Tx rings are used for zero-copy. As a result, packets that other
devices XDP_REDIRECT to an AF_XDP-enabled queue will be dropped.
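
The wakeup path behind ndo_xsk_async_xmit relies on a compare-and-swap loop that marks an already scheduled NAPI context as "missed" so it runs again. The userspace model below uses C11 atomics in place of cmpxchg(); the `NAPIDEMO_*` bit values and function name are hypothetical and chosen only for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state bits; the real NAPIF_STATE_* values may differ. */
#define NAPIDEMO_SCHED		(1UL << 0)
#define NAPIDEMO_MISSED		(1UL << 1)
#define NAPIDEMO_DISABLE	(1UL << 2)

/* Returns true when the poller is scheduled (or disabled). When it is
 * scheduled, the MISSED bit is set atomically so the poller runs once
 * more. Returns false when the caller must trigger the poller itself.
 */
static bool napidemo_is_scheduled(atomic_ulong *state)
{
	unsigned long val = atomic_load(state);
	unsigned long next;

	do {
		if (val & NAPIDEMO_DISABLE)
			return true;
		if (!(val & NAPIDEMO_SCHED))
			return false;
		next = val | NAPIDEMO_MISSED;
		/* On failure val is refreshed with the current state. */
	} while (!atomic_compare_exchange_weak(state, &val, next));

	return true;
}
```

In the patch, a false result leads to i40e_force_wb(), which triggers the interrupt that schedules NAPI.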

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++++++++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
 include/net/xdp_sock.h                      |  14 +++
 6 files changed, 242 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8c602424d339..98c18c41809d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
-	if (ring_is_xdp(ring))
+	ring->clean_tx_irq = i40e_clean_tx_irq;
+	if (ring_is_xdp(ring)) {
 		ring->xsk_umem = i40e_xsk_umem(ring);
+		if (ring->xsk_umem)
+			ring->clean_tx_irq = i40e_clean_tx_irq_zc;
+	}
 
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
@@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
 	.ndo_xdp_flush		= i40e_xdp_flush,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6b1142fbc697..923bb84a93ab 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -10,16 +10,6 @@
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
 
-static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
-				u32 td_tag)
-{
-	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
-			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
-			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
-			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
-			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
-}
-
 #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
 /**
  * i40e_fdir - Generate a Flow Director descriptor based on fdata
@@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+	/* Cleanup only needed for non XSK TX ZC rings */
+	if (!tx_ring->xsk_umem) {
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+			  unsigned int total_packets,
+			  unsigned int total_bytes)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+}
+
 #define WB_STRIDE 4
 
+void i40e_arm_wb(struct i40e_ring *tx_ring,
+		 struct i40e_vsi *vsi,
+		 int budget)
+{
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -778,8 +804,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -874,27 +900,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
-
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
 
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
+	i40e_arm_wb(tx_ring, vsi, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2467,10 +2475,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx_irq(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3595,6 +3604,12 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
+	/* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
+	 * will drop incoming packets redirected by other devices!
+	 */
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return -ENXIO;
+
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
 		return -EINVAL;
 
@@ -3633,5 +3648,11 @@ void i40e_xdp_flush(struct net_device *dev)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return;
 
+	/* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
+	 * will drop incoming packets redirected by other devices!
+	 */
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return;
+
 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index cddb185cd2f8..b9c42c352a8d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -426,6 +426,8 @@ struct i40e_ring {
 
 	int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
 	bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
+	bool (*clean_tx_irq)(struct i40e_vsi *vsi, struct i40e_ring *ring,
+			     int budget);
 	struct xdp_umem *xsk_umem;
 
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
@@ -506,6 +508,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
 void i40e_xdp_flush(struct net_device *dev);
 int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -687,6 +692,16 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 }
 
+static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
+				u32 td_tag)
+{
+	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
+			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
+			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
+			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
+			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
+}
+
 void i40e_fd_handle_status(struct i40e_ring *rx_ring,
 			   union i40e_rx_desc *rx_desc, u8 prog_id);
 int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
@@ -696,4 +711,12 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 			     u8 rx_ptype);
 void i40e_receive_skb(struct i40e_ring *rx_ring,
 		      struct sk_buff *skb, u16 vlan_tag);
+
+void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+			  unsigned int total_packets,
+			  unsigned int total_bytes);
+void i40e_arm_wb(struct i40e_ring *tx_ring,
+		 struct i40e_vsi *vsi,
+		 int budget);
+
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9d16924415b9..021fec5b5799 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -535,3 +535,143 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+/* Returns true if the work is finished */
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	unsigned int total_packets = 0, total_bytes = 0;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	bool work_done = true;
+	dma_addr_t dma;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (unlikely(!I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+			break;
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		total_packets++;
+		total_bytes += len;
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	if (total_packets > 0) {
+		/* Request an interrupt for the last frame and bump tail ptr. */
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+
+		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
+		i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
+	}
+
+	return !!budget && work_done;
+}
+
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget)
+{
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	unsigned int budget = vsi->work_limit;
+	bool work_done = true, xmit_done;
+	u32 completed_frames;
+	u32 frames_ready;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	xsk_umem_complete_tx(umem, completed_frames);
+
+	i40e_arm_wb(tx_ring, vsi, budget);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
index 757ac5ca8511..bd006f1a4397 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -13,5 +13,7 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
 void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget);
 
 #endif /* _I40E_XSK_H_ */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ec8fd3314097..63aa05abf11d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -103,6 +103,20 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
 {
 }
+
+static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+}
+
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+				       u32 *len)
+{
+	return false;
+}
+
+static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-- 
2.14.1
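
The index arithmetic in i40e_clean_tx_irq_zc() (unwrap the head, cap the work by the budget, wrap next_to_clean) can be checked in isolation. The model below keeps only that arithmetic; `txdemo_*` names are hypothetical and no UMEM or hardware state is involved.

```c
#include <stdbool.h>

struct txdemo_ring {
	unsigned int count;		/* descriptors in the ring */
	unsigned int next_to_clean;	/* first one not yet reclaimed */
};

/* Reclaim up to budget completed descriptors. head is the index the
 * hardware has written back. Returns the number reclaimed and reports
 * through work_done whether everything that was ready got cleaned.
 */
static unsigned int txdemo_clean(struct txdemo_ring *ring, unsigned int head,
				 unsigned int budget, bool *work_done)
{
	unsigned int ready, completed;

	/* Unwrap head so the subtraction below cannot underflow. */
	if (head < ring->next_to_clean)
		head += ring->count;
	ready = head - ring->next_to_clean;

	*work_done = ready <= budget;
	completed = *work_done ? ready : budget;

	ring->next_to_clean += completed;
	if (ring->next_to_clean >= ring->count)
		ring->next_to_clean -= ring->count;

	return completed;
}
```

For an 8-entry ring with next_to_clean at 6 and head at 2, four frames are ready; a budget of 3 reclaims three of them, leaves next_to_clean at 1 and reports that work remains.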

^ permalink raw reply related

* [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds initial AF_XDP zero-copy support for i40e-based
NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
to pass a UMEM to the driver. The driver will then DMA map all the
frames in the UMEM. Next, the Rx code will allocate
frames from the UMEM fill queue, instead of the regular page
allocator.

Externally, for the rest of the XDP code, the driver internal UMEM
allocator will appear as a MEM_TYPE_ZERO_COPY.

The commit also introduces completely new clean_rx_irq and allocator
functions for zero-copy, and the means (function pointers) to set the
allocator and clean_rx functions.

This first version does not support:
* passing frames to the stack via XDP_PASS (clone/copy to skb).
* doing XDP redirect to anything other than AF_XDP sockets
  (convert_to_xdp_frame does not clone the frame yet).
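
The function-pointer mechanism mentioned in the commit message can be sketched as per-ring dispatch: the Rx handler is chosen once at ring configuration time, depending on whether a UMEM is bound, and the poll loop calls through the pointer. All `rxdemo_*` names are hypothetical and the handlers only return a marker value for illustration.

```c
#include <stddef.h>

enum rxdemo_path {
	RXDEMO_PATH_REGULAR,
	RXDEMO_PATH_ZC,
};

struct rxdemo_ring;
typedef int (*rxdemo_clean_fn)(struct rxdemo_ring *ring, int budget);

struct rxdemo_ring {
	void *umem;			/* non-NULL when a UMEM is bound */
	rxdemo_clean_fn clean_rx;	/* set at configuration time */
};

/* Stand-in for the regular page-allocator based Rx cleanup. */
static int rxdemo_clean_regular(struct rxdemo_ring *ring, int budget)
{
	(void)ring;
	(void)budget;
	return RXDEMO_PATH_REGULAR;
}

/* Stand-in for the zero-copy Rx cleanup that uses UMEM frames. */
static int rxdemo_clean_zc(struct rxdemo_ring *ring, int budget)
{
	(void)ring;
	(void)budget;
	return RXDEMO_PATH_ZC;
}

/* Pick the handler once, when the ring is configured. */
static void rxdemo_configure(struct rxdemo_ring *ring)
{
	ring->clean_rx = ring->umem ? rxdemo_clean_zc : rxdemo_clean_regular;
}

/* The poll loop no longer needs to know which variant is in use. */
static int rxdemo_poll(struct rxdemo_ring *ring, int budget)
{
	return ring->clean_rx(ring, budget);
}
```

Selecting the handler at configuration time keeps the per-packet path free of a zero-copy check.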

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h      |  23 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
 include/net/xdp_sock.h                      |  19 +
 net/xdp/xdp_umem.h                          |  10 -
 9 files changed, 789 insertions(+), 146 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 14397e7e9925..50590e8d1fd1 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
 	i40e_txrx.o	\
 	i40e_ptp.o	\
 	i40e_client.o   \
-	i40e_virtchnl_pf.o
+	i40e_virtchnl_pf.o \
+	i40e_xsk.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..20955e5dce02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,12 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
+
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
@@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
 int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
 				      struct i40e_cloud_filter *filter,
 				      bool add);
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
+
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 369a116edaa1..8c602424d339 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -16,6 +17,7 @@
  */
 #define CREATE_TRACE_POINTS
 #include "i40e_trace.h"
+#include "i40e_xsk.h"
 
 const char i40e_driver_name[] = "i40e";
 static const char i40e_driver_string[] =
@@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -3180,13 +3185,30 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->clean_rx_irq = i40e_clean_rx_irq_zc;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
+		ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
+				   XDP_PACKET_HEADROOM;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+	} else {
+		ring->clean_rx_irq = i40e_clean_rx_irq;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
+		ring->rx_buf_len = vsi->rx_buf_len;
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3242,7 +3264,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -12022,7 +12044,7 @@ static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
  *
  * Returns 0 on success, <0 on failure.
  **/
-static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
 {
 	int err;
 
@@ -12047,7 +12069,7 @@ static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
  *
  * Returns 0 on success, <0 on failure.
  **/
-static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
 {
 	int err;
 
@@ -12095,6 +12117,11 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_QUERY_XSK_UMEM:
+		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5f01e4ce9c92..6b1142fbc697 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -5,6 +5,7 @@
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
  * This is used to verify if the FD programming or invalidation
  * requested by SW to the HW is successful or not and take actions accordingly.
  **/
-static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
-				  union i40e_rx_desc *rx_desc, u8 prog_id)
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id)
 {
 	struct i40e_pf *pf = rx_ring->vsi->back;
 	struct pci_dev *pdev = pf->pdev;
@@ -1246,25 +1247,6 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
 	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
 }
 
-/**
- * i40e_rx_is_programming_status - check for programming status descriptor
- * @qw: qword representing status_error_len in CPU ordering
- *
- * The value of in the descriptor length field indicate if this
- * is a programming status descriptor for flow director or FCoE
- * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
- * it is a packet descriptor.
- **/
-static inline bool i40e_rx_is_programming_status(u64 qw)
-{
-	/* The Rx filter programming status and SPH bit occupy the same
-	 * spot in the descriptor. Since we don't support packet split we
-	 * can just reuse the bit as an indication that this is a
-	 * programming status descriptor.
-	 */
-	return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
-}
-
 /**
  * i40e_clean_programming_status - clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
@@ -1373,31 +1355,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	}
 
 	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
-
-		if (!rx_bi->page)
-			continue;
+	if (!rx_ring->xsk_umem) {
+		for (i = 0; i < rx_ring->count; i++) {
+			struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
-
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+			if (!rx_bi->page)
+				continue;
 
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+			/* Invalidate cache lines that may have been
+			 * written to by device so that we avoid
+			 * corrupting memory.
+			 */
+			dma_sync_single_range_for_cpu(rx_ring->dev,
+						      rx_bi->dma,
+						      rx_bi->page_offset,
+						      rx_ring->rx_buf_len,
+						      DMA_FROM_DEVICE);
+
+			/* free resources associated with mapping */
+			dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+					     i40e_rx_pg_size(rx_ring),
+					     DMA_FROM_DEVICE,
+					     I40E_RX_DMA_ATTR);
+
+			__page_frag_cache_drain(rx_bi->page,
+						rx_bi->pagecnt_bias);
+
+			rx_bi->page = NULL;
+			rx_bi->page_offset = 0;
+		}
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -1487,27 +1473,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 	return err;
 }
 
-/**
- * i40e_release_rx_desc - Store the new tail and head values
- * @rx_ring: ring to bump
- * @val: new head index
- **/
-static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
-{
-	rx_ring->next_to_use = val;
-
-	/* update next to alloc since we have filled the ring */
-	rx_ring->next_to_alloc = val;
-
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.  (Only
-	 * applicable for weak-ordered memory model archs,
-	 * such as IA-64).
-	 */
-	wmb();
-	writel(val, rx_ring->tail);
-}
-
 /**
  * i40e_rx_offset - Return expected offset into page to access data
  * @rx_ring: Ring we are requesting offset of
@@ -1576,8 +1541,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
  * @skb: packet to send up
  * @vlan_tag: vlan tag for packet
  **/
-static void i40e_receive_skb(struct i40e_ring *rx_ring,
-			     struct sk_buff *skb, u16 vlan_tag)
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag)
 {
 	struct i40e_q_vector *q_vector = rx_ring->q_vector;
 
@@ -1804,7 +1769,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
  * order to populate the hash, checksum, VLAN, protocol, and
  * other fields within the skb.
  **/
-static inline
 void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
 			     u8 rx_ptype)
@@ -1829,46 +1793,6 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 	skb->protocol = eth_type_trans(skb, rx_ring->netdev);
 }
 
-/**
- * i40e_cleanup_headers - Correct empty headers
- * @rx_ring: rx descriptor ring packet is being transacted on
- * @skb: pointer to current skb being fixed
- * @rx_desc: pointer to the EOP Rx descriptor
- *
- * Also address the case where we are pulling data in on pages only
- * and as such no data is present in the skb header.
- *
- * In addition if skb is not at least 60 bytes we need to pad it so that
- * it is large enough to qualify as a valid Ethernet frame.
- *
- * Returns true if an error was encountered and skb was freed.
- **/
-static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
-				 union i40e_rx_desc *rx_desc)
-
-{
-	/* XDP packets use error pointer so abort at this point */
-	if (IS_ERR(skb))
-		return true;
-
-	/* ERR_MASK will only have valid bits if EOP set, and
-	 * what we are doing here is actually checking
-	 * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
-	 * the error field
-	 */
-	if (unlikely(i40e_test_staterr(rx_desc,
-				       BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
-		dev_kfree_skb_any(skb);
-		return true;
-	}
-
-	/* if eth_skb_pad returns an error the skb was freed */
-	if (eth_skb_pad(skb))
-		return true;
-
-	return false;
-}
-
 /**
  * i40e_page_is_reusable - check if any reuse is possible
  * @page: page struct to check
@@ -2177,15 +2101,11 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 	return true;
 }
 
-#define I40E_XDP_PASS 0
-#define I40E_XDP_CONSUMED 1
-#define I40E_XDP_TX 2
-
 static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
-static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
-				 struct i40e_ring *xdp_ring)
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+			  struct i40e_ring *xdp_ring)
 {
 	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
 
@@ -2214,8 +2134,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	prefetchw(xdp->data_hard_start); /* xdp_frame write */
-
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2263,15 +2181,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2284,7 +2193,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2576,7 +2485,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 820f76db251b..cddb185cd2f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,22 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
+	union {
+		struct {
+			struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
+			__u32 page_offset;
 #else
-	__u16 page_offset;
+			__u16 page_offset;
 #endif
-	__u16 pagecnt_bias;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			/* for umem */
+			void *addr;
+			u64 handle;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -414,6 +423,12 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+
+	int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
+	bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
+	struct xdp_umem *xsk_umem;
+
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -490,6 +505,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
 void i40e_xdp_flush(struct net_device *dev);
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -576,4 +592,108 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
+#define I40E_XDP_PASS 0
+#define I40E_XDP_CONSUMED 1
+#define I40E_XDP_TX 2
+
+/**
+ * i40e_release_rx_desc - Store the new tail and head values
+ * @rx_ring: ring to bump
+ * @val: new head index
+ **/
+static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
+{
+	rx_ring->next_to_use = val;
+
+	/* update next to alloc since we have filled the ring */
+	rx_ring->next_to_alloc = val;
+
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.  (Only
+	 * applicable for weak-ordered memory model archs,
+	 * such as IA-64).
+	 */
+	wmb();
+	writel(val, rx_ring->tail);
+}
+
+/**
+ * i40e_rx_is_programming_status - check for programming status descriptor
+ * @qw: qword representing status_error_len in CPU ordering
+ *
+ * The value in the descriptor length field indicates if this
+ * is a programming status descriptor for flow director or FCoE
+ * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
+ * it is a packet descriptor.
+ **/
+static inline bool i40e_rx_is_programming_status(u64 qw)
+{
+	/* The Rx filter programming status and SPH bit occupy the same
+	 * spot in the descriptor. Since we don't support packet split we
+	 * can just reuse the bit as an indication that this is a
+	 * programming status descriptor.
+	 */
+	return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
+}
+
+/**
+ * i40e_cleanup_headers - Correct empty headers
+ * @rx_ring: rx descriptor ring packet is being transacted on
+ * @skb: pointer to current skb being fixed
+ * @rx_desc: pointer to the EOP Rx descriptor
+ *
+ * Also address the case where we are pulling data in on pages only
+ * and as such no data is present in the skb header.
+ *
+ * In addition if skb is not at least 60 bytes we need to pad it so that
+ * it is large enough to qualify as a valid Ethernet frame.
+ *
+ * Returns true if an error was encountered and skb was freed.
+ **/
+static inline bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
+					struct sk_buff *skb,
+					union i40e_rx_desc *rx_desc)
+
+{
+	/* XDP packets use error pointer so abort at this point */
+	if (IS_ERR(skb))
+		return true;
+
+	/* ERR_MASK will only have valid bits if EOP set, and
+	 * what we are doing here is actually checking
+	 * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
+	 * the error field
+	 */
+	if (unlikely(i40e_test_staterr(rx_desc,
+				       BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
+		dev_kfree_skb_any(skb);
+		return true;
+	}
+
+	/* if eth_skb_pad returns an error the skb was freed */
+	if (eth_skb_pad(skb))
+		return true;
+
+	return false;
+}
+
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id);
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+			  struct i40e_ring *xdp_ring);
+void i40e_process_skb_fields(struct i40e_ring *rx_ring,
+			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
+			     u8 rx_ptype);
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag);
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
new file mode 100644
index 000000000000..9d16924415b9
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Intel Corporation. */
+
+#include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
+#include <net/xdp.h>
+
+#include "i40e.h"
+#include "i40e_txrx.h"
+
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems_used == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+	for (i = 0; i < umem->npgs; i++) {
+		dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
+					 DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->pages[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_page_attrs(dev, umem->pages[j].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		umem->pages[j].dma = 0;
+	}
+
+	return -1;
+}
+
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+
+		umem->pages[i].dma = 0;
+	}
+}
+
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems && vsi->xsk_umems[qid])
+		return -EBUSY;
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid)
+{
+	if (umem)
+		return i40e_xsk_umem_enable(vsi, umem, qid);
+
+	return i40e_xsk_umem_disable(vsi, qid);
+}
+
+static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
+				       struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+	u16 off;
+
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	off = xdp->data - xdp->data_hard_start;
+	xdp->handle += off;
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ERR_PTR(-result);
+}
+
+static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
+				struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u64 handle;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	if (!xsk_umem_peek_addr(umem, &handle)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	bi->dma = xdp_umem_get_dma(umem, handle);
+	bi->addr = xdp_umem_get_data(umem, handle);
+
+	bi->dma += umem->headroom + XDP_PACKET_HEADROOM;
+	bi->addr += umem->headroom + XDP_PACKET_HEADROOM;
+	bi->handle = handle + umem->headroom;
+
+	xsk_umem_discard_addr(umem);
+	return true;
+}
+
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+
+	do {
+		if (!i40e_alloc_frame_zc(rx_ring, bi))
+			goto no_buffers;
+
+		/* sync the buffer for use by the device */
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		/* Refresh the desc even if buffer_addrs didn't change
+		 * because each write-back erases this info.
+		 */
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+
+		cleaned_count--;
+	} while (cleaned_count);
+
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return false;
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	/* make sure to come back via polling to try again after
+	 * allocation failure
+	 */
+	return true;
+}
+
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *rx_buffer;
+
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return rx_buffer;
+}
+
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_buff)
+{
+	u64 mask = rx_ring->xsk_umem->props.chunk_mask;
+	u64 hr = rx_ring->xsk_umem->headroom;
+	u16 nta = rx_ring->next_to_alloc;
+	struct i40e_rx_buffer *new_buff;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma		= old_buff->dma & mask;
+	new_buff->addr		= (void *)((u64)old_buff->addr & mask);
+	new_buff->handle	= old_buff->handle & mask;
+
+	new_buff->dma += hr + XDP_PACKET_HEADROOM;
+	new_buff->addr += hr + XDP_PACKET_HEADROOM;
+	new_buff->handle += hr;
+}
+
+/* Called from the XDP return API in NAPI context. */
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *new_buff;
+	struct i40e_ring *rx_ring;
+	u64 mask;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	mask = rx_ring->xsk_umem->props.chunk_mask;
+
+	nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	handle &= mask;
+
+	new_buff->dma		= xdp_umem_get_dma(rx_ring->xsk_umem, handle);
+	new_buff->addr		= xdp_umem_get_data(rx_ring->xsk_umem, handle);
+	new_buff->handle	= (u64)handle;
+
+	new_buff->dma += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	new_buff->addr += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	new_buff->handle += rx_ring->xsk_umem->headroom;
+}
+
+static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
+					    struct i40e_rx_buffer *rx_buffer,
+					    struct xdp_buff *xdp)
+{
+	/* XXX implement alloc skb and copy */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	return NULL;
+}
+
+static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
+					     union i40e_rx_desc *rx_desc,
+					     u64 qw)
+{
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
+	u8 id;
+
+	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+	/* place unused page back on the ring */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->addr = NULL;
+
+	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
+		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
+
+	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
+		i40e_fd_handle_status(rx_ring, rx_desc, id);
+}
+
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool failure = false, xdp_xmit = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *rx_buffer;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+		u32 ntc;
+
+		/* return some buffers to hardware, one at a time is too slow */
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  i40e_alloc_rx_buffers_zc(rx_ring,
+							   cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+
+		/* status_error_len will always be zero for unused descriptors
+		 * because it's cleared in cleanup, and overlaps with hdr_addr
+		 * which is always zero because packet split isn't used, if the
+		 * hardware wrote DD then the length will be non-zero
+		 */
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		if (unlikely(i40e_rx_is_programming_status(qword))) {
+			i40e_clean_programming_status_zc(rx_ring, rx_desc,
+							 qword);
+			cleaned_count++;
+			continue;
+		}
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
+
+		/* retrieve a buffer from the ring */
+		xdp.data = rx_buffer->addr;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = rx_buffer->handle;
+
+		skb = i40e_run_xdp_zc(rx_ring, &xdp);
+
+		if (IS_ERR(skb)) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
+				xdp_xmit = true;
+			else
+				i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+			total_rx_bytes += size;
+			total_rx_packets++;
+		} else {
+			skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
+			if (!skb) {
+				rx_ring->rx_stats.alloc_buff_failed++;
+				break;
+			}
+		}
+
+		rx_buffer->addr = NULL;
+		cleaned_count++;
+
+		/* don't care about non-EOP frames in XDP mode */
+		ntc = rx_ring->next_to_clean + 1;
+		ntc = (ntc < rx_ring->count) ? ntc : 0;
+		rx_ring->next_to_clean = ntc;
+		prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+		if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
+			skb = NULL;
+			continue;
+		}
+
+		/* probably a little skewed due to removing CRC */
+		total_rx_bytes += skb->len;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+
+		/* populate checksum, VLAN, and protocol */
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+		skb = NULL;
+
+		/* update budget accounting */
+		total_rx_packets++;
+	}
+
+	if (xdp_xmit) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+		xdp_do_flush_map();
+	}
+
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+
+	/* guarantee a trip back through this routine if there was a failure */
+	return failure ? budget : (int)total_rx_packets;
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
new file mode 100644
index 000000000000..757ac5ca8511
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef _I40E_XSK_H_
+#define _I40E_XSK_H_
+
+struct i40e_vsi;
+struct xdp_umem;
+struct zero_copy_allocator;
+
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+
+#endif /* _I40E_XSK_H_ */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 9fe472f2ac95..ec8fd3314097 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -94,6 +94,25 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return false;
 }
+
+static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
+{
+	return NULL;
+}
+
+static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
+{
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index f11560334f88..c8be1ad3eb88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -8,16 +8,6 @@
 
 #include <net/xdp_sock.h>
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
-}
-
-static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
-}
-
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
-- 
2.14.1
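
The xdp_umem_get_data() and xdp_umem_get_dma() helpers that this patch moves
into include/net/xdp_sock.h split a umem address into a page index
(addr >> PAGE_SHIFT) and an in-page offset (addr & (PAGE_SIZE - 1)). A minimal
user-space sketch of that arithmetic follows. It assumes 4 KiB pages, and the
names MODEL_PAGE_SHIFT, MODEL_PAGE_SIZE, struct page_model and
model_umem_get_dma() are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SHIFT and PAGE_SIZE. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ull << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for one entry of umem->pages[]. */
struct page_model {
	uint64_t dma;	/* bus address of the start of the page */
};

/* Same arithmetic as xdp_umem_get_dma(): look up the page by index,
 * then add the offset of addr within that page.
 */
static uint64_t model_umem_get_dma(const struct page_model *pages,
				   uint64_t addr)
{
	return pages[addr >> MODEL_PAGE_SHIFT].dma +
	       (addr & (MODEL_PAGE_SIZE - 1));
}
```

With two pages mapped at 0x10000 and 0x80000, umem address 0x1234 falls in
page 1 at offset 0x234, so the sketch returns 0x80234.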

* [PATCH bpf-next 08/11] i40e: added queue pair disable/enable functions
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

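Add the plumbing to disable and enable a single queue pair: interrupt
control, ring control, NAPI control, ring cleanup and statistics reset.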

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 251 ++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)
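
In the diff below, i40e_queue_pair_disable() takes the __I40E_CONFIG_BUSY bit
through i40e_enter_busy_conf() and i40e_queue_pair_enable() releases it through
i40e_exit_busy_conf(), so the two calls bracket a reconfiguration. A minimal
user-space sketch of that bounded test-and-set pattern follows. It assumes a
C11 atomic_flag in place of the bit in pf->state, and it does not sleep
between attempts as the driver does. The names config_busy,
model_enter_busy_conf() and model_exit_busy_conf() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the __I40E_CONFIG_BUSY bit in pf->state. */
static atomic_flag config_busy = ATOMIC_FLAG_INIT;

/* Try to take the flag, giving up after max_tries failed attempts.
 * The driver uses 50 attempts with usleep_range(1000, 2000) in between;
 * this sketch retries immediately.
 */
static int model_enter_busy_conf(int max_tries)
{
	while (atomic_flag_test_and_set(&config_busy)) {
		if (--max_tries <= 0)
			return -EBUSY;
	}

	return 0;
}

/* Release the flag so that the next caller can enter. */
static void model_exit_busy_conf(void)
{
	atomic_flag_clear(&config_busy);
}
```

A second model_enter_busy_conf() call made while the flag is held returns
-EBUSY once its attempts run out, and succeeds again after
model_exit_busy_conf().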

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b5daa5c9c7de..369a116edaa1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11823,6 +11823,257 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.14.1

* [PATCH bpf-next 07/11] xsk: wire up Tx zero-copy functions
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here we add the functionality required to support zero-copy Tx, and
also expose various zero-copy related functions to the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h |  9 +++++++
 net/xdp/xdp_umem.c     | 29 +++++++++++++++++++--
 net/xdp/xdp_umem.h     |  8 +++++-
 net/xdp/xsk.c          | 70 +++++++++++++++++++++++++++++++++++++++++++++-----
 net/xdp/xsk_queue.h    | 32 ++++++++++++++++++++++-
 5 files changed, 137 insertions(+), 11 deletions(-)
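
In the diff below, xsk_umem_consume_tx() reserves a completion ring slot with
xskq_produce_addr_lazy(), which writes at prod_head without publishing it, and
xsk_umem_complete_tx() later publishes a batch with
xskq_produce_flush_addr_n(). A minimal single-threaded sketch of that
reserve-then-publish split follows. It is a simplified model under stated
assumptions: the cached cons_tail, LAZY_UPDATE_THRESHOLD and the smp_wmb()
barrier from the patch are left out, and struct ring_model, MODEL_NENTRIES and
the model_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NENTRIES 8u	/* ring size, must be a power of two */

struct ring_model {
	uint64_t desc[MODEL_NENTRIES];
	uint32_t prod_head;	/* next slot to write, private to producer */
	uint32_t prod_tail;	/* number of entries published so far */
	uint32_t producer;	/* index the consumer can see */
	uint32_t consumer;	/* index advanced by the consumer */
};

/* Free slots as seen from a given producer index. */
static uint32_t model_nb_free(const struct ring_model *r, uint32_t producer)
{
	return MODEL_NENTRIES - (producer - r->consumer);
}

/* Write an address at prod_head, but do not make it visible yet. */
static int model_produce_addr_lazy(struct ring_model *r, uint64_t addr)
{
	if (model_nb_free(r, r->prod_head) == 0)
		return -1;	/* the patch returns -ENOSPC here */

	r->desc[r->prod_head++ & (MODEL_NENTRIES - 1)] = addr;
	return 0;
}

/* Publish nb_entries previously written entries to the consumer. */
static void model_produce_flush_n(struct ring_model *r, uint32_t nb_entries)
{
	r->prod_tail += nb_entries;
	r->producer = r->prod_tail;
}
```

In this model a driver would call model_produce_addr_lazy() once per
descriptor it queues to hardware, and model_produce_flush_n() with the number
of completed descriptors when it cleans its Tx ring.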

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d93d3aac3fc9..9fe472f2ac95 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -9,6 +9,7 @@
 #include <linux/workqueue.h>
 #include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <net/sock.h>
 
@@ -42,6 +43,8 @@ struct xdp_umem {
 	struct net_device *dev;
 	u16 queue_id;
 	bool zc;
+	spinlock_t xsk_list_lock;
+	struct list_head xsk_list;
 };
 
 struct xdp_sock {
@@ -53,6 +56,8 @@ struct xdp_sock {
 	struct list_head flush_node;
 	u16 queue_id;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
+	struct list_head list;
+	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
@@ -64,8 +69,12 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+/* Used from netdev driver */
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f729d79b8d91..7eb4948a38d2 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -17,6 +17,29 @@
 
 #define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
+void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&umem->xsk_list_lock, flags);
+	list_add_rcu(&xs->list, &umem->xsk_list);
+	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+}
+
+void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
+{
+	unsigned long flags;
+
+	if (xs->dev) {
+		spin_lock_irqsave(&umem->xsk_list_lock, flags);
+		list_del_rcu(&xs->list);
+		spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
+		if (umem->zc)
+			synchronize_net();
+	}
+}
+
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags)
 {
@@ -35,7 +58,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
 	dev_hold(dev);
 
-	if (dev->netdev_ops->ndo_bpf) {
+	if (dev->netdev_ops->ndo_bpf && dev->netdev_ops->ndo_xsk_async_xmit) {
 		bpf.command = XDP_QUERY_XSK_UMEM;
 
 		rtnl_lock();
@@ -70,7 +93,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 	return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
 }
 
-void xdp_umem_clear_dev(struct xdp_umem *umem)
+static void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
 	struct netdev_bpf bpf;
 	int err;
@@ -283,6 +306,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	INIT_LIST_HEAD(&umem->xsk_list);
+	spin_lock_init(&umem->xsk_list_lock);
 
 	refcount_set(&umem->users, 1);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 674508a32a4d..f11560334f88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -13,12 +13,18 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
 }
 
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
+
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags);
-void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
+void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs);
+void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs);
 struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr);
 
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ab64bd8260ea..ddca4bf1cfc8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -21,6 +21,7 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -138,6 +139,59 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+	xskq_produce_flush_addr_n(umem->cq, nb_entries);
+}
+EXPORT_SYMBOL(xsk_umem_complete_tx);
+
+void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		xs->sk.sk_write_space(&xs->sk);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx_done);
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
+{
+	struct xdp_desc desc;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		if (!xskq_peek_desc(xs->tx, &desc))
+			continue;
+
+		if (xskq_produce_addr_lazy(umem->cq, desc.addr))
+			goto out;
+
+		*dma = xdp_umem_get_dma(umem, desc.addr);
+		*len = desc.len;
+
+		xskq_discard_desc(xs->tx);
+		rcu_read_unlock();
+		return true;
+	}
+
+out:
+	rcu_read_unlock();
+	return false;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx);
+
+static int xsk_zc_xmit(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
+
+	return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id);
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;
@@ -151,7 +205,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			    size_t total_len)
 {
-	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	u32 max_batch = TX_BATCH_SIZE;
 	struct xdp_sock *xs = xdp_sk(sk);
 	bool sent_frame = false;
@@ -161,8 +214,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	if (unlikely(!xs->tx))
 		return -ENOBUFS;
-	if (need_wait)
-		return -EOPNOTSUPP;
 
 	mutex_lock(&xs->mutex);
 
@@ -192,7 +243,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			goto out;
 		}
 
-		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		skb = sock_alloc_send_skb(sk, len, 1, &err);
 		if (unlikely(!skb)) {
 			err = -EAGAIN;
 			goto out;
@@ -235,6 +286,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
@@ -242,8 +294,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
+	if (need_wait)
+		return -EOPNOTSUPP;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -419,10 +473,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	}
 
 	xs->dev = dev;
-	xs->queue_id = sxdp->sxdp_queue_id;
-
+	xs->zc = xs->umem->zc;
+	xs->queue_id = qid;
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
+	xdp_add_sk_umem(xs->umem, xs);
 
 out_unlock:
 	if (err)
@@ -660,6 +715,7 @@ static void xsk_destruct(struct sock *sk)
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
+	xdp_del_sk_umem(xs->umem, xs);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 5246ed420a16..ef6a6f0ec949 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -11,6 +11,7 @@
 #include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
+#define LAZY_UPDATE_THRESHOLD 128
 
 struct xdp_ring {
 	u32 producer ____cacheline_aligned_in_smp;
@@ -61,9 +62,14 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
+static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
+{
+	return q->nentries - (producer - q->cons_tail);
+}
+
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = q->nentries - (producer - q->cons_tail);
+	u32 free_entries = xskq_nb_free_lazy(q, producer);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -123,6 +129,9 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
+	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
 	ring->desc[q->prod_tail++ & q->ring_mask] = addr;
 
 	/* Order producer and data */
@@ -132,6 +141,27 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 	return 0;
 }
 
+static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->prod_head++ & q->ring_mask] = addr;
+	return 0;
+}
+
+static inline void xskq_produce_flush_addr_n(struct xsk_queue *q,
+					     u32 nb_entries)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail += nb_entries;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
 static inline int xskq_reserve_addr(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
-- 
2.14.1
