Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: David Miller @ 2018-06-04 21:14 UTC (permalink / raw)
  To: zenczykowski; +Cc: maze, edumazet, netdev
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>

From: "Maciej Żenczykowski" <zenczykowski@gmail.com>
Date: Sun,  3 Jun 2018 10:47:05 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH] net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization
From: David Miller @ 2018-06-04 21:13 UTC (permalink / raw)
  To: zenczykowski; +Cc: maze, edumazet, netdev, ncardwell, ycheng, weiwan
In-Reply-To: <20180603174117.48539-1-zenczykowski@gmail.com>

From: "Maciej Żenczykowski" <zenczykowski@gmail.com>
Date: Sun,  3 Jun 2018 10:41:17 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
> to an integer.
> 
> It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
> while 2 enables timewait socket reuse only for sockets that we can
> prove are loopback connections:
>   ie. bound to 'lo' interface or where one of source or destination
>   IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.
> 
> This enables quicker reuse of ephemeral ports for loopback connections
> - where tcp_tw_reuse is 100% safe from a protocol perspective
> (this assumes no artificially induced packet loss on 'lo').
> 
> This also makes estblishing many loopback connections *much* faster
> (allocating ports out of the first half of the ephemeral port range
> is significantly faster, then allocating from the second half)
> 
> Without this change in a 32K ephemeral port space my sample program
> (it just establishes and closes [::1]:ephemeral -> [::1]:server_port
> connections in a tight loop) fails after 32765 connections in 24 seconds.
> With it enabled 50000 connections only take 4.7 seconds.
> 
> This is particularly problematic for IPv6 where we only have one local
> address and cannot play tricks with varying source IP from 127.0.0.0/8
> pool.
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next v2] qed: Add srq core support for RoCE and iWARP
From: David Miller @ 2018-06-04 21:12 UTC (permalink / raw)
  To: yuval.bason
  Cc: netdev, jgg, dledford, linux-rdma, michal.kalderon, ariel.elior
In-Reply-To: <20180603161307.29832-1-yuval.bason@cavium.com>

From: Yuval Bason <yuval.bason@cavium.com>
Date: Sun, 3 Jun 2018 19:13:07 +0300

> This patch adds support for configuring SRQ and provides the necessary
> APIs for rdma upper layer driver (qedr) to enable the SRQ feature.
> 
> Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com>
> Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
> Signed-off-by: Yuval Bason <yuval.bason@cavium.com>
> ---
> Changes from v1:
> 	- sparse warnings
> 	- replace memset with ={}

Applied, thanks.

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: Or Gerlitz @ 2018-06-04 21:11 UTC (permalink / raw)
  To: David Miller; +Cc: Jeff Kirsher, Linux Netdev List
In-Reply-To: <20180604.163017.1841938428912069611.davem@davemloft.net>

On Mon, Jun 4, 2018 at 11:30 PM, David Miller <davem@davemloft.net> wrote:
> It's open a day or two more to deal with the AF_XDP issues...

Dave,

Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be merged for
this window -- AFAIU from [1], it's still under
examination/development/research for
non Intel HWs, am I correct or this is going to get in now?

Or

[1] https://marc.info/?l=linux-netdev&m=152810546108060&w=2

^ permalink raw reply

* Re: [PATCH 0/2] net: bnx2: Fix checkpatch and clang warnings
From: David Miller @ 2018-06-04 21:07 UTC (permalink / raw)
  To: rvarsha016
  Cc: rasesh.mody, harish.patil, Dept-GELinuxNICDev, netdev,
	linux-kernel, der.herr, lukas.bulwahn
In-Reply-To: <cover.1528025568.git.rvarsha016@gmail.com>

From: Varsha Rao <rvarsha016@gmail.com>
Date: Sun,  3 Jun 2018 17:18:04 +0530

> This patchset fixes NULL comparison and extra parentheses, checkpatch
> and clang warnings.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: gemini: fix spelling mistake: "it" -> "is"
From: David Miller @ 2018-06-04 21:05 UTC (permalink / raw)
  To: yuehaibing; +Cc: linus.walleij, netdev, linux-kernel
In-Reply-To: <20180603081001.23396-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Sun, 3 Jun 2018 16:10:01 +0800

> Trivial fix to spelling mistake in gemini dev_warn message
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: chelsio: Use zeroing memory allocator instead of allocator/memset
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: yuehaibing; +Cc: santosh, netdev, linux-kernel, ganeshgr, leedom
In-Reply-To: <20180603024015.12744-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Sun, 3 Jun 2018 10:40:15 +0800

> Use dma_zalloc_coherent for allocating zeroed
> memory and remove unnecessary memset function.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next V2 2/2] cls_flower: Fix comparing of old filter mask with new filter
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: paulb
  Cc: jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid, shahark, markb,
	ogerlitz
In-Reply-To: <1528009574-63306-2-git-send-email-paulb@mellanox.com>

From: Paul Blakey <paulb@mellanox.com>
Date: Sun,  3 Jun 2018 10:06:14 +0300

> We incorrectly compare the mask and the result is that we can't modify
> an already existing rule.
> 
> Fix that by comparing correctly.
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Reported-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next V2 1/2] cls_flower: Fix missing free of rhashtable
From: David Miller @ 2018-06-04 21:04 UTC (permalink / raw)
  To: paulb
  Cc: jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid, shahark, markb,
	ogerlitz
In-Reply-To: <1528009574-63306-1-git-send-email-paulb@mellanox.com>

From: Paul Blakey <paulb@mellanox.com>
Date: Sun,  3 Jun 2018 10:06:13 +0300

> When destroying the instance, destroy the head rhashtable.
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Reported-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: skbuff.h: drop unneeded <linux/slab.h>
From: David Miller @ 2018-06-04 21:02 UTC (permalink / raw)
  To: rdunlap; +Cc: netdev, linux-kernel, akpm
In-Reply-To: <86ee9a53-3136-7a2c-e59a-57d4d078188f@infradead.org>

From: Randy Dunlap <rdunlap@infradead.org>
Date: Sat, 2 Jun 2018 21:40:19 -0700

> From: Randy Dunlap <rdunlap@infradead.org>
> 
> <linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this
> header file from skbuff.h.
> 
> <linux/skbuff.h> is currently #included in around 1200 C source and
> header files, making it the 31st most-used header file.
> 
> Build tested [allmodconfig] on 20 arch-es.
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>

Applied, thanks Randy.

^ permalink raw reply

* Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
From: Alexander Duyck @ 2018-06-04 20:53 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, mykyta.iziumtsev, John Fastabend, Willem de Bruijn,
	Michael S. Tsirkin, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, qi.z.zhang, francois.ozog, ilias.apalodimas,
	brian.brooks, Andy Gospodarek, Michael Chan
In-Reply-To: <20180604120601.18123-11-bjorn.topel@gmail.com>

On Mon, Jun 4, 2018 at 5:06 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This will result in other devices
> doing XDP_REDIRECT to an AF_XDP enabled queue will have its packets
> dropped.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++++++++++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
>  include/net/xdp_sock.h                      |  14 +++
>  6 files changed, 242 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 8c602424d339..98c18c41809d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> -       if (ring_is_xdp(ring))
> +       ring->clean_tx_irq = i40e_clean_tx_irq;
> +       if (ring_is_xdp(ring)) {
>                 ring->xsk_umem = i40e_xsk_umem(ring);
> +               if (ring->xsk_umem)
> +                       ring->clean_tx_irq = i40e_clean_tx_irq_zc;

Again, I am worried what the performance penalty on this will be given
the retpoline penalty for function pointers.

> +       }
>
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> @@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_bpf                = i40e_xdp,
>         .ndo_xdp_xmit           = i40e_xdp_xmit,
>         .ndo_xdp_flush          = i40e_xdp_flush,
> +       .ndo_xsk_async_xmit     = i40e_xsk_async_xmit,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 6b1142fbc697..923bb84a93ab 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -10,16 +10,6 @@
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
>
> -static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> -                               u32 td_tag)
> -{
> -       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> -                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> -                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> -                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> -                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> -}
> -
>  #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
>  /**
>   * i40e_fdir - Generate a Flow Director descriptor based on fdata
> @@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
>         if (!tx_ring->tx_bi)
>                 return;
>
> -       /* Free all the Tx ring sk_buffs */
> -       for (i = 0; i < tx_ring->count; i++)
> -               i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
> +       /* Cleanup only needed for non XSK TX ZC rings */
> +       if (!tx_ring->xsk_umem) {
> +               /* Free all the Tx ring sk_buffs */
> +               for (i = 0; i < tx_ring->count; i++)
> +                       i40e_unmap_and_free_tx_resource(tx_ring,
> +                                                       &tx_ring->tx_bi[i]);
> +       }
>
>         bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
>         memset(tx_ring->tx_bi, 0, bi_size);
> @@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>         }
>  }
>
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes)
> +{
> +       u64_stats_update_begin(&tx_ring->syncp);
> +       tx_ring->stats.bytes += total_bytes;
> +       tx_ring->stats.packets += total_packets;
> +       u64_stats_update_end(&tx_ring->syncp);
> +       tx_ring->q_vector->tx.total_bytes += total_bytes;
> +       tx_ring->q_vector->tx.total_packets += total_packets;
> +}
> +
>  #define WB_STRIDE 4
>
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget)
> +{
> +       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> +               /* check to see if there are < 4 descriptors
> +                * waiting to be written back, then kick the hardware to force
> +                * them to be written back in case we stay in NAPI.
> +                * In this mode on X722 we do not enable Interrupt.
> +                */
> +               unsigned int j = i40e_get_tx_pending(tx_ring, false);
> +
> +               if (budget &&
> +                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> +                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> +                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> +                       tx_ring->arm_wb = true;
> +       }
> +}
> +
>  /**
>   * i40e_clean_tx_irq - Reclaim resources after transmit completes
>   * @vsi: the VSI we care about
> @@ -778,8 +804,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>   *
>   * Returns true if there's any budget left (e.g. the clean is finished)
>   **/
> -static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> -                             struct i40e_ring *tx_ring, int napi_budget)
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget)
>  {
>         u16 i = tx_ring->next_to_clean;
>         struct i40e_tx_buffer *tx_buf;
> @@ -874,27 +900,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>
>         i += tx_ring->count;
>         tx_ring->next_to_clean = i;
> -       u64_stats_update_begin(&tx_ring->syncp);
> -       tx_ring->stats.bytes += total_bytes;
> -       tx_ring->stats.packets += total_packets;
> -       u64_stats_update_end(&tx_ring->syncp);
> -       tx_ring->q_vector->tx.total_bytes += total_bytes;
> -       tx_ring->q_vector->tx.total_packets += total_packets;
> -
> -       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> -               /* check to see if there are < 4 descriptors
> -                * waiting to be written back, then kick the hardware to force
> -                * them to be written back in case we stay in NAPI.
> -                * In this mode on X722 we do not enable Interrupt.
> -                */
> -               unsigned int j = i40e_get_tx_pending(tx_ring, false);
>
> -               if (budget &&
> -                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> -                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> -                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> -                       tx_ring->arm_wb = true;
> -       }
> +       i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
> +       i40e_arm_wb(tx_ring, vsi, budget);
>
>         if (ring_is_xdp(tx_ring))
>                 return !!budget;
> @@ -2467,10 +2475,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>          * budget and be more aggressive about cleaning up the Tx descriptors.
>          */
>         i40e_for_each_ring(ring, q_vector->tx) {
> -               if (!i40e_clean_tx_irq(vsi, ring, budget)) {
> +               if (!ring->clean_tx_irq(vsi, ring, budget)) {
>                         clean_complete = false;
>                         continue;
>                 }
> +
>                 arm_wb |= ring->arm_wb;
>                 ring->arm_wb = false;
>         }
> @@ -3595,6 +3604,12 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return -ENXIO;
> +
>         if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>                 return -EINVAL;
>
> @@ -3633,5 +3648,11 @@ void i40e_xdp_flush(struct net_device *dev)
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return;
> +
>         i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>  }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index cddb185cd2f8..b9c42c352a8d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -426,6 +426,8 @@ struct i40e_ring {
>
>         int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
>         bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> +       bool (*clean_tx_irq)(struct i40e_vsi *vsi, struct i40e_ring *ring,
> +                            int budget);
>         struct xdp_umem *xsk_umem;
>
>         struct zero_copy_allocator zca; /* ZC allocator anchor */
> @@ -506,6 +508,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                   u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
>  int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget);
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -687,6 +692,16 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>         writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
>  }
>
> +static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> +                               u32 td_tag)
> +{
> +       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> +                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> +                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> +                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> +                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> +}
> +
>  void i40e_fd_handle_status(struct i40e_ring *rx_ring,
>                            union i40e_rx_desc *rx_desc, u8 prog_id);
>  int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> @@ -696,4 +711,12 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>                              u8 rx_ptype);
>  void i40e_receive_skb(struct i40e_ring *rx_ring,
>                       struct sk_buff *skb, u16 vlan_tag);
> +
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes);
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget);
> +
>  #endif /* _I40E_TXRX_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9d16924415b9..021fec5b5799 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -535,3 +535,143 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>         return failure ? budget : (int)total_rx_packets;
>  }
>
> +/* Returns true if the work is finished */
> +static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> +{
> +       unsigned int total_packets = 0, total_bytes = 0;
> +       struct i40e_tx_buffer *tx_bi;
> +       struct i40e_tx_desc *tx_desc;
> +       bool work_done = true;
> +       dma_addr_t dma;
> +       u32 len;
> +
> +       while (budget-- > 0) {
> +               if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> +                       xdp_ring->tx_stats.tx_busy++;
> +                       work_done = false;
> +                       break;
> +               }
> +
> +               if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
> +                       break;
> +
> +               dma_sync_single_for_device(xdp_ring->dev, dma, len,
> +                                          DMA_BIDIRECTIONAL);
> +
> +               tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
> +               tx_bi->bytecount = len;
> +               tx_bi->gso_segs = 1;
> +
> +               tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> +               tx_desc->buffer_addr = cpu_to_le64(dma);
> +               tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
> +                                                       | I40E_TX_DESC_CMD_EOP,
> +                                                         0, len, 0);
> +
> +               total_packets++;
> +               total_bytes += len;
> +
> +               xdp_ring->next_to_use++;
> +               if (xdp_ring->next_to_use == xdp_ring->count)
> +                       xdp_ring->next_to_use = 0;
> +       }
> +
> +       if (total_packets > 0) {
> +               /* Request an interrupt for the last frame and bump tail ptr. */
> +               tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
> +                                                I40E_TXD_QW1_CMD_SHIFT);
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +
> +               xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> +               i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
> +       }
> +

So this code is likely vulnerable to an issue we were seeing where the
Tx was stalling and surging when xmit_more was in use. We found the
issue was that we were only setting the RS once per ring fill. As a
result the ring was either full or empty from the drivers perspective.
This ends up with poor Tx performance when it occurs. As such you may
want to set the RS bit at least twice per fill so you may want to look
at going through the lesser of either half the ring size or the
budget, then set the RS, and repeat with whatever budget you have
remaining. That way the ring should on average be 50% utilized instead
of either 100% or 0%.

> +       return !!budget && work_done;
> +}
> +
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget)
> +{
> +       struct xdp_umem *umem = tx_ring->xsk_umem;
> +       u32 head_idx = i40e_get_head(tx_ring);
> +       unsigned int budget = vsi->work_limit;
> +       bool work_done = true, xmit_done;
> +       u32 completed_frames;
> +       u32 frames_ready;
> +
> +       if (head_idx < tx_ring->next_to_clean)
> +               head_idx += tx_ring->count;
> +       frames_ready = head_idx - tx_ring->next_to_clean;
> +
> +       if (frames_ready == 0) {
> +               goto out_xmit;
> +       } else if (frames_ready > budget) {
> +               completed_frames = budget;
> +               work_done = false;
> +       } else {
> +               completed_frames = frames_ready;
> +       }
> +
> +       tx_ring->next_to_clean += completed_frames;
> +       if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
> +               tx_ring->next_to_clean -= tx_ring->count;
> +
> +       xsk_umem_complete_tx(umem, completed_frames);
> +
> +       i40e_arm_wb(tx_ring, vsi, budget);
> +
> +out_xmit:
> +       xmit_done = i40e_xmit_zc(tx_ring, budget);
> +
> +       return work_done && xmit_done;
> +}

I am not a fan of using head write-back. This code just seems shaky at
best to me. Am I understanding correctly that you are using the Tx
cleanup to transmit frames?

> +
> +/**
> + * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
> + * @n: napi context
> + *
> + * Returns true if NAPI is scheduled.
> + **/
> +static bool i40e_napi_is_scheduled(struct napi_struct *n)
> +{
> +       unsigned long val, new;
> +
> +       do {
> +               val = READ_ONCE(n->state);
> +               if (val & NAPIF_STATE_DISABLE)
> +                       return true;
> +
> +               if (!(val & NAPIF_STATE_SCHED))
> +                       return false;
> +
> +               new = val | NAPIF_STATE_MISSED;
> +       } while (cmpxchg(&n->state, val, new) != val);
> +

This code does not belong here. This is core kernel code, not anything
driver specific. Probably not needed if you drop the call below.

> +       return true;
> +}
> +
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
> +{
> +       struct i40e_netdev_priv *np = netdev_priv(dev);
> +       struct i40e_vsi *vsi = np->vsi;
> +       struct i40e_ring *ring;
> +
> +       if (test_bit(__I40E_VSI_DOWN, vsi->state))
> +               return -ENETDOWN;
> +
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return -ENXIO;
> +
> +       if (queue_id >= vsi->num_queue_pairs)
> +               return -ENXIO;
> +
> +       if (!vsi->xdp_rings[queue_id]->xsk_umem)
> +               return -ENXIO;
> +
> +       ring = vsi->xdp_rings[queue_id];
> +
> +       if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
> +               i40e_force_wb(vsi, ring->q_vector);

We really shouldn't have napi being scheduled by the Tx path.

> +
> +       return 0;
> +}

Again, more comments might be helpful here.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> index 757ac5ca8511..bd006f1a4397 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -13,5 +13,7 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
>  void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
>  bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
>  int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget);
>
>  #endif /* _I40E_XSK_H_ */
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ec8fd3314097..63aa05abf11d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -103,6 +103,20 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>  static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
>  {
>  }
> +
> +static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
> +{
> +}
> +
> +static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
> +                                      u32 *len)
> +{
> +       return false;
> +}
> +
> +static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
> +{
> +}
>  #endif /* CONFIG_XDP_SOCKETS */
>
>  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> --
> 2.14.1
>

^ permalink raw reply

* Re: [PATCH net] net/ipv6: prevent use after free in ip6_route_mpath_notify
From: Eric Dumazet @ 2018-06-04 20:45 UTC (permalink / raw)
  To: dsahern, netdev; +Cc: David Ahern, Eric Dumazet
In-Reply-To: <20180604204142.8941-1-dsahern@kernel.org>



On 06/04/2018 01:41 PM, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> syzbot reported a use-after-free:
> 
> BUG: KASAN: use-after-free in ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
> Read of size 4 at addr ffff8801bf789cf0 by task syz-executor756/4555
> 
> Fix by not setting rt_last until the it is verified the insert succeeded.
> 
> Fixes: 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
> Cc: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks David !

^ permalink raw reply

* [PATCH net] net/ipv6: prevent use after free in ip6_route_mpath_notify
From: dsahern @ 2018-06-04 20:41 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, Eric Dumazet

From: David Ahern <dsahern@gmail.com>

syzbot reported a use-after-free:

BUG: KASAN: use-after-free in ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
Read of size 4 at addr ffff8801bf789cf0 by task syz-executor756/4555

CPU: 1 PID: 4555 Comm: syz-executor756 Not tainted 4.17.0-rc7+ #78
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
 ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
 ip6_route_multipath_add+0x615/0x1910 net/ipv6/route.c:4303
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

Allocated by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
 dst_alloc+0xbb/0x1d0 net/core/dst.c:104
 __ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:361
 ip6_dst_alloc+0x29/0xb0 net/ipv6/route.c:376
 ip6_route_info_create+0x4d4/0x3a30 net/ipv6/route.c:2834
 ip6_route_multipath_add+0xc7e/0x1910 net/ipv6/route.c:4240
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

Freed by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
 dst_destroy+0x267/0x3c0 net/core/dst.c:140
 dst_release_immediate+0x71/0x9e net/core/dst.c:205
 fib6_add+0xa40/0x1650 net/ipv6/ip6_fib.c:1305
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
 ip6_route_multipath_add+0x513/0x1910 net/ipv6/route.c:4267
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 ...

The problem is that rt_last can point to a deleted route if the insert
fails.

One reproducer is to insert a route and then add a multipath route that
has a duplicate nexthop.e.g,:
    $ ip -6 ro add vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::2
    $ ip -6 ro append vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::4 nexthop via 2001:db8:1::2

Fix by not setting rt_last until the it is verified the insert succeeded.

Fixes: 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4d61736c41a..c516f8556dbe 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4263,11 +4263,15 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 
 	err_nh = NULL;
 	list_for_each_entry(nh, &rt6_nh_list, next) {
-		rt_last = nh->rt6_info;
 		err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
-		/* save reference to first route for notification */
-		if (!rt_notif && !err)
-			rt_notif = nh->rt6_info;
+		if (!err) {
+			/* save reference to last route successfully inserted */
+			rt_last = nh->rt6_info;
+
+			/* save reference to first route for notification */
+			if (!rt_notif)
+				rt_notif = nh->rt6_info;
+		}
 
 		/* nh->rt6_info is used or freed at this point, reset to NULL*/
 		nh->rt6_info = NULL;
-- 
2.11.0

^ permalink raw reply related

* Re: [bug] cxgb4: vrf stopped working with cxgb4 card
From: David Ahern @ 2018-06-04 20:35 UTC (permalink / raw)
  To: AMG Zollner Robert, ganeshgr; +Cc: netdev
In-Reply-To: <93057e04-d4ff-e16f-02ac-132501a8e08f@cloudmedia.eu>

On 6/4/18 1:14 PM, AMG Zollner Robert wrote:
> Yes, I was enslaving while the interface was up.
> 
> Just tested some of the builds that where not working earlier and they
> are working if I keep the interface down when enslaving as you suggested.
> 
> Is this the expected behavior?

Not expected from my perspective.

The VRF device cycles interfaces when they are enslaved or unenslaved to
clean up route and neighbor tables. This is a day 1 property of VRF.

I guessed that was the problem based on the commit you bisected the
problem to. If nothing else, it gives you a workaround until it is fixed.

^ permalink raw reply

* Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
From: Alexander Duyck @ 2018-06-04 20:35 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, mykyta.iziumtsev, Björn Töpel, John Fastabend,
	Willem de Bruijn, Michael S. Tsirkin, michael.lundkvist,
	Brandeburg, Jesse, Anjali Singhai Jain, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks
In-Reply-To: <20180604120601.18123-10-bjorn.topel@gmail.com>

On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This commit adds initial AF_XDP zero-copy support for i40e-based
> NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
> XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM for the driver. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> The commit also introduces a completely new clean_rx_irq/allocator
> functions for zero-copy, and means (functions pointers) to set
> allocators and clean_rx functions.
>
> This first version does not support:
> * passing frames to the stack via XDP_PASS (clone/copy to skb).
> * doing XDP redirect to other than AF_XDP sockets
>   (convert_to_xdp_frame does not clone the frame yet).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h      |  23 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++++++-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
>  include/net/xdp_sock.h                      |  19 +
>  net/xdp/xdp_umem.h                          |  10 -
>  9 files changed, 789 insertions(+), 146 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
>
> diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> index 14397e7e9925..50590e8d1fd1 100644
> --- a/drivers/net/ethernet/intel/i40e/Makefile
> +++ b/drivers/net/ethernet/intel/i40e/Makefile
> @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
>         i40e_txrx.o     \
>         i40e_ptp.o      \
>         i40e_client.o   \
> -       i40e_virtchnl_pf.o
> +       i40e_virtchnl_pf.o \
> +       i40e_xsk.o
>
>  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..20955e5dce02 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
>         /* VSI specific handlers */
>         irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +       /* AF_XDP zero-copy */
> +       struct xdp_umem **xsk_umems;
> +       u16 num_xsk_umems_used;
> +       u16 num_xsk_umems;
> +
>  } ____cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>         return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +       bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +       int qid = ring->queue_index;
> +
> +       if (ring_is_xdp(ring))
> +               qid -= ring->vsi->alloc_queue_pairs;
> +
> +       if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +               return NULL;
> +
> +       return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>  int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>                                       struct i40e_cloud_filter *filter,
>                                       bool add);
> +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
> +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
> +
>  #endif /* _I40E_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 369a116edaa1..8c602424d339 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
>  #include <linux/bpf.h>
> +#include <net/xdp_sock.h>
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -16,6 +17,7 @@
>   */
>  #define CREATE_TRACE_POINTS
>  #include "i40e_trace.h"
> +#include "i40e_xsk.h"
>
>  const char i40e_driver_name[] = "i40e";
>  static const char i40e_driver_string[] =
> @@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> +       if (ring_is_xdp(ring))
> +               ring->xsk_umem = i40e_xsk_umem(ring);
> +
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
>                 ring->atr_sample_rate = vsi->back->atr_sample_rate;
> @@ -3180,13 +3185,30 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         struct i40e_hw *hw = &vsi->back->hw;
>         struct i40e_hmc_obj_rxq rx_ctx;
>         i40e_status err = 0;
> +       int ret;
>
>         bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
>
>         /* clear the context structure first */
>         memset(&rx_ctx, 0, sizeof(rx_ctx));
>
> -       ring->rx_buf_len = vsi->rx_buf_len;
> +       ring->xsk_umem = i40e_xsk_umem(ring);
> +       if (ring->xsk_umem) {
> +               ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> +               ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
> +                                  XDP_PACKET_HEADROOM;
> +               ring->zca.free = i40e_zca_free;
> +               ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> +                                                MEM_TYPE_ZERO_COPY,
> +                                                &ring->zca);
> +               if (ret)
> +                       return ret;
> +       } else {
> +               ring->clean_rx_irq = i40e_clean_rx_irq;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
> +               ring->rx_buf_len = vsi->rx_buf_len;
> +       }

With everything that is going on with retpoline overhead I am really
wary of this. We may want to look at finding another way to do this
instead of just function pointers so that we can avoid the extra
function pointer overhead. We may want to look at a flag for this
instead of using function pointers.

>         rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
>                                     BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> @@ -3242,7 +3264,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>         writel(0, ring->tail);
>
> -       i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> +       ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>
>         return 0;
>  }
> @@ -12022,7 +12044,7 @@ static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
>   *
>   * Returns 0 on success, <0 on failure.
>   **/
> -static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
> +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
>  {
>         int err;
>
> @@ -12047,7 +12069,7 @@ static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
>   *
>   * Returns 0 on success, <0 on failure.
>   **/
> -static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
> +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
>  {
>         int err;
>
> @@ -12095,6 +12117,11 @@ static int i40e_xdp(struct net_device *dev,
>                 xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>                 xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
>                 return 0;
> +       case XDP_QUERY_XSK_UMEM:
> +               return 0;
> +       case XDP_SETUP_XSK_UMEM:
> +               return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
> +                                          xdp->xsk.queue_id);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 5f01e4ce9c92..6b1142fbc697 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -5,6 +5,7 @@
>  #include <net/busy_poll.h>
>  #include <linux/bpf_trace.h>
>  #include <net/xdp.h>
> +#include <net/xdp_sock.h>
>  #include "i40e.h"
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
> @@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
>   * This is used to verify if the FD programming or invalidation
>   * requested by SW to the HW is successful or not and take actions accordingly.
>   **/
> -static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> -                                 union i40e_rx_desc *rx_desc, u8 prog_id)
> +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> +                          union i40e_rx_desc *rx_desc, u8 prog_id)
>  {
>         struct i40e_pf *pf = rx_ring->vsi->back;
>         struct pci_dev *pdev = pf->pdev;
> @@ -1246,25 +1247,6 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
>         new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
>  }
>
> -/**
> - * i40e_rx_is_programming_status - check for programming status descriptor
> - * @qw: qword representing status_error_len in CPU ordering
> - *
> - * The value of in the descriptor length field indicate if this
> - * is a programming status descriptor for flow director or FCoE
> - * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> - * it is a packet descriptor.
> - **/
> -static inline bool i40e_rx_is_programming_status(u64 qw)
> -{
> -       /* The Rx filter programming status and SPH bit occupy the same
> -        * spot in the descriptor. Since we don't support packet split we
> -        * can just reuse the bit as an indication that this is a
> -        * programming status descriptor.
> -        */
> -       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> -}
> -
>  /**
>   * i40e_clean_programming_status - clean the programming status descriptor
>   * @rx_ring: the rx ring that has this descriptor
> @@ -1373,31 +1355,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>         }
>
>         /* Free all the Rx ring sk_buffs */
> -       for (i = 0; i < rx_ring->count; i++) {
> -               struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> -
> -               if (!rx_bi->page)
> -                       continue;
> +       if (!rx_ring->xsk_umem) {

Instead of changing the indent on all this code it would probably be
easier to just add a goto and a label to skip all this.

> +               for (i = 0; i < rx_ring->count; i++) {
> +                       struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
>
> -               /* Invalidate cache lines that may have been written to by
> -                * device so that we avoid corrupting memory.
> -                */
> -               dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                             rx_bi->dma,
> -                                             rx_bi->page_offset,
> -                                             rx_ring->rx_buf_len,
> -                                             DMA_FROM_DEVICE);
> -
> -               /* free resources associated with mapping */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> -                                    i40e_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE,
> -                                    I40E_RX_DMA_ATTR);
> -
> -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> +                       if (!rx_bi->page)
> +                               continue;
>
> -               rx_bi->page = NULL;
> -               rx_bi->page_offset = 0;
> +                       /* Invalidate cache lines that may have been
> +                        * written to by device so that we avoid
> +                        * corrupting memory.
> +                        */
> +                       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                                     rx_bi->dma,
> +                                                     rx_bi->page_offset,
> +                                                     rx_ring->rx_buf_len,
> +                                                     DMA_FROM_DEVICE);
> +
> +                       /* free resources associated with mapping */
> +                       dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> +                                            i40e_rx_pg_size(rx_ring),
> +                                            DMA_FROM_DEVICE,
> +                                            I40E_RX_DMA_ATTR);
> +
> +                       __page_frag_cache_drain(rx_bi->page,
> +                                               rx_bi->pagecnt_bias);
> +
> +                       rx_bi->page = NULL;
> +                       rx_bi->page_offset = 0;
> +               }
>         }
>
>         bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
> @@ -1487,27 +1473,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
>         return err;
>  }
>
> -/**
> - * i40e_release_rx_desc - Store the new tail and head values
> - * @rx_ring: ring to bump
> - * @val: new head index
> - **/
> -static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> -{
> -       rx_ring->next_to_use = val;
> -
> -       /* update next to alloc since we have filled the ring */
> -       rx_ring->next_to_alloc = val;
> -
> -       /* Force memory writes to complete before letting h/w
> -        * know there are new descriptors to fetch.  (Only
> -        * applicable for weak-ordered memory model archs,
> -        * such as IA-64).
> -        */
> -       wmb();
> -       writel(val, rx_ring->tail);
> -}
> -
>  /**
>   * i40e_rx_offset - Return expected offset into page to access data
>   * @rx_ring: Ring we are requesting offset of
> @@ -1576,8 +1541,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
>   * @skb: packet to send up
>   * @vlan_tag: vlan tag for packet
>   **/
> -static void i40e_receive_skb(struct i40e_ring *rx_ring,
> -                            struct sk_buff *skb, u16 vlan_tag)
> +void i40e_receive_skb(struct i40e_ring *rx_ring,
> +                     struct sk_buff *skb, u16 vlan_tag)
>  {
>         struct i40e_q_vector *q_vector = rx_ring->q_vector;
>
> @@ -1804,7 +1769,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
>   * order to populate the hash, checksum, VLAN, protocol, and
>   * other fields within the skb.
>   **/
> -static inline
>  void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>                              union i40e_rx_desc *rx_desc, struct sk_buff *skb,
>                              u8 rx_ptype)
> @@ -1829,46 +1793,6 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>         skb->protocol = eth_type_trans(skb, rx_ring->netdev);
>  }
>
> -/**
> - * i40e_cleanup_headers - Correct empty headers
> - * @rx_ring: rx descriptor ring packet is being transacted on
> - * @skb: pointer to current skb being fixed
> - * @rx_desc: pointer to the EOP Rx descriptor
> - *
> - * Also address the case where we are pulling data in on pages only
> - * and as such no data is present in the skb header.
> - *
> - * In addition if skb is not at least 60 bytes we need to pad it so that
> - * it is large enough to qualify as a valid Ethernet frame.
> - *
> - * Returns true if an error was encountered and skb was freed.
> - **/
> -static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
> -                                union i40e_rx_desc *rx_desc)
> -
> -{
> -       /* XDP packets use error pointer so abort at this point */
> -       if (IS_ERR(skb))
> -               return true;
> -
> -       /* ERR_MASK will only have valid bits if EOP set, and
> -        * what we are doing here is actually checking
> -        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> -        * the error field
> -        */
> -       if (unlikely(i40e_test_staterr(rx_desc,
> -                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> -               dev_kfree_skb_any(skb);
> -               return true;
> -       }
> -
> -       /* if eth_skb_pad returns an error the skb was freed */
> -       if (eth_skb_pad(skb))
> -               return true;
> -
> -       return false;
> -}
> -
>  /**
>   * i40e_page_is_reusable - check if any reuse is possible
>   * @page: page struct to check
> @@ -2177,15 +2101,11 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>         return true;
>  }
>
> -#define I40E_XDP_PASS 0
> -#define I40E_XDP_CONSUMED 1
> -#define I40E_XDP_TX 2
> -
>  static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>                               struct i40e_ring *xdp_ring);
>
> -static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> -                                struct i40e_ring *xdp_ring)
> +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> +                         struct i40e_ring *xdp_ring)
>  {
>         struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
>
> @@ -2214,8 +2134,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>         if (!xdp_prog)
>                 goto xdp_out;
>
> -       prefetchw(xdp->data_hard_start); /* xdp_frame write */
> -
>         act = bpf_prog_run_xdp(xdp_prog, xdp);
>         switch (act) {
>         case XDP_PASS:
> @@ -2263,15 +2181,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
>  #endif
>  }
>
> -static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> -{
> -       /* Force memory writes to complete before letting h/w
> -        * know there are new descriptors to fetch.
> -        */
> -       wmb();
> -       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> -}
> -
>  /**
>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @rx_ring: rx descriptor ring to transact packets on
> @@ -2284,7 +2193,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>   *
>   * Returns amount of work completed
>   **/
> -static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>  {
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         struct sk_buff *skb = rx_ring->skb;
> @@ -2576,7 +2485,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>         budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
>
>         i40e_for_each_ring(ring, q_vector->rx) {
> -               int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
> +               int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
>
>                 work_done += cleaned;
>                 /* if we clean as many as budgeted, we must not be done */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 820f76db251b..cddb185cd2f8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -296,13 +296,22 @@ struct i40e_tx_buffer {
>
>  struct i40e_rx_buffer {
>         dma_addr_t dma;
> -       struct page *page;
> +       union {
> +               struct {
> +                       struct page *page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> -       __u32 page_offset;
> +                       __u32 page_offset;
>  #else
> -       __u16 page_offset;
> +                       __u16 page_offset;
>  #endif
> -       __u16 pagecnt_bias;
> +                       __u16 pagecnt_bias;
> +               };
> +               struct {
> +                       /* for umem */
> +                       void *addr;
> +                       u64 handle;
> +               };

It might work better to just do this as a pair of unions. One for
page/addr and another for handle, page_offset, and pagecnt_bias.

> +       };
>  };
>
>  struct i40e_queue_stats {
> @@ -414,6 +423,12 @@ struct i40e_ring {
>
>         struct i40e_channel *ch;
>         struct xdp_rxq_info xdp_rxq;
> +
> +       int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
> +       bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> +       struct xdp_umem *xsk_umem;
> +
> +       struct zero_copy_allocator zca; /* ZC allocator anchor */
>  } ____cacheline_internodealigned_in_smp;
>
>  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> @@ -490,6 +505,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb);
>  int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                   u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -576,4 +592,108 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
>  {
>         return netdev_get_tx_queue(ring->netdev, ring->queue_index);
>  }
> +
> +#define I40E_XDP_PASS 0
> +#define I40E_XDP_CONSUMED 1
> +#define I40E_XDP_TX 2
> +
> +/**
> + * i40e_release_rx_desc - Store the new tail and head values
> + * @rx_ring: ring to bump
> + * @val: new head index
> + **/
> +static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> +{
> +       rx_ring->next_to_use = val;
> +
> +       /* update next to alloc since we have filled the ring */
> +       rx_ring->next_to_alloc = val;
> +
> +       /* Force memory writes to complete before letting h/w
> +        * know there are new descriptors to fetch.  (Only
> +        * applicable for weak-ordered memory model archs,
> +        * such as IA-64).
> +        */
> +       wmb();
> +       writel(val, rx_ring->tail);
> +}
> +
> +/**
> + * i40e_rx_is_programming_status - check for programming status descriptor
> + * @qw: qword representing status_error_len in CPU ordering
> + *
> + * The value of in the descriptor length field indicate if this
> + * is a programming status descriptor for flow director or FCoE
> + * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> + * it is a packet descriptor.
> + **/
> +static inline bool i40e_rx_is_programming_status(u64 qw)
> +{
> +       /* The Rx filter programming status and SPH bit occupy the same
> +        * spot in the descriptor. Since we don't support packet split we
> +        * can just reuse the bit as an indication that this is a
> +        * programming status descriptor.
> +        */
> +       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> +}
> +
> +/**
> + * i40e_cleanup_headers - Correct empty headers
> + * @rx_ring: rx descriptor ring packet is being transacted on
> + * @skb: pointer to current skb being fixed
> + * @rx_desc: pointer to the EOP Rx descriptor
> + *
> + * Also address the case where we are pulling data in on pages only
> + * and as such no data is present in the skb header.
> + *
> + * In addition if skb is not at least 60 bytes we need to pad it so that
> + * it is large enough to qualify as a valid Ethernet frame.
> + *
> + * Returns true if an error was encountered and skb was freed.
> + **/
> +static inline bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
> +                                       struct sk_buff *skb,
> +                                       union i40e_rx_desc *rx_desc)
> +
> +{
> +       /* XDP packets use error pointer so abort at this point */
> +       if (IS_ERR(skb))
> +               return true;
> +
> +       /* ERR_MASK will only have valid bits if EOP set, and
> +        * what we are doing here is actually checking
> +        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> +        * the error field
> +        */
> +       if (unlikely(i40e_test_staterr(rx_desc,
> +                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> +               dev_kfree_skb_any(skb);
> +               return true;
> +       }
> +
> +       /* if eth_skb_pad returns an error the skb was freed */
> +       if (eth_skb_pad(skb))
> +               return true;
> +
> +       return false;
> +}
> +
> +static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> +{
> +       /* Force memory writes to complete before letting h/w
> +        * know there are new descriptors to fetch.
> +        */
> +       wmb();
> +       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> +}
> +
> +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> +                          union i40e_rx_desc *rx_desc, u8 prog_id);
> +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> +                         struct i40e_ring *xdp_ring);
> +void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> +                            union i40e_rx_desc *rx_desc, struct sk_buff *skb,
> +                            u8 rx_ptype);
> +void i40e_receive_skb(struct i40e_ring *rx_ring,
> +                     struct sk_buff *skb, u16 vlan_tag);
>  #endif /* _I40E_TXRX_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> new file mode 100644
> index 000000000000..9d16924415b9
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -0,0 +1,537 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2018 Intel Corporation. */
> +
> +#include <linux/bpf_trace.h>
> +#include <net/xdp_sock.h>
> +#include <net/xdp.h>
> +
> +#include "i40e.h"
> +#include "i40e_txrx.h"
> +
> +static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
> +{
> +       if (vsi->xsk_umems)
> +               return 0;
> +
> +       vsi->num_xsk_umems_used = 0;
> +       vsi->num_xsk_umems = vsi->alloc_queue_pairs;
> +       vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
> +                                GFP_KERNEL);
> +       if (!vsi->xsk_umems) {
> +               vsi->num_xsk_umems = 0;
> +               return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                            u16 qid)
> +{
> +       int err;
> +
> +       err = i40e_alloc_xsk_umems(vsi);
> +       if (err)
> +               return err;
> +
> +       vsi->xsk_umems[qid] = umem;
> +       vsi->num_xsk_umems_used++;
> +
> +       return 0;
> +}
> +
> +static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
> +{
> +       vsi->xsk_umems[qid] = NULL;
> +       vsi->num_xsk_umems_used--;
> +
> +       if (vsi->num_xsk_umems == 0) {
> +               kfree(vsi->xsk_umems);
> +               vsi->xsk_umems = NULL;
> +               vsi->num_xsk_umems = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i, j;
> +       dma_addr_t dma;
> +
> +       dev = &pf->pdev->dev;
> +       for (i = 0; i < umem->npgs; i++) {
> +               dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
> +                                        DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               if (dma_mapping_error(dev, dma))
> +                       goto out_unmap;
> +
> +               umem->pages[i].dma = dma;
> +       }
> +
> +       return 0;
> +
> +out_unmap:
> +       for (j = 0; j < i; j++) {
> +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               umem->pages[i].dma = 0;
> +       }
> +
> +       return -1;
> +}
> +
> +static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->npgs; i++) {
> +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +
> +               umem->pages[i].dma = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                               u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       if (qid >= vsi->num_queue_pairs)
> +               return -EINVAL;
> +
> +       if (vsi->xsk_umems && vsi->xsk_umems[qid])
> +               return -EBUSY;
> +
> +       err = i40e_xsk_umem_dma_map(vsi, umem);
> +       if (err)
> +               return err;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = i40e_add_xsk_umem(vsi, umem, qid);
> +       if (err)
> +               return err;
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
> +           !vsi->xsk_umems[qid])
> +               return -EINVAL;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
> +       i40e_remove_xsk_umem(vsi, qid);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                       u16 qid)
> +{
> +       if (umem)
> +               return i40e_xsk_umem_enable(vsi, umem, qid);
> +
> +       return i40e_xsk_umem_disable(vsi, qid);
> +}
> +
> +static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
> +                                      struct xdp_buff *xdp)
> +{
> +       int err, result = I40E_XDP_PASS;
> +       struct i40e_ring *xdp_ring;
> +       struct bpf_prog *xdp_prog;
> +       u32 act;
> +       u16 off;
> +
> +       rcu_read_lock();
> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> +       off = xdp->data - xdp->data_hard_start;
> +       xdp->handle += off;
> +       switch (act) {
> +       case XDP_PASS:
> +               break;
> +       case XDP_TX:
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +               result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> +               break;
> +       case XDP_REDIRECT:
> +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +               break;
> +       default:
> +               bpf_warn_invalid_xdp_action(act);
> +       case XDP_ABORTED:
> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> +               /* fallthrough -- handle aborts by dropping packet */
> +       case XDP_DROP:
> +               result = I40E_XDP_CONSUMED;
> +               break;
> +       }
> +
> +       rcu_read_unlock();
> +       return ERR_PTR(-result);
> +}
> +
> +static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
> +                               struct i40e_rx_buffer *bi)
> +{
> +       struct xdp_umem *umem = rx_ring->xsk_umem;
> +       void *addr = bi->addr;
> +       u64 handle;
> +
> +       if (addr) {
> +               rx_ring->rx_stats.page_reuse_count++;
> +               return true;
> +       }
> +
> +       if (!xsk_umem_peek_addr(umem, &handle)) {
> +               rx_ring->rx_stats.alloc_page_failed++;
> +               return false;
> +       }
> +
> +       bi->dma = xdp_umem_get_dma(umem, handle);
> +       bi->addr = xdp_umem_get_data(umem, handle);
> +
> +       bi->dma += umem->headroom + XDP_PACKET_HEADROOM;
> +       bi->addr += umem->headroom + XDP_PACKET_HEADROOM;
> +       bi->handle = handle + umem->headroom;
> +
> +       xsk_umem_discard_addr(umem);
> +       return true;
> +}
> +
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
> +{
> +       u16 ntu = rx_ring->next_to_use;
> +       union i40e_rx_desc *rx_desc;
> +       struct i40e_rx_buffer *bi;
> +
> +       rx_desc = I40E_RX_DESC(rx_ring, ntu);
> +       bi = &rx_ring->rx_bi[ntu];
> +
> +       do {
> +               if (!i40e_alloc_frame_zc(rx_ring, bi))
> +                       goto no_buffers;
> +
> +               /* sync the buffer for use by the device */
> +               dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
> +                                                rx_ring->rx_buf_len,
> +                                                DMA_BIDIRECTIONAL);
> +
> +               /* Refresh the desc even if buffer_addrs didn't change
> +                * because each write-back erases this info.
> +                */
> +               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
> +
> +               rx_desc++;
> +               bi++;
> +               ntu++;
> +               if (unlikely(ntu == rx_ring->count)) {
> +                       rx_desc = I40E_RX_DESC(rx_ring, 0);
> +                       bi = rx_ring->rx_bi;
> +                       ntu = 0;
> +               }
> +
> +               /* clear the status bits for the next_to_use descriptor */
> +               rx_desc->wb.qword1.status_error_len = 0;
> +
> +               cleaned_count--;
> +       } while (cleaned_count);
> +
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       return false;
> +
> +no_buffers:
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       /* make sure to come back via polling to try again after
> +        * allocation failure
> +        */
> +       return true;
> +}
> +
> +static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                                   const unsigned int size)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +
> +       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> +
> +       /* we are reusing so sync this buffer for CPU use */
> +       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                     rx_buffer->dma, 0,
> +                                     size,
> +                                     DMA_BIDIRECTIONAL);
> +
> +       return rx_buffer;
> +}
> +
> +static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                   struct i40e_rx_buffer *old_buff)
> +{
> +       u64 mask = rx_ring->xsk_umem->props.chunk_mask;
> +       u64 hr = rx_ring->xsk_umem->headroom;
> +       u16 nta = rx_ring->next_to_alloc;
> +       struct i40e_rx_buffer *new_buff;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       /* transfer page from old buffer to new buffer */
> +       new_buff->dma           = old_buff->dma & mask;
> +       new_buff->addr          = (void *)((u64)old_buff->addr & mask);
> +       new_buff->handle        = old_buff->handle & mask;
> +
> +       new_buff->dma += hr + XDP_PACKET_HEADROOM;
> +       new_buff->addr += hr + XDP_PACKET_HEADROOM;
> +       new_buff->handle += hr;
> +}
> +
> +/* Called from the XDP return API in NAPI context. */
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       struct i40e_ring *rx_ring;
> +       u64 mask;
> +       u16 nta;
> +
> +       rx_ring = container_of(alloc, struct i40e_ring, zca);
> +       mask = rx_ring->xsk_umem->props.chunk_mask;
> +
> +       nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       handle &= mask;
> +
> +       new_buff->dma           = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
> +       new_buff->addr          = xdp_umem_get_data(rx_ring->xsk_umem, handle);
> +       new_buff->handle        = (u64)handle;
> +
> +       new_buff->dma += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> +       new_buff->addr += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> +       new_buff->handle += rx_ring->xsk_umem->headroom;
> +}
> +
> +static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
> +                                           struct i40e_rx_buffer *rx_buffer,
> +                                           struct xdp_buff *xdp)
> +{
> +       /* XXX implement alloc skb and copy */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       return NULL;
> +}
> +
> +static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
> +                                            union i40e_rx_desc *rx_desc,
> +                                            u64 qw)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +       u32 ntc = rx_ring->next_to_clean;
> +       u8 id;
> +
> +       /* fetch, update, and store next to clean */
> +       rx_buffer = &rx_ring->rx_bi[ntc++];
> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> +       rx_ring->next_to_clean = ntc;
> +
> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +       /* place unused page back on the ring */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       rx_ring->rx_stats.page_reuse_count++;
> +
> +       /* clear contents of buffer_info */
> +       rx_buffer->addr = NULL;
> +
> +       id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> +                 I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
> +
> +       if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> +               i40e_fd_handle_status(rx_ring, rx_desc, id);
> +}
> +
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> +{
> +       unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> +       bool failure = false, xdp_xmit = false;
> +       struct sk_buff *skb;
> +       struct xdp_buff xdp;
> +
> +       xdp.rxq = &rx_ring->xdp_rxq;
> +
> +       while (likely(total_rx_packets < (unsigned int)budget)) {
> +               struct i40e_rx_buffer *rx_buffer;
> +               union i40e_rx_desc *rx_desc;
> +               unsigned int size;
> +               u16 vlan_tag;
> +               u8 rx_ptype;
> +               u64 qword;
> +               u32 ntc;
> +
> +               /* return some buffers to hardware, one at a time is too slow */
> +               if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> +                       failure = failure ||
> +                                 i40e_alloc_rx_buffers_zc(rx_ring,
> +                                                          cleaned_count);
> +                       cleaned_count = 0;
> +               }
> +
> +               rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
> +
> +               /* status_error_len will always be zero for unused descriptors
> +                * because it's cleared in cleanup, and overlaps with hdr_addr
> +                * which is always zero because packet split isn't used, if the
> +                * hardware wrote DD then the length will be non-zero
> +                */
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +
> +               /* This memory barrier is needed to keep us from reading
> +                * any other fields out of the rx_desc until we have
> +                * verified the descriptor has been written back.
> +                */
> +               dma_rmb();
> +
> +               if (unlikely(i40e_rx_is_programming_status(qword))) {
> +                       i40e_clean_programming_status_zc(rx_ring, rx_desc,
> +                                                        qword);
> +                       cleaned_count++;
> +                       continue;
> +               }
> +               size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> +                      I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> +               if (!size)
> +                       break;
> +
> +               rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
> +
> +               /* retrieve a buffer from the ring */
> +               xdp.data = rx_buffer->addr;
> +               xdp_set_data_meta_invalid(&xdp);
> +               xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
> +               xdp.data_end = xdp.data + size;
> +               xdp.handle = rx_buffer->handle;
> +
> +               skb = i40e_run_xdp_zc(rx_ring, &xdp);
> +
> +               if (IS_ERR(skb)) {
> +                       if (PTR_ERR(skb) == -I40E_XDP_TX)
> +                               xdp_xmit = true;
> +                       else
> +                               i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +                       total_rx_bytes += size;
> +                       total_rx_packets++;
> +               } else {
> +                       skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
> +                       if (!skb) {
> +                               rx_ring->rx_stats.alloc_buff_failed++;
> +                               break;
> +                       }
> +               }
> +
> +               rx_buffer->addr = NULL;
> +               cleaned_count++;
> +
> +               /* don't care about non-EOP frames in XDP mode */
> +               ntc = rx_ring->next_to_clean + 1;
> +               ntc = (ntc < rx_ring->count) ? ntc : 0;
> +               rx_ring->next_to_clean = ntc;
> +               prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +               if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
> +                       skb = NULL;
> +                       continue;
> +               }
> +
> +               /* probably a little skewed due to removing CRC */
> +               total_rx_bytes += skb->len;
> +
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +               rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
> +                          I40E_RXD_QW1_PTYPE_SHIFT;
> +
> +               /* populate checksum, VLAN, and protocol */
> +               i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
> +
> +               vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
> +                          le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
> +
> +               i40e_receive_skb(rx_ring, skb, vlan_tag);
> +               skb = NULL;
> +
> +               /* update budget accounting */
> +               total_rx_packets++;
> +       }
> +
> +       if (xdp_xmit) {
> +               struct i40e_ring *xdp_ring =
> +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +               xdp_do_flush_map();
> +       }
> +
> +       u64_stats_update_begin(&rx_ring->syncp);
> +       rx_ring->stats.packets += total_rx_packets;
> +       rx_ring->stats.bytes += total_rx_bytes;
> +       u64_stats_update_end(&rx_ring->syncp);
> +       rx_ring->q_vector->rx.total_packets += total_rx_packets;
> +       rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
> +
> +       /* guarantee a trip back through this routine if there was a failure */
> +       return failure ? budget : (int)total_rx_packets;
> +}
> +

You should really look at adding comments to the code you are adding.
>From what I can tell almost all of the code comments were just copied
exactly from the original functions in the i40e_txrx.c file.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> new file mode 100644
> index 000000000000..757ac5ca8511
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2018 Intel Corporation. */
> +
> +#ifndef _I40E_XSK_H_
> +#define _I40E_XSK_H_
> +
> +struct i40e_vsi;
> +struct xdp_umem;
> +struct zero_copy_allocator;
> +
> +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                       u16 qid);
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +
> +#endif /* _I40E_XSK_H_ */
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 9fe472f2ac95..ec8fd3314097 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -94,6 +94,25 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
>  {
>         return false;
>  }
> +
> +static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
> +{
> +       return NULL;
> +}
> +
> +static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
> +{
> +}
>  #endif /* CONFIG_XDP_SOCKETS */
>
> +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> +{
> +       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> +}
> +
> +static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> +{
> +       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> +}
> +
>  #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index f11560334f88..c8be1ad3eb88 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -8,16 +8,6 @@
>
>  #include <net/xdp_sock.h>
>
> -static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> -{
> -       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> -}
> -
> -static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> -{
> -       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> -}
> -
>  int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
>                         u32 queue_id, u16 flags);
>  bool xdp_umem_validate_queues(struct xdp_umem *umem);
> --
> 2.14.1
>

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: David Miller @ 2018-06-04 20:30 UTC (permalink / raw)
  To: gerlitz.or; +Cc: jeffrey.t.kirsher, netdev
In-Reply-To: <CAJ3xEMjZtVDorZLYsZ-CC534LZ=zdFYRtO2YQ7oGRC3StTb=9g@mail.gmail.com>

From: Or Gerlitz <gerlitz.or@gmail.com>
Date: Mon, 4 Jun 2018 23:27:57 +0300

> On Mon, Jun 4, 2018 at 8:56 PM, Jeff Kirsher
> <jeffrey.t.kirsher@intel.com> wrote:
>> This series contains a smorgasbord of updates to documentation, e1000e,
>> igb, ixgbe, ixgbevf and i40e.
> 
> Dave,
> 
> Did you forgot to flip the sign on the shop's door [1]?
> 
> Or.
> 
> [1] http://vger.kernel.org/~davem/net-next.html

It's open a day or two more to deal with the AF_XDP issues...

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: Or Gerlitz @ 2018-06-04 20:27 UTC (permalink / raw)
  To: Jeff Kirsher, David Miller; +Cc: Linux Netdev List
In-Reply-To: <20180604175644.24293-1-jeffrey.t.kirsher@intel.com>

On Mon, Jun 4, 2018 at 8:56 PM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
> This series contains a smorgasbord of updates to documentation, e1000e,
> igb, ixgbe, ixgbevf and i40e.

Dave,

Did you forgot to flip the sign on the shop's door [1]?

Or.

[1] http://vger.kernel.org/~davem/net-next.html

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
From: Jeff Kirsher @ 2018-06-04 20:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel
  Cc: mykyta.iziumtsev, mst, brian.brooks, magnus.karlsson, andy,
	francois.ozog, willemdebruijn.kernel, daniel, ast,
	intel-wired-lan, brouer, Björn Töpel, michael.lundkvist,
	qi.z.zhang, michael.chan, magnus.karlsson, netdev,
	ilias.apalodimas
In-Reply-To: <20180604163838.5pzojvzrxd2cusny@ast-mbp>

[-- Attachment #1: Type: text/plain, Size: 5343 bytes --]

On Mon, 2018-06-04 at 09:38 -0700, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> > 
> > This patch serie introduces zerocopy (ZC) support for
> > AF_XDP. Programs using AF_XDP sockets will now receive RX packets
> > without any copies and can also transmit packets without incurring
> > any
> > copies. No modifications to the application are needed, but the NIC
> > driver needs to be modified to support ZC. If ZC is not supported
> > by
> > the driver, the modes introduced in the AF_XDP patch will be
> > used. Using ZC in our micro benchmarks results in significantly
> > improved performance as can be seen in the performance section
> > later
> > in this cover letter.
> > 
> > Note that for an untrusted application, HW packet steering to a
> > specific queue pair (the one associated with the application) is a
> > requirement when using ZC, as the application would otherwise be
> > able
> > to see other user space processes' packets. If the HW cannot
> > support
> > the required packet steering you need to use the XDP_SKB mode or
> > the
> > XDP_DRV mode without ZC turned on. The XSKMAP introduced in the
> > AF_XDP
> > patch set can be used to do load balancing in that case.
> > 
> > For benchmarking, you can use the xdpsock application from the
> > AF_XDP
> > patch set without any modifications. Say that you would like your
> > UDP
> > traffic from port 4242 to end up in queue 16, that we will enable
> > AF_XDP on. Here, we use ethtool for this:
> > 
> >       ethtool -N p3p2 rx-flow-hash udp4 fn
> >       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
> >           action 16
> > 
> > Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then
> > be
> > done using:
> > 
> >       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> > 
> > We have run some benchmarks on a dual socket system with two
> > Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has
> > 14
> > cores which gives a total of 28, but only two cores are used in
> > these
> > experiments. One for TR/RX and one for the user space application.
> > The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of
> > total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0.
> > The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> > 
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator
> > HW
> > outputing packets at full 40 Gbit/s line rate. The results are
> > without
> > retpoline so that we can compare against previous numbers. 
> > 
> > AF_XDP performance 64 byte packets. Results from the AF_XDP V3
> > patch
> > set are also reported for ease of reference. The numbers within
> > parantheses are from the RFC V1 ZC patch set.
> > Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> > rxdrop       2.9*       9.6*       21.1(21.5)
> > txpush       2.6*       -          22.0(21.6)
> > l2fwd        1.9*       2.5*       15.3(15.0)
> > 
> > AF_XDP performance 1500 byte packets:
> > Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> > rxdrop       2.1*       3.3*       3.3(3.3)
> > l2fwd        1.4*       1.8*       3.1(3.1)
> > 
> > * From AF_XDP V3 patch set and cover letter.
> > 
> > So why do we not get higher values for RX similar to the 34 Mpps we
> > had in AF_PACKET V4? We made an experiment running the rxdrop
> > benchmark without using the xdp_do_redirect/flush infrastructure
> > nor
> > using an XDP program (all traffic on a queue goes to one
> > socket). Instead the driver acts directly on the AF_XDP socket.
> > With
> > this we got 36.9 Mpps, a significant improvement without any change
> > to
> > the uapi. So not forcing users to have an XDP program if they do
> > not
> > need it, might be a good idea. This measurement is actually higher
> > than what we got with AF_PACKET V4.
> > 
> > XDP performance on our system as a base line:
> > 
> > 64 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      32.3M  0
> > 
> > 1500 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      3.3M    0
> > 
> > The structure of the patch set is as follows:
> > 
> > Patches 1-3: Plumbing for AF_XDP ZC support
> > Patches 4-5: AF_XDP ZC for RX
> > Patches 6-7: AF_XDP ZC for TX
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> for above patches
> 
> > Patch 8-10: ZC support for i40e.
> 
> these also look good to me.
> would be great if i40e experts take a look at them asap.
> 
> If there are no major objections we'd like to merge all of it
> for this merge window.

We would like a bit more time to review and test the changes, I
understand your eagerness for wanting this to get into 4.18 but this
change is large enough that a 24-48 hour review time is not prudent,
IMHO.

Alex also has requested for more time so that he can review the changes
as well.  I will go ahead and put the entire series in my tree so that
our validation team can start to "kick the tires".

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH net-next] net: phy: broadcom: Enable 125 MHz clock on LED4 pin for BCM54612E by default.
From: Kun Yi @ 2018-06-04 20:17 UTC (permalink / raw)
  To: davem, kunyi
  Cc: netdev, Avi.Fishman, tali.perry, tomer.maimon, benjaminfair,
	rlippert, f.fainelli

BCM54612E have 4 multi-functional LED pins that can be configured
through register setting; the LED4 pin can be configured to a 125MHz
reference clock output by setting the spare register. Since the dedicated
CLK125 reference clock pin is not brought out on the 48-Pin MLP, the LED4
pin is the only pin to provide such function in this package, and therefore
it is beneficial to just enable the reference clock by default.

Signed-off-by: Kun Yi <kunyi@google.com>
---
 drivers/net/phy/broadcom.c | 16 ++++++++++++++--
 include/linux/brcmphy.h    |  4 ++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index f9c25912eb98..e86ea105c802 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -54,6 +54,8 @@ static int bcm54210e_config_init(struct phy_device *phydev)
 
 static int bcm54612e_config_init(struct phy_device *phydev)
 {
+	int reg;
+
 	/* Clear TX internal delay unless requested. */
 	if ((phydev->interface != PHY_INTERFACE_MODE_RGMII_ID) &&
 	    (phydev->interface != PHY_INTERFACE_MODE_RGMII_TXID)) {
@@ -65,8 +67,6 @@ static int bcm54612e_config_init(struct phy_device *phydev)
 	/* Clear RX internal delay unless requested. */
 	if ((phydev->interface != PHY_INTERFACE_MODE_RGMII_ID) &&
 	    (phydev->interface != PHY_INTERFACE_MODE_RGMII_RXID)) {
-		u16 reg;
-
 		reg = bcm54xx_auxctl_read(phydev,
 					  MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
 		/* Disable RXD to RXC delay (default set) */
@@ -77,6 +77,18 @@ static int bcm54612e_config_init(struct phy_device *phydev)
 				     MII_BCM54XX_AUXCTL_MISC_WREN | reg);
 	}
 
+	/* Enable CLK125 MUX on LED4 if ref clock is enabled. */
+	if (!(phydev->dev_flags & PHY_BRCM_RX_REFCLK_UNUSED)) {
+		int err;
+
+		reg = bcm_phy_read_exp(phydev, BCM54612E_EXP_SPARE0);
+		err = bcm_phy_write_exp(phydev, BCM54612E_EXP_SPARE0,
+					BCM54612E_LED4_CLK125OUT_EN | reg);
+
+		if (err < 0)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index b324e01ccf2d..daa9234a9baf 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -85,6 +85,7 @@
 #define MII_BCM54XX_EXP_SEL	0x17	/* Expansion register select */
 #define MII_BCM54XX_EXP_SEL_SSD	0x0e00	/* Secondary SerDes select */
 #define MII_BCM54XX_EXP_SEL_ER	0x0f00	/* Expansion register select */
+#define MII_BCM54XX_EXP_SEL_ETC	0x0d00	/* Expansion register spare + 2k mem */
 
 #define MII_BCM54XX_AUX_CTL	0x18	/* Auxiliary control register */
 #define MII_BCM54XX_ISR		0x1a	/* BCM54xx interrupt status register */
@@ -219,6 +220,9 @@
 #define BCM54810_SHD_CLK_CTL			0x3
 #define BCM54810_SHD_CLK_CTL_GTXCLK_EN		(1 << 9)
 
+/* BCM54612E Registers */
+#define BCM54612E_EXP_SPARE0		(MII_BCM54XX_EXP_SEL_ETC + 0x34)
+#define BCM54612E_LED4_CLK125OUT_EN	(1 << 1)
 
 /*****************************************************************************/
 /* Fast Ethernet Transceiver definitions. */
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

* Re: [bug] cxgb4: vrf stopped working with cxgb4 card
From: AMG Zollner Robert @ 2018-06-04 20:14 UTC (permalink / raw)
  To: David Ahern, ganeshgr; +Cc: netdev
In-Reply-To: <8073c78c-3243-d7f3-55c3-2cc1a2153366@cumulusnetworks.com>

Yes, I was enslaving while the interface was up.

Just tested some of the builds that where not working earlier and they 
are working if I keep the interface down when enslaving as you suggested.

Is this the expected behavior?

Thank you,
Zollner Robert


On 04.06.2018 21:17, David Ahern wrote:
> On 6/4/18 8:03 AM, AMG Zollner Robert wrote:
>> I have noticed that vrf is not working with kernel v4.15.0 but was
>> working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)
>>
>> Setup:
>> Two metal servers with a T520-cr card each, directly connected without a
>> switch in between.
>>
>>         SVR1  only ipfwd                 SVR2     with vrf
>> .----------------------------. .----------------------------------.
>> |                            |         |             |
>> |    192.168.8.1 [  ens2f4]--|---------|--[ens1f4] 192.168.8.2   |
>> |    192.168.9.1 [ens2f4d1]--|---------|--<ens1f4d1> 192.168.9.2 VRF=10   |
>> `----------------------------' `----------------------------------'
>>
>> When vrf is not working there are no error messages (dmesg or iproute
>> commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10
>> shows packets(arp req/reply) coming in and going out, but outgoing
>> packets(arp reply) do not reach the other server SVR1.ens2f4d1
>>
>>
>> Bisect:
>> Found this commit to be the problem after doing a git bisect between
>> v4.13..v4.15:
>>
>> commit ba581f77df23c8ee70b372966e69cf10bc5453d8
>> Author: Ganesh Goudar <ganeshgr@chelsio.com>
>> Date:   Sat Sep 23 16:07:28 2017 +0530
>>
>>      cxgb4: do DCB state reset in couple of places
>>
>>      reset the driver's DCB state in couple of places
>>      where it was missing.
>>
>>
>> A bisect step was considered good when:
>> - successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
>> - successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3
>> forwarding) (this check was redundant,both tests fail or pass simultaneous)
>>
>> The problem is still present on recent kernels also, checked v4.16.0 and
>> v4.17.rc7
>>
>> Disabling DCB for the card support fixes the problem ( Compiling kernel
>> with "CONFIG_CHELSIO_T4_DCB=n")
>>
> Are you doing the VRF enslave while it is up?
>
> If so, does it work ok if you change the sequence:
>
> ip li set ens1f4d1 down
> ip li set ens1f4d1 master <VRF>
> ip li set ens1f4d1 up

^ permalink raw reply

* Re: [PATCH net] rxrpc: Fix handling of call quietly cancelled out on server
From: David Miller @ 2018-06-04 20:06 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs, linux-kernel
In-Reply-To: <152798865948.5814.10196893914153717731.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Sun, 03 Jun 2018 02:17:39 +0100

> Sometimes an in-progress call will stop responding on the fileserver when
> the fileserver quietly cancels the call with an internally marked abort
> (RX_CALL_DEAD), without sending an ABORT to the client.
> 
> This causes the client's call to eventually expire from lack of incoming
> packets directed its way, which currently leads to it being cancelled
> locally with ETIME.  Note that it's not currently clear as to why this
> happens as it's really hard to reproduce.
> 
> The rotation policy implement by kAFS, however, doesn't differentiate
> between ETIME meaning we didn't get any response from the server and ETIME
> meaning the call got cancelled mid-flow.  The latter leads to an oops when
> fetching data as the rotation partially resets the afs_read descriptor,
> which can result in a cleared page pointer being dereferenced because that
> page has already been filled.
 ...
> Signed-off-by: David Howells <dhowells@redhat.com>

Applied, thanks David.

^ permalink raw reply

* Re: [PATCH net-next] Allow ethtool to change tun link settings
From: David Miller @ 2018-06-04 20:05 UTC (permalink / raw)
  To: 3chas3; +Cc: netdev
In-Reply-To: <20180602214953.22866-1-3chas3@gmail.com>

From: Chas Williams <3chas3@gmail.com>
Date: Sat,  2 Jun 2018 17:49:53 -0400

> Let user space set whatever it would like to advertise for the
> tun interface.  Preserve the existing defaults.
> 
> Signed-off-by: Chas Williams <3chas3@gmail.com>

This looks fine, applied.

^ permalink raw reply

* Re: [bpf-next PATCH] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-04 19:59 UTC (permalink / raw)
  To: John Fastabend, edumazet, ast; +Cc: netdev
In-Reply-To: <20180604152125.6930.88723.stgit@john-Precision-Tower-5810>

On 06/04/2018 05:21 PM, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used. Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>    /* The TLS ulp is currently supported only for TCP sockets
>     * in ESTABLISHED state.
>     * Supporting sockets in LISTEN state will require us
>     * to modify the accept implementation to clone rather then
>     * share the ulp context.
>     */
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Wei Wang <weiwan@google.com>

Applied to bpf-next, thanks everyone!

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPS
From: Daniel Borkmann @ 2018-06-04 19:53 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180604155341.1517003-1-yhs@fb.com>

On 06/04/2018 05:53 PM, Yonghong Song wrote:
> Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id()
> helper") introduced a new helper bpf_get_current_cgroup_id().
> The helper has a dependency on CONFIG_CGROUPS.
> 
> When CONFIG_CGROUPS is not defined, using the helper will result
> the following verifier error:
>   kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
> which is hard for users to interpret.
> Guarding the reference to bpf_get_current_cgroup_id_proto with
> CONFIG_CGROUPS will result in below better message:
>   unknown func bpf_get_current_cgroup_id#80
> 
> Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper")
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf-next, thanks Yonghong!

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
From: Daniel Borkmann @ 2018-06-04 19:51 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan
In-Reply-To: <20180604162429.zu4uno6fviz4pfte@ast-mbp>

On 06/04/2018 06:24 PM, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 01:57:10PM +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
>> https://www.spinics.net/lists/netdev/msg503664.html) is that it does
>> not support NICs that have a "type-writer" model in an efficient
>> way. In this model, a memory window is passed to the hardware and
>> multiple frames might be filled into that window, instead of just one
>> that we have in the current fixed frame-size model.
>>
>> This patch set fixes two bugs in the current implementation and then
>> changes the uapi so that the type-writer model can be supported
>> efficiently by a possible future extension of AF_XDP.
>>
>> These are the uapi changes in this patch:
>>
>> * Change the "u32 idx" in the descriptors to "u64 addr". The current
>>   idx based format does NOT work for the type-writer model (as packets
>>   can start anywhere within a frame) but that a relative address
>>   pointer (the u64 addr) works well for both models in the prototype
>>   code we have that supports both models. We increased it from u32 to
>>   u64 to support umems larger than 4G. We have also removed the u16
>>   offset when having a "u64 addr" since that information is already
>>   carried in the least significant bits of the address.
>>
>> * We want to use "u8 padding[5]" for something useful in the future
>>   (since we are not allowed to change its name), so we now call it
>>   just options so it can be extended for various purposes in the
>>   future. It is an u32 as that it what is left of the 16 byte
>>   descriptor.
>>
>> * We changed the name of frame_size in the UMEM_REG setsockopt to
>>   chunk_size since this naming also makes sense to the type-writer
>>   model.
>>
>> With these changes to the uapi, we believe the type-writer model can
>> be supported without having to resort to a new descriptor format. The
>> type-writer model could then be supported, from the uapi point of
>> view, by setting a flag at bind time and providing a new flag bit in
>> the options field of the descriptor that signals to user space that
>> all packets have been written in a chunk. Or with a new chunk
>> completion queue as suggested by Mykyta in his latest feedback mail on
>> the list.
> 
> for the set:
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Thank you for these fixes.
> According to unofficial feedback from brcm and netronome folks
> the descriptor format should work for these nics too.
> At some point we may consider second format, but I think SW
> should drive HW requirements and not the other way around.

LGTM as well, applied to bpf-next, thanks!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox