Intel-Wired-Lan Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [Intel-wired-lan] [PATCH v2] e1000e: Don't return uninitialized stats
From: Jeff Kirsher @ 2017-05-19  8:16 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170518.104614.1739169308591707817.davem@davemloft.net>

On Thu, 2017-05-18 at 10:46 -0400, David Miller wrote:
> From: Benjamin Poirier <bpoirier@suse.com>
> Date: Wed, 17 May 2017 16:24:13 -0400
> 
> > Some statistics passed to ethtool are garbage because
> > e1000e_get_stats64()
> > doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
> > memory to userspace and confuses users.
> > 
> > Do like ixgbe and use dev_get_stats() which first zeroes out
> > rtnl_link_stats64.
> > 
> > Fixes: 5944701df90d ("net: remove useless memset's in drivers
> > get_stats64")
> > Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> > Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
> 
> Jeff, please be sure to pick this up, thanks.

Yep, I have it in my tree, thanks.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20170519/af01001c/attachment.asc>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v5 1/2] i40e: add XDP support for pass and drop actions
From: Alexander Duyck @ 2017-05-19 12:29 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170519070856.13900-2-bjorn.topel@gmail.com>

On Fri, May 19, 2017 at 12:08 AM, Bj?rn T?pel <bjorn.topel@gmail.com> wrote:
> From: Bj?rn T?pel <bjorn.topel@intel.com>
>
> This commit adds basic XDP support for i40e derived NICs. All XDP
> actions will end up in XDP_DROP.
>
> Signed-off-by: Bj?rn T?pel <bjorn.topel@intel.com>

I have called out a few minor cosmetic issues that I would like to see
fixed, but otherwise the patch looks good to me and the issues I saw
didn't impact functionality.

Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h      |   7 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  75 ++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 130 +++++++++++++++++++++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |   1 +
>  4 files changed, 182 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 395ca94faf80..d3195b29d53c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -645,6 +645,8 @@ struct i40e_vsi {
>         u16 max_frame;
>         u16 rx_buf_len;
>
> +       struct bpf_prog *xdp_prog;
> +
>         /* List of q_vectors allocated to this VSI */
>         struct i40e_q_vector **q_vectors;
>         int num_q_vectors;
> @@ -972,4 +974,9 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
> +
> +static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
> +{
> +       return !!vsi->xdp_prog;
> +}
>  #endif /* _I40E_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 8d1d3b859af7..27a29904611a 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -27,6 +27,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
> +#include <linux/bpf.h>
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -2408,6 +2409,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>         struct i40e_vsi *vsi = np->vsi;
>         struct i40e_pf *pf = vsi->back;
>
> +       if (i40e_enabled_xdp_vsi(vsi)) {
> +               int frame_size = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> +
> +               if (frame_size > vsi->rx_buf_len)
> +                       return -EINVAL;
> +       }
> +
>         netdev_info(netdev, "changing MTU from %d to %d\n",
>                     netdev->mtu, new_mtu);
>         netdev->mtu = new_mtu;
> @@ -9310,6 +9318,72 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>  }
>
> +/**
> + * i40e_xdp_setup - add/remove an XDP program
> + * @vsi: VSI to changed
> + * @prog: XDP program
> + **/
> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
> +                         struct bpf_prog *prog)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       int frame_size = vsi->netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> +       int i;
> +       bool need_reset;
> +       struct bpf_prog *old_prog;
> +
> +       /* Don't allow frames that span over multiple buffers */
> +       if (frame_size > vsi->rx_buf_len)
> +               return -EINVAL;
> +
> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
> +               return 0;
> +
> +       /* When turning XDP on->off/off->on we reset and rebuild the rings. */
> +       need_reset = (i40e_enabled_xdp_vsi(vsi) != !!prog);
> +
> +       if (need_reset)
> +               i40e_prep_for_reset(pf, true);
> +
> +       old_prog = xchg(&vsi->xdp_prog, prog);
> +
> +       if (need_reset)
> +               i40e_reset_and_rebuild(pf, true, true);
> +
> +       for (i = 0; i < vsi->num_queue_pairs; i++)
> +               WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> +
> +       if (old_prog)
> +               bpf_prog_put(old_prog);
> +
> +       return 0;
> +}
> +
> +/**
> + * i40e_xdp - implements ndo_xdp for i40e
> + * @dev: netdevice
> + * @xdp: XDP command
> + **/
> +static int i40e_xdp(struct net_device *dev,
> +                   struct netdev_xdp *xdp)
> +{
> +       struct i40e_netdev_priv *np = netdev_priv(dev);
> +       struct i40e_vsi *vsi = np->vsi;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       switch (xdp->command) {
> +       case XDP_SETUP_PROG:
> +               return i40e_xdp_setup(vsi, xdp->prog);
> +       case XDP_QUERY_PROG:
> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
> +               return 0;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
>  static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_open               = i40e_open,
>         .ndo_stop               = i40e_close,
> @@ -9342,6 +9416,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_features_check     = i40e_features_check,
>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
> +       .ndo_xdp                = i40e_xdp,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index af554f3cda19..0c2f0230faf4 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -26,6 +26,7 @@
>
>  #include <linux/prefetch.h>
>  #include <net/busy_poll.h>
> +#include <linux/bpf_trace.h>
>  #include "i40e.h"
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
> @@ -1195,6 +1196,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>  void i40e_free_rx_resources(struct i40e_ring *rx_ring)
>  {
>         i40e_clean_rx_ring(rx_ring);
> +       rx_ring->xdp_prog = NULL;
>         kfree(rx_ring->rx_bi);
>         rx_ring->rx_bi = NULL;
>
> @@ -1241,6 +1243,8 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
>
> +       rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
> +
>         return 0;
>  err:
>         kfree(rx_ring->rx_bi);
> @@ -1592,6 +1596,7 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>  /**
>   * i40e_cleanup_headers - Correct empty headers
>   * @rx_ring: rx descriptor ring packet is being transacted on
> + * @rx_desc: pointer to the EOP Rx descriptor
>   * @skb: pointer to current skb being fixed
>   *
>   * Also address the case where we are pulling data in on pages only
> @@ -1602,8 +1607,25 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>   *
>   * Returns true if an error was encountered and skb was freed.
>   **/
> -static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb)
> +static bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
> +                                union i40e_rx_desc *rx_desc,
> +                                struct sk_buff *skb)
>  {
> +       /* XDP packets use error pointer so abort at this point */
> +       if (IS_ERR(skb))
> +               return true;
> +
> +       /* ERR_MASK will only have valid bits if EOP set, and
> +        * what we are doing here is actually checking
> +        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> +        * the error field
> +        */
> +       if (unlikely(i40e_test_staterr(rx_desc,
> +                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> +               dev_kfree_skb_any(skb);
> +               return true;
> +       }
> +
>         /* if eth_skb_pad returns an error the skb was freed */
>         if (eth_skb_pad(skb))
>                 return true;
> @@ -1783,10 +1805,10 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring *rx_ring,
>   * skb correctly.
>   */
>  static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
> -                                         struct i40e_rx_buffer *rx_buffer,
> -                                         unsigned int size)
> +                                         struct xdp_buff *xdp,
> +                                         struct i40e_rx_buffer *rx_buffer)

One minor request here. Let's leave rx_buffer in place and instead
just replace size with the xdp pointer. It helps to reduce some of the
noise in this and will make maintenance easier for the out-of-tree
driver as we will likely just be replacing xdp with size on older
kernels via our kcompat.

>  {
> -       void *va = page_address(rx_buffer->page) + rx_buffer->page_offset;
> +       unsigned int size = xdp->data_end - xdp->data;
>  #if (PAGE_SIZE < 8192)
>         unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> @@ -1796,9 +1818,9 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>         struct sk_buff *skb;
>
>         /* prefetch first cache line of first page */
> -       prefetch(va);
> +       prefetch(xdp->data);
>  #if L1_CACHE_BYTES < 128
> -       prefetch(va + L1_CACHE_BYTES);
> +       prefetch(xdp->data + L1_CACHE_BYTES);
>  #endif
>
>         /* allocate a skb to store the frags */
> @@ -1811,10 +1833,11 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>         /* Determine available headroom for copy */
>         headlen = size;
>         if (headlen > I40E_RX_HDR_SIZE)
> -               headlen = eth_get_headlen(va, I40E_RX_HDR_SIZE);
> +               headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
>
>         /* align pull length to size of long to optimize memcpy performance */
> -       memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
> +       memcpy(__skb_put(skb, headlen), xdp->data,
> +              ALIGN(headlen, sizeof(long)));
>
>         /* update all of the pointers */
>         size -= headlen;
> @@ -1847,10 +1870,10 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>   * to set up the skb correctly and avoid any memcpy overhead.
>   */
>  static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
> -                                     struct i40e_rx_buffer *rx_buffer,
> -                                     unsigned int size)
> +                                     struct xdp_buff *xdp,
> +                                     struct i40e_rx_buffer *rx_buffer)

Same here. If possible please just replace size with xdp instead of
moving the variables around.

>  {
> -       void *va = page_address(rx_buffer->page) + rx_buffer->page_offset;
> +       unsigned int size = xdp->data_end - xdp->data;
>  #if (PAGE_SIZE < 8192)
>         unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> @@ -1860,12 +1883,12 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
>         struct sk_buff *skb;
>
>         /* prefetch first cache line of first page */
> -       prefetch(va);
> +       prefetch(xdp->data);
>  #if L1_CACHE_BYTES < 128
> -       prefetch(va + L1_CACHE_BYTES);
> +       prefetch(xdp->data + L1_CACHE_BYTES);
>  #endif
>         /* build an skb around the page buffer */
> -       skb = build_skb(va - I40E_SKB_PAD, truesize);
> +       skb = build_skb(xdp->data_hard_start, truesize);
>         if (unlikely(!skb))
>                 return NULL;
>
> @@ -1944,6 +1967,46 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>         return true;
>  }
>
> +#define I40E_XDP_PASS 0
> +#define I40E_XDP_CONSUMED 1
> +
> +/**
> + * i40e_run_xdp - run an XDP program
> + * @rx_ring: Rx ring being processed
> + * @xdp: XDP buffer containing the frame
> + **/
> +static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
> +                                   struct xdp_buff *xdp)
> +{
> +       int result = I40E_XDP_PASS;
> +       struct bpf_prog *xdp_prog;
> +       u32 act;
> +
> +       rcu_read_lock();
> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> +
> +       if (!xdp_prog)
> +               goto xdp_out;
> +
> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> +       switch (act) {
> +       case XDP_PASS:
> +               break;
> +       default:
> +               bpf_warn_invalid_xdp_action(act);
> +       case XDP_TX:
> +       case XDP_ABORTED:
> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> +               /* fallthrough -- handle aborts by dropping packet */
> +       case XDP_DROP:
> +               result = I40E_XDP_CONSUMED;
> +               break;
> +       }
> +xdp_out:
> +       rcu_read_unlock();
> +       return ERR_PTR(-result);
> +}
> +
>  /**
>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @rx_ring: rx descriptor ring to transact packets on
> @@ -1970,6 +2033,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 u16 vlan_tag;
>                 u8 rx_ptype;
>                 u64 qword;
> +               struct xdp_buff xdp;

Dave usually prefers what we call a reverse Christmas tree layout for
variables. Basically you should try to place the longest variable
declaration at the top and the shortest at the bottom. So you could
probably place it above vlan tag or size depending on the length of
the line.

>
>                 /* return some buffers to hardware, one at a time is too slow */
>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> @@ -2006,12 +2070,27 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 rx_buffer = i40e_get_rx_buffer(rx_ring, size);
>
>                 /* retrieve a buffer from the ring */
> -               if (skb)
> +               if (!skb) {
> +                       xdp.data = page_address(rx_buffer->page) +
> +                                  rx_buffer->page_offset;
> +                       xdp.data_hard_start = xdp.data -
> +                                             i40e_rx_offset(rx_ring);
> +                       xdp.data_end = xdp.data + size;
> +
> +                       skb = i40e_run_xdp(rx_ring, &xdp);
> +               }
> +
> +               if (IS_ERR(skb)) {
> +                       total_rx_bytes += size;
> +                       total_rx_packets++;
> +                       rx_buffer->pagecnt_bias++;
> +               } else if (skb) {
>                         i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
> -               else if (ring_uses_build_skb(rx_ring))
> -                       skb = i40e_build_skb(rx_ring, rx_buffer, size);
> -               else
> -                       skb = i40e_construct_skb(rx_ring, rx_buffer, size);
> +               } else if (ring_uses_build_skb(rx_ring)) {
> +                       skb = i40e_build_skb(rx_ring, &xdp, rx_buffer);
> +               } else {
> +                       skb = i40e_construct_skb(rx_ring, &xdp, rx_buffer);
> +               }
>
>                 /* exit if we failed to retrieve a buffer */
>                 if (!skb) {
> @@ -2026,18 +2105,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 if (i40e_is_non_eop(rx_ring, rx_desc, skb))
>                         continue;
>
> -               /* ERR_MASK will only have valid bits if EOP set, and
> -                * what we are doing here is actually checking
> -                * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> -                * the error field
> -                */
> -               if (unlikely(i40e_test_staterr(rx_desc, BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> -                       dev_kfree_skb_any(skb);
> -                       skb = NULL;
> -                       continue;
> -               }
> -
> -               if (i40e_cleanup_headers(rx_ring, skb)) {
> +               if (i40e_cleanup_headers(rx_ring, rx_desc, skb)) {
>                         skb = NULL;
>                         continue;
>                 }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index f5de51124cae..31f0b162996f 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -360,6 +360,7 @@ struct i40e_ring {
>         void *desc;                     /* Descriptor ring memory */
>         struct device *dev;             /* Used for DMA mapping */
>         struct net_device *netdev;      /* netdev ring maps to */
> +       struct bpf_prog *xdp_prog;
>         union {
>                 struct i40e_tx_buffer *tx_bi;
>                 struct i40e_rx_buffer *rx_bi;
> --
> 2.11.0
>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v5 1/2] i40e: add XDP support for pass and drop actions
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= @ 2017-05-19 12:46 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <CAKgT0Ue4E7nDBFNBpQ+NmB9TLksLrxxxMuVs5nqaw9jn3SniBg@mail.gmail.com>

2017-05-19 14:29 GMT+02:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Fri, May 19, 2017 at 12:08 AM, Bj?rn T?pel <bjorn.topel@gmail.com> wrote:
>> From: Bj?rn T?pel <bjorn.topel@intel.com>
>>
>> This commit adds basic XDP support for i40e derived NICs. All XDP
>> actions will end up in XDP_DROP.
>>
>> Signed-off-by: Bj?rn T?pel <bjorn.topel@intel.com>
>
> I have called out a few minor cosmetic issues that I would like to see
> fixed, but otherwise the patch looks good to me and the issues I saw
> didn't impact functionality.

Thanks for the quick turn-around, Alex!

I'll address both size/xdp parameter order and Christmas tree layout
in the next spin -- that and any issues in the XDP_TX patch.


Thanks,
Bj?rn


>
> Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h      |   7 ++
>>  drivers/net/ethernet/intel/i40e/i40e_main.c |  75 ++++++++++++++++
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 130 +++++++++++++++++++++-------
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |   1 +
>>  4 files changed, 182 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 395ca94faf80..d3195b29d53c 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -645,6 +645,8 @@ struct i40e_vsi {
>>         u16 max_frame;
>>         u16 rx_buf_len;
>>
>> +       struct bpf_prog *xdp_prog;
>> +
>>         /* List of q_vectors allocated to this VSI */
>>         struct i40e_q_vector **q_vectors;
>>         int num_q_vectors;
>> @@ -972,4 +974,9 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
>> +
>> +static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>> +{
>> +       return !!vsi->xdp_prog;
>> +}
>>  #endif /* _I40E_H_ */
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index 8d1d3b859af7..27a29904611a 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -27,6 +27,7 @@
>>  #include <linux/etherdevice.h>
>>  #include <linux/of_net.h>
>>  #include <linux/pci.h>
>> +#include <linux/bpf.h>
>>
>>  /* Local includes */
>>  #include "i40e.h"
>> @@ -2408,6 +2409,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>>         struct i40e_vsi *vsi = np->vsi;
>>         struct i40e_pf *pf = vsi->back;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               int frame_size = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +               if (frame_size > vsi->rx_buf_len)
>> +                       return -EINVAL;
>> +       }
>> +
>>         netdev_info(netdev, "changing MTU from %d to %d\n",
>>                     netdev->mtu, new_mtu);
>>         netdev->mtu = new_mtu;
>> @@ -9310,6 +9318,72 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>>  }
>>
>> +/**
>> + * i40e_xdp_setup - add/remove an XDP program
>> + * @vsi: VSI to changed
>> + * @prog: XDP program
>> + **/
>> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
>> +                         struct bpf_prog *prog)
>> +{
>> +       struct i40e_pf *pf = vsi->back;
>> +       int frame_size = vsi->netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +       int i;
>> +       bool need_reset;
>> +       struct bpf_prog *old_prog;
>> +
>> +       /* Don't allow frames that span over multiple buffers */
>> +       if (frame_size > vsi->rx_buf_len)
>> +               return -EINVAL;
>> +
>> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
>> +               return 0;
>> +
>> +       /* When turning XDP on->off/off->on we reset and rebuild the rings. */
>> +       need_reset = (i40e_enabled_xdp_vsi(vsi) != !!prog);
>> +
>> +       if (need_reset)
>> +               i40e_prep_for_reset(pf, true);
>> +
>> +       old_prog = xchg(&vsi->xdp_prog, prog);
>> +
>> +       if (need_reset)
>> +               i40e_reset_and_rebuild(pf, true, true);
>> +
>> +       for (i = 0; i < vsi->num_queue_pairs; i++)
>> +               WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
>> +
>> +       if (old_prog)
>> +               bpf_prog_put(old_prog);
>> +
>> +       return 0;
>> +}
>> +
>> +/**
>> + * i40e_xdp - implements ndo_xdp for i40e
>> + * @dev: netdevice
>> + * @xdp: XDP command
>> + **/
>> +static int i40e_xdp(struct net_device *dev,
>> +                   struct netdev_xdp *xdp)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       struct i40e_vsi *vsi = np->vsi;
>> +
>> +       if (vsi->type != I40E_VSI_MAIN)
>> +               return -EINVAL;
>> +
>> +       switch (xdp->command) {
>> +       case XDP_SETUP_PROG:
>> +               return i40e_xdp_setup(vsi, xdp->prog);
>> +       case XDP_QUERY_PROG:
>> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>> +               return 0;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +}
>> +
>>  static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_open               = i40e_open,
>>         .ndo_stop               = i40e_close,
>> @@ -9342,6 +9416,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_features_check     = i40e_features_check,
>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>> +       .ndo_xdp                = i40e_xdp,
>>  };
>>
>>  /**
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index af554f3cda19..0c2f0230faf4 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -26,6 +26,7 @@
>>
>>  #include <linux/prefetch.h>
>>  #include <net/busy_poll.h>
>> +#include <linux/bpf_trace.h>
>>  #include "i40e.h"
>>  #include "i40e_trace.h"
>>  #include "i40e_prototype.h"
>> @@ -1195,6 +1196,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>>  void i40e_free_rx_resources(struct i40e_ring *rx_ring)
>>  {
>>         i40e_clean_rx_ring(rx_ring);
>> +       rx_ring->xdp_prog = NULL;
>>         kfree(rx_ring->rx_bi);
>>         rx_ring->rx_bi = NULL;
>>
>> @@ -1241,6 +1243,8 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
>>         rx_ring->next_to_clean = 0;
>>         rx_ring->next_to_use = 0;
>>
>> +       rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
>> +
>>         return 0;
>>  err:
>>         kfree(rx_ring->rx_bi);
>> @@ -1592,6 +1596,7 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>>  /**
>>   * i40e_cleanup_headers - Correct empty headers
>>   * @rx_ring: rx descriptor ring packet is being transacted on
>> + * @rx_desc: pointer to the EOP Rx descriptor
>>   * @skb: pointer to current skb being fixed
>>   *
>>   * Also address the case where we are pulling data in on pages only
>> @@ -1602,8 +1607,25 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>>   *
>>   * Returns true if an error was encountered and skb was freed.
>>   **/
>> -static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb)
>> +static bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
>> +                                union i40e_rx_desc *rx_desc,
>> +                                struct sk_buff *skb)
>>  {
>> +       /* XDP packets use error pointer so abort at this point */
>> +       if (IS_ERR(skb))
>> +               return true;
>> +
>> +       /* ERR_MASK will only have valid bits if EOP set, and
>> +        * what we are doing here is actually checking
>> +        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
>> +        * the error field
>> +        */
>> +       if (unlikely(i40e_test_staterr(rx_desc,
>> +                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
>> +               dev_kfree_skb_any(skb);
>> +               return true;
>> +       }
>> +
>>         /* if eth_skb_pad returns an error the skb was freed */
>>         if (eth_skb_pad(skb))
>>                 return true;
>> @@ -1783,10 +1805,10 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring *rx_ring,
>>   * skb correctly.
>>   */
>>  static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>> -                                         struct i40e_rx_buffer *rx_buffer,
>> -                                         unsigned int size)
>> +                                         struct xdp_buff *xdp,
>> +                                         struct i40e_rx_buffer *rx_buffer)
>
> One minor request here. Let's leave rx_buffer in place and instead
> just replace size with the xdp pointer. It helps to reduce some of the
> noise in this and will make maintenance easier for the out-of-tree
> driver as we will likely just be replacing xdp with size on older
> kernels via our kcompat.
>
>>  {
>> -       void *va = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +       unsigned int size = xdp->data_end - xdp->data;
>>  #if (PAGE_SIZE < 8192)
>>         unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>>  #else
>> @@ -1796,9 +1818,9 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>>         struct sk_buff *skb;
>>
>>         /* prefetch first cache line of first page */
>> -       prefetch(va);
>> +       prefetch(xdp->data);
>>  #if L1_CACHE_BYTES < 128
>> -       prefetch(va + L1_CACHE_BYTES);
>> +       prefetch(xdp->data + L1_CACHE_BYTES);
>>  #endif
>>
>>         /* allocate a skb to store the frags */
>> @@ -1811,10 +1833,11 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>>         /* Determine available headroom for copy */
>>         headlen = size;
>>         if (headlen > I40E_RX_HDR_SIZE)
>> -               headlen = eth_get_headlen(va, I40E_RX_HDR_SIZE);
>> +               headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
>>
>>         /* align pull length to size of long to optimize memcpy performance */
>> -       memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
>> +       memcpy(__skb_put(skb, headlen), xdp->data,
>> +              ALIGN(headlen, sizeof(long)));
>>
>>         /* update all of the pointers */
>>         size -= headlen;
>> @@ -1847,10 +1870,10 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
>>   * to set up the skb correctly and avoid any memcpy overhead.
>>   */
>>  static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
>> -                                     struct i40e_rx_buffer *rx_buffer,
>> -                                     unsigned int size)
>> +                                     struct xdp_buff *xdp,
>> +                                     struct i40e_rx_buffer *rx_buffer)
>
> Same here. If possible please just replace size with xdp instead of
> moving the variables around.
>
>>  {
>> -       void *va = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +       unsigned int size = xdp->data_end - xdp->data;
>>  #if (PAGE_SIZE < 8192)
>>         unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>>  #else
>> @@ -1860,12 +1883,12 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
>>         struct sk_buff *skb;
>>
>>         /* prefetch first cache line of first page */
>> -       prefetch(va);
>> +       prefetch(xdp->data);
>>  #if L1_CACHE_BYTES < 128
>> -       prefetch(va + L1_CACHE_BYTES);
>> +       prefetch(xdp->data + L1_CACHE_BYTES);
>>  #endif
>>         /* build an skb around the page buffer */
>> -       skb = build_skb(va - I40E_SKB_PAD, truesize);
>> +       skb = build_skb(xdp->data_hard_start, truesize);
>>         if (unlikely(!skb))
>>                 return NULL;
>>
>> @@ -1944,6 +1967,46 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>>         return true;
>>  }
>>
>> +#define I40E_XDP_PASS 0
>> +#define I40E_XDP_CONSUMED 1
>> +
>> +/**
>> + * i40e_run_xdp - run an XDP program
>> + * @rx_ring: Rx ring being processed
>> + * @xdp: XDP buffer containing the frame
>> + **/
>> +static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>> +                                   struct xdp_buff *xdp)
>> +{
>> +       int result = I40E_XDP_PASS;
>> +       struct bpf_prog *xdp_prog;
>> +       u32 act;
>> +
>> +       rcu_read_lock();
>> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
>> +
>> +       if (!xdp_prog)
>> +               goto xdp_out;
>> +
>> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
>> +       switch (act) {
>> +       case XDP_PASS:
>> +               break;
>> +       default:
>> +               bpf_warn_invalid_xdp_action(act);
>> +       case XDP_TX:
>> +       case XDP_ABORTED:
>> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
>> +               /* fallthrough -- handle aborts by dropping packet */
>> +       case XDP_DROP:
>> +               result = I40E_XDP_CONSUMED;
>> +               break;
>> +       }
>> +xdp_out:
>> +       rcu_read_unlock();
>> +       return ERR_PTR(-result);
>> +}
>> +
>>  /**
>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @rx_ring: rx descriptor ring to transact packets on
>> @@ -1970,6 +2033,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 u16 vlan_tag;
>>                 u8 rx_ptype;
>>                 u64 qword;
>> +               struct xdp_buff xdp;
>
> Dave usually prefers what we call a reverse Christmas tree layout for
> variables. Basically you should try to place the longest variable
> declaration at the top and the shortest at the bottom. So you could
> probably place it above vlan tag or size depending on the length of
> the line.
>
>>
>>                 /* return some buffers to hardware, one at a time is too slow */
>>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
>> @@ -2006,12 +2070,27 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 rx_buffer = i40e_get_rx_buffer(rx_ring, size);
>>
>>                 /* retrieve a buffer from the ring */
>> -               if (skb)
>> +               if (!skb) {
>> +                       xdp.data = page_address(rx_buffer->page) +
>> +                                  rx_buffer->page_offset;
>> +                       xdp.data_hard_start = xdp.data -
>> +                                             i40e_rx_offset(rx_ring);
>> +                       xdp.data_end = xdp.data + size;
>> +
>> +                       skb = i40e_run_xdp(rx_ring, &xdp);
>> +               }
>> +
>> +               if (IS_ERR(skb)) {
>> +                       total_rx_bytes += size;
>> +                       total_rx_packets++;
>> +                       rx_buffer->pagecnt_bias++;
>> +               } else if (skb) {
>>                         i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
>> -               else if (ring_uses_build_skb(rx_ring))
>> -                       skb = i40e_build_skb(rx_ring, rx_buffer, size);
>> -               else
>> -                       skb = i40e_construct_skb(rx_ring, rx_buffer, size);
>> +               } else if (ring_uses_build_skb(rx_ring)) {
>> +                       skb = i40e_build_skb(rx_ring, &xdp, rx_buffer);
>> +               } else {
>> +                       skb = i40e_construct_skb(rx_ring, &xdp, rx_buffer);
>> +               }
>>
>>                 /* exit if we failed to retrieve a buffer */
>>                 if (!skb) {
>> @@ -2026,18 +2105,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 if (i40e_is_non_eop(rx_ring, rx_desc, skb))
>>                         continue;
>>
>> -               /* ERR_MASK will only have valid bits if EOP set, and
>> -                * what we are doing here is actually checking
>> -                * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
>> -                * the error field
>> -                */
>> -               if (unlikely(i40e_test_staterr(rx_desc, BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
>> -                       dev_kfree_skb_any(skb);
>> -                       skb = NULL;
>> -                       continue;
>> -               }
>> -
>> -               if (i40e_cleanup_headers(rx_ring, skb)) {
>> +               if (i40e_cleanup_headers(rx_ring, rx_desc, skb)) {
>>                         skb = NULL;
>>                         continue;
>>                 }
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index f5de51124cae..31f0b162996f 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -360,6 +360,7 @@ struct i40e_ring {
>>         void *desc;                     /* Descriptor ring memory */
>>         struct device *dev;             /* Used for DMA mapping */
>>         struct net_device *netdev;      /* netdev ring maps to */
>> +       struct bpf_prog *xdp_prog;
>>         union {
>>                 struct i40e_tx_buffer *tx_bi;
>>                 struct i40e_rx_buffer *rx_bi;
>> --
>> 2.11.0
>>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v5 2/2] i40e: add support for XDP_TX action
From: Alexander Duyck @ 2017-05-19 13:40 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170519070856.13900-3-bjorn.topel@gmail.com>

On Fri, May 19, 2017 at 12:08 AM, Bj?rn T?pel <bjorn.topel@gmail.com> wrote:
> From: Bj?rn T?pel <bjorn.topel@intel.com>
>
> This patch adds proper XDP_TX action support. For each Tx ring, an
> additional XDP Tx ring is allocated and setup. This version does the
> DMA mapping in the fast-path, which will penalize performance for
> IOMMU enabled systems. Further, debugfs support is not wired up for
> the XDP Tx rings.
>
> Signed-off-by: Bj?rn T?pel <bjorn.topel@intel.com>

So I think we still have some extra complexity here we can remove. I
called it out down below.

Basic idea is I would like to see us lay out queues Tx, XDP, Rx
instead of Tx, Rx, XDP as it makes it easier for us to just overflow
out of the allocated Tx rings and to process XDP rings.

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h         |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  57 +++++++-
>  drivers/net/ethernet/intel/i40e/i40e_main.c    | 195 +++++++++++++++++++++----
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 119 ++++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  11 ++
>  5 files changed, 350 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index d3195b29d53c..4250ab55a9f1 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -629,6 +629,7 @@ struct i40e_vsi {
>         /* These are containers of ring pointers, allocated at run-time */
>         struct i40e_ring **rx_rings;
>         struct i40e_ring **tx_rings;
> +       struct i40e_ring **xdp_rings; /* XDP Tx rings */
>
>         u32  active_filters;
>         u32  promisc_threshold;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> index 3d58762efbc0..518788e42887 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> @@ -1302,7 +1302,7 @@ static void i40e_get_ringparam(struct net_device *netdev,
>  static int i40e_set_ringparam(struct net_device *netdev,
>                               struct ethtool_ringparam *ring)
>  {
> -       struct i40e_ring *tx_rings = NULL, *rx_rings = NULL;
> +       struct i40e_ring *tx_rings = NULL, *rx_rings = NULL, *xdp_rings = NULL;
>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>         struct i40e_hw *hw = &np->vsi->back->hw;
>         struct i40e_vsi *vsi = np->vsi;
> @@ -1310,6 +1310,7 @@ static int i40e_set_ringparam(struct net_device *netdev,
>         u32 new_rx_count, new_tx_count;
>         int timeout = 50;
>         int i, err = 0;
> +       bool xdp = i40e_enabled_xdp_vsi(vsi);

So this should be moved up to match up with the reverse Christmas tree
layout that I called out in the previous patch.

>
>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>                 return -EINVAL;
> @@ -1345,6 +1346,8 @@ static int i40e_set_ringparam(struct net_device *netdev,
>                 for (i = 0; i < vsi->num_queue_pairs; i++) {
>                         vsi->tx_rings[i]->count = new_tx_count;
>                         vsi->rx_rings[i]->count = new_rx_count;
> +                       if (xdp)
> +                               vsi->xdp_rings[i]->count = new_tx_count;

I would say you can drop the "if (xdp)" line here. If the counts
aren't active updating it should have no impact, but not updating them
can cause issues since the XDP rings can fall out of sync with the Tx
rings.

>                 }
>                 goto done;
>         }
> @@ -1440,6 +1443,41 @@ static int i40e_set_ringparam(struct net_device *netdev,
>                 }
>         }
>
> +       /* alloc updated XDP Tx resources */
> +       if (xdp && new_tx_count != vsi->xdp_rings[0]->count) {
> +               netdev_info(netdev,
> +                           "Changing XDP Tx descriptor count from %d to %d.\n",
> +                           vsi->xdp_rings[0]->count, new_tx_count);
> +               xdp_rings = kcalloc(vsi->alloc_queue_pairs,
> +                                   sizeof(struct i40e_ring), GFP_KERNEL);
> +               if (!xdp_rings) {
> +                       err = -ENOMEM;
> +                       goto free_rx;
> +               }
> +
> +               for (i = 0; i < vsi->num_queue_pairs; i++) {
> +                       /* clone ring and setup updated count */
> +                       xdp_rings[i] = *vsi->xdp_rings[i];
> +                       xdp_rings[i].count = new_tx_count;
> +                       /* the desc and bi pointers will be reallocated in the
> +                        * setup call
> +                        */
> +                       xdp_rings[i].desc = NULL;
> +                       xdp_rings[i].rx_bi = NULL;
> +                       err = i40e_setup_tx_descriptors(&xdp_rings[i]);
> +                       if (err) {
> +                               while (i) {
> +                                       i--;
> +                                       i40e_free_tx_resources(&xdp_rings[i]);
> +                               }
> +                               kfree(xdp_rings);
> +                               xdp_rings = NULL;
> +
> +                               goto free_rx;
> +                       }
> +               }
> +       }
> +

Instead of adding this as a new block it might make sense to make this
a subsection of the Tx ring setup. You could then add a label to jump
to setup Rx if XDP is not supported and could avoid having to perform
the test more than once.

>         /* Bring interface down, copy in the new ring info,
>          * then restore the interface
>          */
> @@ -1474,8 +1512,25 @@ static int i40e_set_ringparam(struct net_device *netdev,
>                 rx_rings = NULL;
>         }
>
> +       if (xdp_rings) {
> +               for (i = 0; i < vsi->num_queue_pairs; i++) {
> +                       i40e_free_tx_resources(vsi->xdp_rings[i]);
> +                       *vsi->xdp_rings[i] = xdp_rings[i];
> +               }
> +               kfree(xdp_rings);
> +               xdp_rings = NULL;
> +       }
> +

I would prefer to see us move this up and handle it between Tx and Rx
instead of after Rx.

>         i40e_up(vsi);
>
> +free_rx:
> +       /* error cleanup if the XDP Tx allocations failed after getting Rx */
> +       if (rx_rings) {
> +               for (i = 0; i < vsi->num_queue_pairs; i++)
> +                       i40e_free_rx_resources(&rx_rings[i]);
> +               kfree(rx_rings);
> +               rx_rings = NULL;
> +       }

With the suggestion I made above I would rather see us freeing XDP
rings if the Rx failed rather than freeing Rx if XDP failed.

>  free_tx:
>         /* error cleanup if the Rx allocations failed after getting Tx */
>         if (tx_rings) {
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 27a29904611a..f1dbcead79ba 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -408,6 +408,27 @@ struct rtnl_link_stats64 *i40e_get_vsi_stats_struct(struct i40e_vsi *vsi)
>  }
>
>  /**
> + * i40e_get_netdev_stats_struct_tx - populate stats from a Tx ring
> + * @ring: Tx ring to get statistics from
> + * @stats: statistics entry to be updated
> + **/
> +static void i40e_get_netdev_stats_struct_tx(struct i40e_ring *ring,
> +                                           struct rtnl_link_stats64 *stats)
> +{
> +       u64 bytes, packets;
> +       unsigned int start;
> +
> +       do {
> +               start = u64_stats_fetch_begin_irq(&ring->syncp);
> +               packets = ring->stats.packets;
> +               bytes   = ring->stats.bytes;
> +       } while (u64_stats_fetch_retry_irq(&ring->syncp, start));
> +
> +       stats->tx_packets += packets;
> +       stats->tx_bytes   += bytes;
> +}
> +
> +/**
>   * i40e_get_netdev_stats_struct - Get statistics for netdev interface
>   * @netdev: network interface device structure
>   *
> @@ -437,15 +458,8 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>                 tx_ring = ACCESS_ONCE(vsi->tx_rings[i]);
>                 if (!tx_ring)
>                         continue;
> +               i40e_get_netdev_stats_struct_tx(tx_ring, stats);
>
> -               do {
> -                       start = u64_stats_fetch_begin_irq(&tx_ring->syncp);
> -                       packets = tx_ring->stats.packets;
> -                       bytes   = tx_ring->stats.bytes;
> -               } while (u64_stats_fetch_retry_irq(&tx_ring->syncp, start));
> -
> -               stats->tx_packets += packets;
> -               stats->tx_bytes   += bytes;
>                 rx_ring = &tx_ring[1];
>
>                 do {
> @@ -456,6 +470,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>
>                 stats->rx_packets += packets;
>                 stats->rx_bytes   += bytes;
> +
> +               if (i40e_enabled_xdp_vsi(vsi))
> +                       i40e_get_netdev_stats_struct_tx(&rx_ring[1], stats);
>         }
>         rcu_read_unlock();
>
> @@ -2802,6 +2819,12 @@ static int i40e_vsi_setup_tx_resources(struct i40e_vsi *vsi)
>         for (i = 0; i < vsi->num_queue_pairs && !err; i++)
>                 err = i40e_setup_tx_descriptors(vsi->tx_rings[i]);
>
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return err;
> +
> +       for (i = 0; i < vsi->num_queue_pairs && !err; i++)
> +               err = i40e_setup_tx_descriptors(vsi->xdp_rings[i]);
> +
>         return err;
>  }
>
> @@ -2815,12 +2838,17 @@ static void i40e_vsi_free_tx_resources(struct i40e_vsi *vsi)
>  {
>         int i;
>
> -       if (!vsi->tx_rings)
> -               return;
> +       if (vsi->tx_rings) {
> +               for (i = 0; i < vsi->num_queue_pairs; i++)
> +                       if (vsi->tx_rings[i] && vsi->tx_rings[i]->desc)
> +                               i40e_free_tx_resources(vsi->tx_rings[i]);
> +       }
>
> -       for (i = 0; i < vsi->num_queue_pairs; i++)
> -               if (vsi->tx_rings[i] && vsi->tx_rings[i]->desc)
> -                       i40e_free_tx_resources(vsi->tx_rings[i]);
> +       if (vsi->xdp_rings) {
> +               for (i = 0; i < vsi->num_queue_pairs; i++)
> +                       if (vsi->xdp_rings[i] && vsi->xdp_rings[i]->desc)
> +                               i40e_free_tx_resources(vsi->xdp_rings[i]);
> +       }
>  }
>
>  /**
> @@ -3081,6 +3109,12 @@ static int i40e_vsi_configure_tx(struct i40e_vsi *vsi)
>         for (i = 0; (i < vsi->num_queue_pairs) && !err; i++)
>                 err = i40e_configure_tx_ring(vsi->tx_rings[i]);
>
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return err;
> +
> +       for (i = 0; (i < vsi->num_queue_pairs) && !err; i++)
> +               err = i40e_configure_tx_ring(vsi->xdp_rings[i]);
> +
>         return err;
>  }
>
> @@ -3230,6 +3264,7 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
>         u16 vector;
>         int i, q;
>         u32 qp;
> +       bool has_xdp = i40e_enabled_xdp_vsi(vsi);

This line should be moved up for the reverse Xmas tree layout.

>
>         /* The interrupt indexing is offset by 1 in the PFINT_ITRn
>          * and PFINT_LNKLSTn registers, e.g.:
> @@ -3256,16 +3291,28 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
>                 wr32(hw, I40E_PFINT_LNKLSTN(vector - 1), qp);
>                 for (q = 0; q < q_vector->num_ringpairs; q++) {
>                         u32 val;
> +                       u32 nextqp = has_xdp ? qp + vsi->alloc_queue_pairs : qp;

Same here. This line should come before "u32 val;" not after.

>
>                         val = I40E_QINT_RQCTL_CAUSE_ENA_MASK |
>                               (I40E_RX_ITR << I40E_QINT_RQCTL_ITR_INDX_SHIFT)  |
>                               (vector      << I40E_QINT_RQCTL_MSIX_INDX_SHIFT) |
> -                             (qp          << I40E_QINT_RQCTL_NEXTQ_INDX_SHIFT)|
> +                             (nextqp      << I40E_QINT_RQCTL_NEXTQ_INDX_SHIFT)|
>                               (I40E_QUEUE_TYPE_TX
>                                       << I40E_QINT_RQCTL_NEXTQ_TYPE_SHIFT);
>
>                         wr32(hw, I40E_QINT_RQCTL(qp), val);
>
> +                       if (has_xdp) {
> +                               val = I40E_QINT_TQCTL_CAUSE_ENA_MASK |
> +                                     (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT)  |
> +                                     (vector      << I40E_QINT_TQCTL_MSIX_INDX_SHIFT) |
> +                                     (qp          << I40E_QINT_TQCTL_NEXTQ_INDX_SHIFT)|
> +                                     (I40E_QUEUE_TYPE_TX
> +                                      << I40E_QINT_TQCTL_NEXTQ_TYPE_SHIFT);
> +
> +                               wr32(hw, I40E_QINT_TQCTL(nextqp), val);
> +                       }
> +
>                         val = I40E_QINT_TQCTL_CAUSE_ENA_MASK |
>                               (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT)  |
>                               (vector      << I40E_QINT_TQCTL_MSIX_INDX_SHIFT) |
> @@ -3334,6 +3381,7 @@ static void i40e_configure_msi_and_legacy(struct i40e_vsi *vsi)
>         struct i40e_pf *pf = vsi->back;
>         struct i40e_hw *hw = &pf->hw;
>         u32 val;
> +       u32 nextqp = i40e_enabled_xdp_vsi(vsi) ? vsi->alloc_queue_pairs : 0;

Same here. The line should be moved up to the point that it has a
longer variable declaration in front of it. If there are no longer
variable declarations it should be the top.

>         /* set the ITR configuration */
>         q_vector->itr_countdown = ITR_COUNTDOWN_START;
> @@ -3350,12 +3398,22 @@ static void i40e_configure_msi_and_legacy(struct i40e_vsi *vsi)
>         wr32(hw, I40E_PFINT_LNKLST0, 0);
>
>         /* Associate the queue pair to the vector and enable the queue int */
> -       val = I40E_QINT_RQCTL_CAUSE_ENA_MASK                  |
> -             (I40E_RX_ITR << I40E_QINT_RQCTL_ITR_INDX_SHIFT) |
> +       val = I40E_QINT_RQCTL_CAUSE_ENA_MASK                   |
> +             (I40E_RX_ITR << I40E_QINT_RQCTL_ITR_INDX_SHIFT)  |
> +             (nextqp      << I40E_QINT_RQCTL_NEXTQ_INDX_SHIFT)|
>               (I40E_QUEUE_TYPE_TX << I40E_QINT_TQCTL_NEXTQ_TYPE_SHIFT);
>
>         wr32(hw, I40E_QINT_RQCTL(0), val);
>
> +       if (i40e_enabled_xdp_vsi(vsi)) {
> +               val = I40E_QINT_TQCTL_CAUSE_ENA_MASK                 |
> +                     (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT)|
> +                     (I40E_QUEUE_TYPE_TX
> +                      << I40E_QINT_TQCTL_NEXTQ_TYPE_SHIFT);
> +
> +              wr32(hw, I40E_QINT_TQCTL(nextqp), val);
> +       }
> +
>         val = I40E_QINT_TQCTL_CAUSE_ENA_MASK                  |
>               (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT) |
>               (I40E_QUEUE_END_OF_LIST << I40E_QINT_TQCTL_NEXTQ_INDX_SHIFT);
> @@ -3522,6 +3580,9 @@ static void i40e_vsi_disable_irq(struct i40e_vsi *vsi)
>         for (i = 0; i < vsi->num_queue_pairs; i++) {
>                 wr32(hw, I40E_QINT_TQCTL(vsi->tx_rings[i]->reg_idx), 0);
>                 wr32(hw, I40E_QINT_RQCTL(vsi->rx_rings[i]->reg_idx), 0);
> +               if (!i40e_enabled_xdp_vsi(vsi))
> +                       continue;
> +               wr32(hw, I40E_QINT_TQCTL(vsi->xdp_rings[i]->reg_idx), 0);
>         }
>
>         if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
> @@ -3818,12 +3879,22 @@ static void i40e_map_vector_to_qp(struct i40e_vsi *vsi, int v_idx, int qp_idx)
>         struct i40e_q_vector *q_vector = vsi->q_vectors[v_idx];
>         struct i40e_ring *tx_ring = vsi->tx_rings[qp_idx];
>         struct i40e_ring *rx_ring = vsi->rx_rings[qp_idx];
> +       struct i40e_ring *xdp_ring = i40e_enabled_xdp_vsi(vsi) ?
> +                                    vsi->xdp_rings[qp_idx] : NULL;

What you might try doing is pulling this line out and actually drop it
into the if statement below. You could replace the xdp_ring check with
the i40e_enabled_xdp_vsi(vsi) check. Then it makes this a bit more
readable as the XDP setup below becomes a mini version of the
i40e_map_vector_to_qp function. Also it avoids the reverse xmas tree
issues as you could initialize the xdp_ring pointer inside of the if
statement.

>
>         tx_ring->q_vector = q_vector;
>         tx_ring->next = q_vector->tx.ring;
>         q_vector->tx.ring = tx_ring;
>         q_vector->tx.count++;
>
> +       /* Place XDP Tx ring in the same q_vector ring list as regular Tx */
> +       if (xdp_ring) {
> +               xdp_ring->q_vector = q_vector;
> +               xdp_ring->next = q_vector->tx.ring;
> +               q_vector->tx.ring = xdp_ring;
> +               q_vector->tx.count++;
> +       }
> +
>         rx_ring->q_vector = q_vector;
>         rx_ring->next = q_vector->rx.ring;
>         q_vector->rx.ring = rx_ring;
> @@ -4026,6 +4097,23 @@ static int i40e_vsi_control_tx(struct i40e_vsi *vsi, bool enable)
>                 }
>         }
>
> +       if (ret || !i40e_enabled_xdp_vsi(vsi))
> +               return ret;
> +
> +       pf_q = vsi->base_queue + vsi->alloc_queue_pairs;
> +       for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
> +               i40e_control_tx_q(pf, pf_q, enable);
> +
> +               /* wait for the change to finish */
> +               ret = i40e_pf_txq_wait(pf, pf_q, enable);
> +               if (ret) {
> +                       dev_info(&pf->pdev->dev,
> +                                "VSI seid %d XDP Tx ring %d %sable timeout\n",
> +                                vsi->seid, pf_q, (enable ? "en" : "dis"));
> +                       break;
> +               }
> +       }
> +

It might work better to place this inside the original loop in this
function. What you could do is just add another variable inside the
original loop in this function that could be pf_q +
vsi->alloc_queue_pairs.

>         return ret;
>  }
>
> @@ -4535,7 +4623,7 @@ int i40e_vsi_wait_queues_disabled(struct i40e_vsi *vsi)
>                                  vsi->seid, pf_q);
>                         return ret;
>                 }
> -               /* Check and wait for the Tx queue */
> +               /* Check and wait for the Rx queue */
>                 ret = i40e_pf_rxq_wait(pf, pf_q, false);
>                 if (ret) {
>                         dev_info(&pf->pdev->dev,
> @@ -4545,6 +4633,21 @@ int i40e_vsi_wait_queues_disabled(struct i40e_vsi *vsi)
>                 }
>         }
>
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return 0;
> +
> +       pf_q = vsi->base_queue + vsi->alloc_queue_pairs;
> +       for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
> +               /* Check and wait for the XDP Tx queue */
> +               ret = i40e_pf_txq_wait(pf, pf_q, false);
> +               if (ret) {
> +                       dev_info(&pf->pdev->dev,
> +                                "VSI seid %d XDP Tx ring %d disable timeout\n",
> +                                vsi->seid, pf_q);
> +                       return ret;
> +               }
> +       }
> +

This would make more sense inside of the original loop instead of
being added as an extra bit.

>         return 0;
>  }
>
> @@ -5454,6 +5557,8 @@ void i40e_down(struct i40e_vsi *vsi)
>
>         for (i = 0; i < vsi->num_queue_pairs; i++) {
>                 i40e_clean_tx_ring(vsi->tx_rings[i]);
> +               if (i40e_enabled_xdp_vsi(vsi))
> +                       i40e_clean_tx_ring(vsi->xdp_rings[i]);
>                 i40e_clean_rx_ring(vsi->rx_rings[i]);
>         }
>
> @@ -7475,6 +7580,7 @@ static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
>         switch (vsi->type) {
>         case I40E_VSI_MAIN:
>                 vsi->alloc_queue_pairs = pf->num_lan_qps;
> +

Looks like you somehow added a extra empty line here. This can be dropped.

>                 vsi->num_desc = ALIGN(I40E_DEFAULT_NUM_DESCRIPTORS,
>                                       I40E_REQ_DESCRIPTOR_MULTIPLE);
>                 if (pf->flags & I40E_FLAG_MSIX_ENABLED)
> @@ -7524,13 +7630,17 @@ static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
>  {
>         int size;
>         int ret = 0;
> +       int nrings;
>
> -       /* allocate memory for both Tx and Rx ring pointers */
> -       size = sizeof(struct i40e_ring *) * vsi->alloc_queue_pairs * 2;
> +       /* allocate memory for both Tx, Rx and XDP Tx ring pointers */
> +       nrings = i40e_enabled_xdp_vsi(vsi) ? 3 : 2;
> +       size = sizeof(struct i40e_ring *) * vsi->alloc_queue_pairs * nrings;
>         vsi->tx_rings = kzalloc(size, GFP_KERNEL);
>         if (!vsi->tx_rings)
>                 return -ENOMEM;
>         vsi->rx_rings = &vsi->tx_rings[vsi->alloc_queue_pairs];
> +       if (i40e_enabled_xdp_vsi(vsi))
> +               vsi->xdp_rings = &vsi->rx_rings[vsi->alloc_queue_pairs];

I have a general thought here. Why not look at placing the xdp_rings
in the slot immediately after the Tx rings and before the Rx rings?
Doing that may help to simplify some of the code as you could then
just loop through alloc_queue_pairs * 2 in order to handle all of the
XDP rings from the Tx ring array pointer.

>         if (alloc_qvectors) {
>                 /* allocate memory for q_vector pointers */
> @@ -7650,6 +7760,7 @@ static void i40e_vsi_free_arrays(struct i40e_vsi *vsi, bool free_qvectors)
>         kfree(vsi->tx_rings);
>         vsi->tx_rings = NULL;
>         vsi->rx_rings = NULL;
> +       vsi->xdp_rings = NULL;
>  }
>
>  /**
> @@ -7733,6 +7844,8 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
>                         kfree_rcu(vsi->tx_rings[i], rcu);
>                         vsi->tx_rings[i] = NULL;
>                         vsi->rx_rings[i] = NULL;
> +                       if (vsi->xdp_rings)
> +                               vsi->xdp_rings[i] = NULL;
>                 }
>         }
>  }
> @@ -7743,14 +7856,15 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
>   **/
>  static int i40e_alloc_rings(struct i40e_vsi *vsi)
>  {
> -       struct i40e_ring *tx_ring, *rx_ring;
> +       struct i40e_ring *tx_ring, *rx_ring, *xdp_ring;

Instead of having multiple ring pointers we could probably simplify
this by just using one generic ring pointer since it doesn't matter if
it is really a Tx, Rx, or XDP ring. It simplifies the code below a
bit.

>         struct i40e_pf *pf = vsi->back;
> -       int i;
> +       int i, nr;
>
> +       nr = i40e_enabled_xdp_vsi(vsi) ? 3 : 2;
>         /* Set basic values in the rings to be used later during open() */
>         for (i = 0; i < vsi->alloc_queue_pairs; i++) {
>                 /* allocate space for both Tx and Rx in one shot */
> -               tx_ring = kzalloc(sizeof(struct i40e_ring) * 2, GFP_KERNEL);
> +               tx_ring = kcalloc(nr, sizeof(*tx_ring), GFP_KERNEL);

I would prefer it if we stayed with "sizeof(struct i40e_ring)" instead
of using "sizeof(*tx_ring)". I think it is more readable that way.

Also instead of using nr it might be more readable to use something
like "q_per_vector" or qpv just so it is clear that we are allocating
either 2 or 3 queues per vector.

>                 if (!tx_ring)
>                         goto err_out;
>
> @@ -7780,6 +7894,26 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
>                 rx_ring->dcb_tc = 0;
>                 rx_ring->rx_itr_setting = pf->rx_itr_default;
>                 vsi->rx_rings[i] = rx_ring;
> +
> +               if (!i40e_enabled_xdp_vsi(vsi))
> +                       continue;
> +
> +               xdp_ring = &rx_ring[1];

So for example instead of doing this you could just call ring++ and
update the the correct pointer. It also makes it easier to move this
up to be processed before the Rx ring.

> +               xdp_ring->queue_index = vsi->alloc_queue_pairs + i;
> +               xdp_ring->reg_idx = vsi->base_queue +
> +                                   vsi->alloc_queue_pairs + i;

This could be redefined as "xdp_ring->reg_idx = vsi->alloc_queue_pairs
+ xpd_ring->queue_index".

> +               xdp_ring->ring_active = false;
> +               xdp_ring->vsi = vsi;
> +               xdp_ring->netdev = NULL;
> +               xdp_ring->dev = &pf->pdev->dev;
> +               xdp_ring->count = vsi->num_desc;
> +               xdp_ring->size = 0;
> +               xdp_ring->dcb_tc = 0;
> +               if (vsi->back->flags & I40E_FLAG_WB_ON_ITR_CAPABLE)
> +                       xdp_ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
> +               set_ring_xdp(xdp_ring);
> +               xdp_ring->tx_itr_setting = pf->tx_itr_default;
> +               vsi->xdp_rings[i] = xdp_ring;

So I would really like to see this whole block moved up to before the
Rx configuration. That way we have all the Tx configured before we get
to the Rx.

>         }
>
>         return 0;
> @@ -9988,6 +10122,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
>         struct i40e_pf *pf;
>         u8 enabled_tc;
>         int ret;
> +       u16 alloc_queue_pairs;

Needs to be moved up for reverse Christmas tree layout.

>
>         if (!vsi)
>                 return NULL;
> @@ -10003,11 +10138,14 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
>         if (ret)
>                 goto err_vsi;
>
> -       ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs, vsi->idx);
> +       alloc_queue_pairs = vsi->alloc_queue_pairs *
> +                           (i40e_enabled_xdp_vsi(vsi) ? 2 : 1);
> +
> +       ret = i40e_get_lump(pf, pf->qp_pile, alloc_queue_pairs, vsi->idx);
>         if (ret < 0) {
>                 dev_info(&pf->pdev->dev,
>                          "failed to get tracking for %d queues for VSI %d err %d\n",
> -                        vsi->alloc_queue_pairs, vsi->seid, ret);
> +                        alloc_queue_pairs, vsi->seid, ret);
>                 goto err_vsi;
>         }
>         vsi->base_queue = ret;
> @@ -10065,6 +10203,7 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
>         struct i40e_veb *veb = NULL;
>         int ret, i;
>         int v_idx;
> +       u16 alloc_queue_pairs;
>
>         /* The requested uplink_seid must be either
>          *     - the PF's port seid
> @@ -10150,12 +10289,14 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
>         else if (type == I40E_VSI_SRIOV)
>                 vsi->vf_id = param1;
>         /* assign it some queues */
> -       ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs,
> -                               vsi->idx);
> +       alloc_queue_pairs = vsi->alloc_queue_pairs *
> +                           (i40e_enabled_xdp_vsi(vsi) ? 2 : 1);
> +
> +       ret = i40e_get_lump(pf, pf->qp_pile, alloc_queue_pairs, vsi->idx);
>         if (ret < 0) {
>                 dev_info(&pf->pdev->dev,
>                          "failed to get tracking for %d queues for VSI %d err=%d\n",
> -                        vsi->alloc_queue_pairs, vsi->seid, ret);
> +                        alloc_queue_pairs, vsi->seid, ret);
>                 goto err_vsi;
>         }
>         vsi->base_queue = ret;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 0c2f0230faf4..c025e517f7e5 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -630,6 +630,8 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
>         if (tx_buffer->skb) {
>                 if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
>                         kfree(tx_buffer->raw_buf);
> +               else if (ring_is_xdp(ring))
> +                       page_frag_free(tx_buffer->raw_buf);
>                 else
>                         dev_kfree_skb_any(tx_buffer->skb);
>                 if (dma_unmap_len(tx_buffer, len))
> @@ -771,8 +773,11 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>                 total_bytes += tx_buf->bytecount;
>                 total_packets += tx_buf->gso_segs;
>
> -               /* free the skb */
> -               napi_consume_skb(tx_buf->skb, napi_budget);
> +               /* free the skb/XDP data */
> +               if (ring_is_xdp(tx_ring))
> +                       page_frag_free(tx_buf->raw_buf);
> +               else
> +                       napi_consume_skb(tx_buf->skb, napi_budget);
>
>                 /* unmap skb header data */
>                 dma_unmap_single(tx_ring->dev,
> @@ -848,6 +853,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>                         tx_ring->arm_wb = true;
>         }
>
> +       if (ring_is_xdp(tx_ring))
> +               return !!budget;
> +
>         /* notify netdev of completed buffers */
>         netdev_tx_completed_queue(txring_txq(tx_ring),
>                                   total_packets, total_bytes);
> @@ -1969,6 +1977,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>
>  #define I40E_XDP_PASS 0
>  #define I40E_XDP_CONSUMED 1
> +#define I40E_XDP_TX 2
> +
> +static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
> +                             struct i40e_ring *xdp_ring);
>
>  /**
>   * i40e_run_xdp - run an XDP program
> @@ -1981,6 +1993,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>         int result = I40E_XDP_PASS;
>         struct bpf_prog *xdp_prog;
>         u32 act;
> +       struct i40e_ring *xdp_ring;

This should be moved up for reverse christmas tree.

>
>         rcu_read_lock();
>         xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> @@ -1989,12 +2002,16 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>                 goto xdp_out;
>
>         act = bpf_prog_run_xdp(xdp_prog, xdp);
> +

This white space should either have been added in patch 1 or can be
dropped here.

>         switch (act) {
>         case XDP_PASS:
>                 break;
> +       case XDP_TX:
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +               result = i40e_xmit_xdp_ring(xdp, xdp_ring);
> +               break;
>         default:
>                 bpf_warn_invalid_xdp_action(act);
> -       case XDP_TX:
>         case XDP_ABORTED:
>                 trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
>                 /* fallthrough -- handle aborts by dropping packet */
> @@ -2008,6 +2025,27 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>  }
>
>  /**
> + * i40e_rx_buffer_flip - adjusted rx_buffer to point to an unused region
> + * @rx_ring: Rx ring
> + * @rx_buffer: Rx buffer to adjust
> + * @size: Size of adjustment
> + **/
> +static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
> +                               struct i40e_rx_buffer *rx_buffer,
> +                               unsigned int size)
> +{
> +#if (PAGE_SIZE < 8192)
> +       unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
> +
> +       rx_buffer->page_offset ^= truesize;
> +#else
> +       unsigned int truesize = SKB_DATA_ALIGN(i40e_rx_offset(rx_ring) + size);
> +
> +       rx_buffer->page_offset += truesize;
> +#endif
> +}
> +
> +/**
>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @rx_ring: rx descriptor ring to transact packets on
>   * @budget: Total limit on number of packets to process
> @@ -2024,7 +2062,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         struct sk_buff *skb = rx_ring->skb;
>         u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> -       bool failure = false;
> +       bool failure = false, xdp_xmit = false;
>
>         while (likely(total_rx_packets < budget)) {
>                 struct i40e_rx_buffer *rx_buffer;
> @@ -2081,9 +2119,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 }
>
>                 if (IS_ERR(skb)) {
> +                       if (PTR_ERR(skb) == -I40E_XDP_TX) {
> +                               xdp_xmit = true;
> +                               i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
> +                       } else {
> +                               rx_buffer->pagecnt_bias++;
> +                       }
>                         total_rx_bytes += size;
>                         total_rx_packets++;
> -                       rx_buffer->pagecnt_bias++;
>                 } else if (skb) {
>                         i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
>                 } else if (ring_uses_build_skb(rx_ring)) {
> @@ -2131,6 +2174,19 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 total_rx_packets++;
>         }
>
> +       if (xdp_xmit) {
> +               struct i40e_ring *xdp_ring;
> +
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +               /* Force memory writes to complete before letting h/w
> +                * know there are new descriptors to fetch.
> +                */
> +               wmb();
> +
> +               writel(xdp_ring->next_to_use, xdp_ring->tail);
> +       }
> +
>         rx_ring->skb = skb;
>
>         u64_stats_update_begin(&rx_ring->syncp);
> @@ -3188,6 +3244,59 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
>  }
>
>  /**
> + * i40e_xmit_xdp_ring - transmits an XDP buffer to an XDP Tx ring
> + * @xdp: data to transmit
> + * @xdp_ring: XDP Tx ring
> + **/
> +static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
> +                             struct i40e_ring *xdp_ring)
> +{
> +       struct i40e_tx_buffer *tx_bi;
> +       struct i40e_tx_desc *tx_desc;
> +       u16 i = xdp_ring->next_to_use;
> +       dma_addr_t dma;
> +       u32 size = xdp->data_end - xdp->data;
> +
> +       if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> +               xdp_ring->tx_stats.tx_busy++;
> +               return I40E_XDP_CONSUMED;
> +       }
> +
> +       dma = dma_map_single(xdp_ring->dev, xdp->data, size, DMA_TO_DEVICE);
> +       if (dma_mapping_error(xdp_ring->dev, dma))
> +               return I40E_XDP_CONSUMED;
> +
> +       tx_bi = &xdp_ring->tx_bi[i];
> +       tx_bi->bytecount = size;
> +       tx_bi->gso_segs = 1;
> +       tx_bi->raw_buf = xdp->data;
> +
> +       /* record length, and DMA address */
> +       dma_unmap_len_set(tx_bi, len, size);
> +       dma_unmap_addr_set(tx_bi, dma, dma);
> +
> +       tx_desc = I40E_TX_DESC(xdp_ring, i);
> +       tx_desc->buffer_addr = cpu_to_le64(dma);
> +       tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
> +                                                 | I40E_TXD_CMD,
> +                                                 0, size, 0);
> +
> +       /* Make certain all of the status bits have been updated
> +        * before next_to_watch is written.
> +        */
> +       smp_wmb();
> +
> +       i++;
> +       if (i == xdp_ring->count)
> +               i = 0;
> +
> +       tx_bi->next_to_watch = tx_desc;
> +       xdp_ring->next_to_use = i;
> +
> +       return I40E_XDP_TX;
> +}
> +
> +/**
>   * i40e_xmit_frame_ring - Sends buffer on Tx ring
>   * @skb:     send buffer
>   * @tx_ring: ring to send buffer on
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 31f0b162996f..b288d58313a6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -396,6 +396,7 @@ struct i40e_ring {
>         u16 flags;
>  #define I40E_TXR_FLAGS_WB_ON_ITR               BIT(0)
>  #define I40E_RXR_FLAGS_BUILD_SKB_ENABLED       BIT(1)
> +#define I40E_TXR_FLAGS_XDP                     BIT(2)
>
>         /* stats structs */
>         struct i40e_queue_stats stats;
> @@ -438,6 +439,16 @@ static inline void clear_ring_build_skb_enabled(struct i40e_ring *ring)
>         ring->flags &= ~I40E_RXR_FLAGS_BUILD_SKB_ENABLED;
>  }
>
> +static inline bool ring_is_xdp(struct i40e_ring *ring)
> +{
> +       return !!(ring->flags & I40E_TXR_FLAGS_XDP);
> +}
> +
> +static inline void set_ring_xdp(struct i40e_ring *ring)
> +{
> +       ring->flags |= I40E_TXR_FLAGS_XDP;
> +}
> +
>  enum i40e_latency_range {
>         I40E_LOWEST_LATENCY = 0,
>         I40E_LOW_LATENCY = 1,
> --
> 2.11.0
>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v5 0/2] i40e: support for XDP
From: Alexander Duyck @ 2017-05-19 13:55 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170519070856.13900-1-bjorn.topel@gmail.com>

On Fri, May 19, 2017 at 12:08 AM, Bj?rn T?pel <bjorn.topel@gmail.com> wrote:
> From: Bj?rn T?pel <bjorn.topel@intel.com>
>
> This series adds XDP support for i40e-based NICs.
>
> The first patch wires up ndo_xdp and implements XDP_DROP semantics for
> all actions. The second patch adds egress support via the XDP_TX
> action.
>
> Performance numbers (40GbE port, 64B packets) for xdp1 and xdp2
> programs, from samples/bpf/:
>
>  IOMMU                      | xdp1      | xdp2
>  ---------------------------+-----------+-----------
>  iommu=off                  | 29.7 Mpps | 17.1 Mpps
>  iommu=pt intel_iommu=on    | 29.7 Mpps | 11.6 Mpps
>  iommu=on intel_iommu=on    | 21.8 Mpps |  3.7 Mpps

These numbers look pretty good. I wouldn't expect us to have much in
the way of performance with iommu enabled, and the iommu=off numbers
are about 20Gb/s for xdp1, and better than 10Gb/s for xdp2 so this is
a good starting point. I'm assuming this is a single queue throughput
test?

> Future improvements, not covered by the patches:
>   * Egress: Create the iova mappings upfront
>     (DMA_BIDIRECTIONAL/dma_sync_*), instead of creating a new iova
>     mapping in the transmit fast-path. This will improve performance
>     for the IOMMU-enabled case.

The problem with using DMA_BIDIRECTIONAL is that there are scenarios
where it makes DMA more expensive in general as we have to then push
the data every time we do a sync for CPU. If you take a look at the
swiotlb code it will give you an idea of what I am talking about.

Also when we start supporting redirection the DMA_BIDIRECTIONAL won't
be useful unless you want to map a buffer for multiple devices
simultaneously.

>   * Proper debugfs support.
>   * i40evf support.
>
> Thanks to Alex, Daniel, John and Scott for all feedback!
>
> v5:
>   * Aligned the implementation to ixgbe's XDP support: naming, favor
>     xchg instead of RCU semantics
>   * Support for XDP headroom (piggybacking on Alex' build_skb work)
>   * Added xdp tracepoints for exception states (as suggested by
>     Daniel)
>
> v4:
>   * Removed unused i40e_page_is_reserved function
>   * Prior running the XDP program, set the struct xdp_buff
>     data_hard_start member
>
> v3:
>   * Rebased patch set on Jeff's dev-queue branch
>   * MSI-X is no longer a prerequisite for XDP
>   * RCU locking for the XDP program and XDP_RX support is introduced
>     in the same patch
>   * Rx bytes is now bumped for XDP
>   * Removed pointer-to-pointer clunkiness
>   * Added comments to XDP preconditions in ndo_xdp
>   * When a non-EOF is received, log once, and drop the frame
>
> v2:
>   * Fixed kbuild error for PAGE_SIZE >= 8192.
>   * Renamed i40e_try_flip_rx_page to i40e_can_reuse_rx_page, which is
>     more in line to the other Intel Ethernet drivers (igb/fm10k).
>   * Validate xdp_adjust_head support in ndo_xdp/XDP_SETUP_PROG.
>
> Bj?rn T?pel (2):
>   i40e: add XDP support for pass and drop actions
>   i40e: add support for XDP_TX action
>
>  drivers/net/ethernet/intel/i40e/i40e.h         |   8 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  57 +++++-
>  drivers/net/ethernet/intel/i40e/i40e_main.c    | 270 ++++++++++++++++++++++---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 245 ++++++++++++++++++----
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  12 ++
>  5 files changed, 530 insertions(+), 62 deletions(-)
>
> --
> 2.11.0
>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v5 0/2] i40e: support for XDP
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= @ 2017-05-19 14:45 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <CAKgT0UeuQRLXbVqA+fKjzAycQXek8+bWUq_1WvXWFWJAc0q2MA@mail.gmail.com>

2017-05-19 15:55 GMT+02:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Fri, May 19, 2017 at 12:08 AM, Bj?rn T?pel <bjorn.topel@gmail.com> wrote:
>> From: Bj?rn T?pel <bjorn.topel@intel.com>
>>
>> This series adds XDP support for i40e-based NICs.
>>
>> The first patch wires up ndo_xdp and implements XDP_DROP semantics for
>> all actions. The second patch adds egress support via the XDP_TX
>> action.
>>
>> Performance numbers (40GbE port, 64B packets) for xdp1 and xdp2
>> programs, from samples/bpf/:
>>
>>  IOMMU                      | xdp1      | xdp2
>>  ---------------------------+-----------+-----------
>>  iommu=off                  | 29.7 Mpps | 17.1 Mpps
>>  iommu=pt intel_iommu=on    | 29.7 Mpps | 11.6 Mpps
>>  iommu=on intel_iommu=on    | 21.8 Mpps |  3.7 Mpps
>
> These numbers look pretty good. I wouldn't expect us to have much in
> the way of performance with iommu enabled, and the iommu=off numbers
> are about 20Gb/s for xdp1, and better than 10Gb/s for xdp2 so this is
> a good starting point. I'm assuming this is a single queue throughput
> test?

Correct, single queue/one core!

>
>> Future improvements, not covered by the patches:
>>   * Egress: Create the iova mappings upfront
>>     (DMA_BIDIRECTIONAL/dma_sync_*), instead of creating a new iova
>>     mapping in the transmit fast-path. This will improve performance
>>     for the IOMMU-enabled case.
>
> The problem with using DMA_BIDIRECTIONAL is that there are scenarios
> where it makes DMA more expensive in general as we have to then push
> the data every time we do a sync for CPU. If you take a look at the
> swiotlb code it will give you an idea of what I am talking about.
>
> Also when we start supporting redirection the DMA_BIDIRECTIONAL won't
> be useful unless you want to map a buffer for multiple devices
> simultaneously.
>
>>   * Proper debugfs support.
>>   * i40evf support.
>>
>> Thanks to Alex, Daniel, John and Scott for all feedback!
>>
>> v5:
>>   * Aligned the implementation to ixgbe's XDP support: naming, favor
>>     xchg instead of RCU semantics
>>   * Support for XDP headroom (piggybacking on Alex' build_skb work)
>>   * Added xdp tracepoints for exception states (as suggested by
>>     Daniel)
>>
>> v4:
>>   * Removed unused i40e_page_is_reserved function
>>   * Prior running the XDP program, set the struct xdp_buff
>>     data_hard_start member
>>
>> v3:
>>   * Rebased patch set on Jeff's dev-queue branch
>>   * MSI-X is no longer a prerequisite for XDP
>>   * RCU locking for the XDP program and XDP_RX support is introduced
>>     in the same patch
>>   * Rx bytes is now bumped for XDP
>>   * Removed pointer-to-pointer clunkiness
>>   * Added comments to XDP preconditions in ndo_xdp
>>   * When a non-EOF is received, log once, and drop the frame
>>
>> v2:
>>   * Fixed kbuild error for PAGE_SIZE >= 8192.
>>   * Renamed i40e_try_flip_rx_page to i40e_can_reuse_rx_page, which is
>>     more in line to the other Intel Ethernet drivers (igb/fm10k).
>>   * Validate xdp_adjust_head support in ndo_xdp/XDP_SETUP_PROG.
>>
>> Bj?rn T?pel (2):
>>   i40e: add XDP support for pass and drop actions
>>   i40e: add support for XDP_TX action
>>
>>  drivers/net/ethernet/intel/i40e/i40e.h         |   8 +
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  57 +++++-
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    | 270 ++++++++++++++++++++++---
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 245 ++++++++++++++++++----
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  12 ++
>>  5 files changed, 530 insertions(+), 62 deletions(-)
>>
>> --
>> 2.11.0
>>

^ permalink raw reply

* [Intel-wired-lan] [PATCH] e1000e: use disable_hardirq() also for MSIX vectors in e1000_netpoll()
From: Cong Wang @ 2017-05-19 18:34 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149517832870.37403.16008624383220981105.stgit@buzz>

On Fri, May 19, 2017 at 12:18 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> Replace disable_irq() which waits for threaded irq handlers with
> disable_hardirq() which waits only for hardirq part.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Fixes: 311191297125 ("e1000: use disable_hardirq() for e1000_netpoll()")

Thomas had a similar patch, I don't know why he never sends it
out formally. Anyway,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* [Intel-wired-lan] [PATCH v2] e1000e: Don't return uninitialized stats
From: Brown, Aaron F @ 2017-05-19 21:12 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <1495181811.2376.0.camel@intel.com>

> From: Kirsher, Jeffrey T
> Sent: Friday, May 19, 2017 1:17 AM
> To: David Miller <davem@davemloft.net>; bpoirier at suse.com
> Cc: s.priebe at profihost.ag; intel-wired-lan at lists.osuosl.org;
> netdev at vger.kernel.org; pmenzel at molgen.mpg.de; Neftin, Sasha
> <sasha.neftin@intel.com>; Brown, Aaron F <aaron.f.brown@intel.com>;
> stephen at networkplumber.org
> Subject: Re: [PATCH v2] e1000e: Don't return uninitialized stats
> 
> On Thu, 2017-05-18 at 10:46 -0400, David Miller wrote:
> > From: Benjamin Poirier <bpoirier@suse.com>
> > Date: Wed, 17 May 2017 16:24:13 -0400
> >
> > > Some statistics passed to ethtool are garbage because
> > > e1000e_get_stats64()
> > > doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
> > > memory to userspace and confuses users.
> > >
> > > Do like ixgbe and use dev_get_stats() which first zeroes out
> > > rtnl_link_stats64.
> > >
> > > Fixes: 5944701df90d ("net: remove useless memset's in drivers
> > > get_stats64")
> > > Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> > > Signed-off-by: Benjamin Poirier <bpoirier@suse.com>

Tested-by: Aaron Brown <aaron.f.brown@intel.com>

^ permalink raw reply

* [Intel-wired-lan] [PATCH] ixgbe: prevent driver configuration changes while XDP is loaded
From: Singh, Krishneil K @ 2017-05-19 22:16 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170502050634.7459.50968.stgit@john-Precision-Tower-5810>




-----Original Message-----
From: Intel-wired-lan [mailto:intel-wired-lan-bounces at lists.osuosl.org] On Behalf Of John Fastabend
Sent: Monday, May 1, 2017 10:07 PM
To: intel-wired-lan@lists.osuosl.org
Subject: [Intel-wired-lan] [PATCH] ixgbe: prevent driver configuration changes while XDP is loaded

XDP checks to ensure the MTU is valid and LRO is disabled when it is loaded. But user configuration after XDP is loaded could change these and cause a misconfiguration.

This patch adds checks to ensure config changes are valid.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

With this patch applied, we are seeing issue where setting initial MTU >= 1515 fails with Error: Setting MTU > 1536 with XDP is not supported, but setting MTU to 1514  , we can now set MTU till 3050. If we set anything greater than 3050 it errors out with the following message: Setting MTU > 3072 with XDP is not supported.

Is there a way to see if XDP is enabled and is XDP is supposed to be enabled by default on ixgbe ? only info about XDP seen on platform is in dmesg as follows. 

[    3.267909] ixgbe 0000:04:00.0: Multiqueue Enabled: Rx Queue count = 12, Tx Queue count = 12 XDP Queue count = 0
[    3.724686] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 12, Tx Queue count = 12 XDP Queue count = 0
[    4.186271] ixgbe 0000:06:00.0: Multiqueue Enabled: Rx Queue count = 12, Tx Queue count = 12 XDP Queue count = 0
[    4.644388] ixgbe 0000:06:00.1: Multiqueue Enabled: Rx Queue count = 12, Tx Queue count = 12 XDP Queue count = 0

When using latest iproute2 tool to disable xdp ( ip link set ethX xdp off) , there is no error reported by iproute2 and nothing is reported in dmesg but we were still unable to set MTU > = 3051.

^ permalink raw reply

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: John Fastabend @ 2017-05-19 22:30 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

On 05/19/2017 05:58 PM, Amritha Nambiar wrote:
> The following series introduces a new harware offload mode in
> tc/mqprio where the TCs, the queue configurations and bandwidth rate
> limits are offloaded to the hardware. The i40e driver enables the new
> mqprio hardware offload mechanism factoring the TCs, queue
> configuration and bandwidth rates by creating HW channel VSIs.
> 

nice work, fix your time stamp and line wrapping though.


> In this mode, the priority to traffic class mapping and the user
> specified queue ranges are used to configure the traffic class when
> the 'hw' option is set to 2. This is achieved by creating HW
> channels(VSI). A new channel is created for each of the traffic class
> configuration offloaded via mqprio framework except for the first TC
> (TC0) which is for the main VSI. TC0 for the main VSI is also
> reconfigured as per user provided queue parameters. Finally,
> bandwidth rate limits are set on these traffic classes through the
> mqprio offload framework by sending these rates in addition to the
> number of TCs and the queue configurations.
> Example:
> # tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
>   queues 4 at 0 4 at 4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2
> 
> To dump the bandwidth rates:
> 
> # tc qdisc show dev eth0
>   qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
>                queues:(0:3) (4:7)
>                min rates:0bit 0bit
>                max rates:55Mbit 60Mbit
> 

Looks reasonable to me thanks. Previously, rate limits were being set
via dcbnl but I guess this interface is slightly nicer in that it puts
all configuration in one spot. IMO it would be nice to push dcbnl users
over to this.

Thanks,
.John

^ permalink raw reply

* [Intel-wired-lan] [PATCH] ixgbe: Extend firmware version support
From: Greenwalt, Paul @ 2017-05-19 23:54 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <1494926592-53913-1-git-send-email-paul.greenwalt@intel.com>

NACK

FW version EEPROM blocks definitions may change, so I'll submit a new patch once this has been resolved.

-----Original Message-----
From: Greenwalt, Paul 
Sent: Tuesday, May 16, 2017 2:23 AM
To: intel-wired-lan@lists.osuosl.org
Cc: Greenwalt, Paul <paul.greenwalt@intel.com>
Subject: [PATCH] ixgbe: Extend firmware version support

Extend FW version reporting by displaying information from the eeprom iscsi or nvm OEM block EEPROM.

This will allow us to more accurately identify the FW.

Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |  3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |  7 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c    |  7 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    | 93 +++++++++++++++++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    | 17 +++++
 5 files changed, 112 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index dd55787..91ea773 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -712,8 +712,7 @@ struct ixgbe_adapter {
 
 	u16 bridge_mode;
 
-	u16 eeprom_verh;
-	u16 eeprom_verl;
+	char eeprom_id[NVM_VER_SIZE];
 	u16 eeprom_cap;
 
 	u32 interrupt_event;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 0b75d304..22d1bda 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1013,16 +1013,13 @@ static void ixgbe_get_drvinfo(struct net_device *netdev,
 			      struct ethtool_drvinfo *drvinfo)  {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	u32 nvm_track_id;
 
 	strlcpy(drvinfo->driver, ixgbe_driver_name, sizeof(drvinfo->driver));
 	strlcpy(drvinfo->version, ixgbe_driver_version,
 		sizeof(drvinfo->version));
 
-	nvm_track_id = (adapter->eeprom_verh << 16) |
-			adapter->eeprom_verl;
-	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "0x%08x",
-		 nvm_track_id);
+	strlcpy(drvinfo->fw_version, adapter->eeprom_id,
+		sizeof(drvinfo->fw_version));
 
 	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
 		sizeof(drvinfo->bus_info));
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 2a653ec..e320621 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -1034,11 +1034,8 @@ int ixgbe_fcoe_get_hbainfo(struct net_device *netdev,
 		 ixgbe_driver_name,
 		 ixgbe_driver_version);
 	/* Firmware Version */
-	snprintf(info->firmware_version,
-		 sizeof(info->firmware_version),
-		 "0x%08x",
-		 (adapter->eeprom_verh << 16) |
-		  adapter->eeprom_verl);
+	strlcpy(info->firmware_version, adapter->eeprom_id,
+		sizeof(info->firmware_version));
 
 	/* Model */
 	if (hw->mac.type == ixgbe_mac_82599EB) { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ee20a2b..45e8b5b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9989,6 +9989,95 @@ bool ixgbe_wol_supported(struct ixgbe_adapter *adapter, u16 device_id,  }
 
 /**
+ * ixgbe_set_fw_version - Set FW version
+ * @adapter: the adapter private structure
+ *
+ * This function is used by probe and ethtool to determine the FW 
+version to
+ * format to display. The FW version is taken from the EEPROM/NVM.
+ *
+ **/
+static void ixgbe_set_fw_version(struct ixgbe_adapter *adapter) {
+	struct ixgbe_hw *hw = &adapter->hw;
+	u16 etk_id_h = 0, etk_id_l = 0;
+	u16 offset = 0;
+	u32 etk_id;
+
+	/* Check for OEM Product Version block format */
+	hw->eeprom.ops.read(hw, NVM_OEM_PROD_VER_PTR, &offset);
+
+	/* Make sure offset to OEM Product Version block is valid */
+	if (!(offset == 0x0) && !(offset == NVM_INVALID_PTR)) {
+		u16 mod_len = 0, cap = 0;
+
+		/* Read product version block */
+		hw->eeprom.ops.read(hw, offset, &mod_len);
+		hw->eeprom.ops.read(hw, offset + NVM_OEM_PROD_VER_CAP_OFF,
+				    &cap);
+
+		/* Only display OEM product version if valid block */
+		if (mod_len == NVM_OEM_PROD_VER_MOD_LEN &&
+		    (cap & NVM_OEM_PROD_VER_CAP_MASK) == 0x0) {
+			u16 build, major, patch, prod_ver, rel_num;
+
+			hw->eeprom.ops.read(hw, offset + NVM_OEM_PROD_VER_OFF_L,
+					    &prod_ver);
+			hw->eeprom.ops.read(hw, offset + NVM_OEM_PROD_VER_OFF_H,
+					    &rel_num);
+
+			major = prod_ver >> NVM_VER_SHIFT;
+			build = prod_ver & NVM_VER_MASK;
+			patch = rel_num;
+
+			snprintf(adapter->eeprom_id, sizeof(adapter->eeprom_id),
+				 "%x.%x.%x", major, build, patch);
+			return;
+		}
+	}
+
+	/* Save off EEPROM version number and Option Rom version which
+	 * together make a unique identify for the eeprom
+	 */
+	hw->eeprom.ops.read(hw, NVM_ETK_OFF_HI, &etk_id_h);
+	hw->eeprom.ops.read(hw, NVM_ETK_OFF_LOW, &etk_id_l);
+	etk_id = (etk_id_h << NVM_ETK_SHIFT) | etk_id_l;
+
+	/* Check for SCSI block version format */
+	hw->eeprom.ops.read(hw, NVM_ISCSI_BLCK_PTR, &offset);
+
+	/* Make sure offset to SCSI block is valid */
+	if (!(offset == 0x0) && !(offset == NVM_INVALID_PTR)) {
+		u16 nvm_cfg_blkh = 0, nvm_cfg_blkl = 0;
+
+		hw->eeprom.ops.read(hw,
+				    offset + NVM_ISCSI_IMG_VER_OFF_H,
+				    &nvm_cfg_blkh);
+		hw->eeprom.ops.read(hw,
+				    offset + NVM_ISCSI_IMG_VER_OFF_L,
+				    &nvm_cfg_blkl);
+
+		/* Only display Option Rom if exist */
+		if (nvm_cfg_blkl && nvm_cfg_blkh) {
+			u16 build, major, patch;
+
+			major = nvm_cfg_blkl >> NVM_VER_SHIFT;
+			build = (nvm_cfg_blkl << NVM_VER_SHIFT) |
+				(nvm_cfg_blkh >> NVM_VER_SHIFT);
+			patch = nvm_cfg_blkh & NVM_VER_MASK;
+
+			snprintf(adapter->eeprom_id, sizeof(adapter->eeprom_id),
+				 "0x%08x, %d.%d.%d", etk_id, major, build,
+				 patch);
+			return;
+		}
+	}
+
+	/* Set ETrack ID format */
+	snprintf(adapter->eeprom_id, sizeof(adapter->eeprom_id),
+		 "0x%08x", etk_id);
+}
+
+/**
  * ixgbe_probe - Device Initialization Routine
  * @pdev: PCI device information struct
  * @ent: entry in ixgbe_pci_tbl
@@ -10326,9 +10415,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
 
-	/* save off EEPROM version number */
-	hw->eeprom.ops.read(hw, 0x2e, &adapter->eeprom_verh);
-	hw->eeprom.ops.read(hw, 0x2d, &adapter->eeprom_verl);
+	ixgbe_set_fw_version(adapter);
 
 	/* pick up the PCI bus settings for reporting later */
 	if (ixgbe_pcie_from_parent(hw))
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index 9c2460c..a0a6d84 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -2083,6 +2083,23 @@ enum {
 #define IXGBE_NVM_POLL_WRITE       1  /* Flag for polling for write complete */
 #define IXGBE_NVM_POLL_READ        0  /* Flag for polling for read complete */
 
+#define NVM_ISCSI_BLCK_PTR	0x17  /* iSCSI configuration block pointer */
+#define NVM_ISCSI_IMG_VER_OFF_H	0x84  /* iSCSI combo image version high */
+#define NVM_ISCSI_IMG_VER_OFF_L	0x83  /* iSCSI combo image version low */
+#define NVM_VER_MASK		0x00FF /* version mask */
+#define NVM_VER_SHIFT		8     /* version bit shift */
+#define NVM_OEM_PROD_VER_PTR	0x1B  /* OEM Product version block pointer */
+#define NVM_OEM_PROD_VER_CAP_OFF 0x1  /* OEM Product version format offset */
+#define NVM_OEM_PROD_VER_OFF_L	0x2   /* OEM Product version offset low */
+#define NVM_OEM_PROD_VER_OFF_H	0x3   /* OEM Product version offset high */
+#define NVM_OEM_PROD_VER_CAP_MASK 0xF /* OEM Product version cap mask 
+*/ #define NVM_OEM_PROD_VER_MOD_LEN 0x3  /* OEM Product version module length */
+#define NVM_ETK_OFF_LOW		0x2D  /* version low order word */
+#define NVM_ETK_OFF_HI		0x2E  /* version high order word */
+#define NVM_ETK_SHIFT		16    /* high version word shift */
+#define NVM_INVALID_PTR		0xFFFF
+#define NVM_VER_SIZE		32    /* version sting size */
+
 #define NVM_INIT_CTRL_3			0x38
 #define NVM_INIT_CTRL_3_LPLU		0x8
 #define NVM_INIT_CTRL_3_D10GMP_PORT0	0x40
--
2.7.4


^ permalink raw reply related

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: Amritha Nambiar @ 2017-05-20  0:58 UTC (permalink / raw)
  To: intel-wired-lan

The following series introduces a new harware offload mode in tc/mqprio where the TCs, the queue configurations and bandwidth rate limits are offloaded to the hardware.
The i40e driver enables the new mqprio hardware offload mechanism factoring the TCs, queue configuration and bandwidth rates by creating HW channel VSIs. 

In this mode, the priority to traffic class mapping and the user specified queue ranges are used to configure the traffic class when the 'hw' option is set to 2. This is achieved by creating HW channels(VSI). A new channel is created for each of the traffic class configuration offloaded via mqprio framework except for the first TC (TC0) which is for the main VSI. TC0 for the main VSI is also reconfigured as per user provided queue parameters. Finally, bandwidth rate limits are set on these traffic classes through the mqprio offload framework by sending these rates in addition to the number of TCs and the queue configurations.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
  queues 4 at 0 4 at 4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2

To dump the bandwidth rates:

# tc qdisc show dev eth0
  qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
               queues:(0:3) (4:7)
               min rates:0bit 0bit
               max rates:55Mbit 60Mbit

---

Amritha Nambiar (4):
      [next-queue]net: mqprio: Introduce new hardware offload mode in mqprio for offloading full TC configurations
      [next-queue]net: i40e: Add infrastructure for queue channel support with the TCs and queue configurations offloaded via mqprio scheduler
      [next-queue]net: i40e: Enable mqprio full offload mode in the i40e driver for configuring TCs and queue mapping
      [next-queue]net: i40e: Add support to set max bandwidth rates for TCs offloaded via tc/mqprio


 drivers/net/ethernet/intel/i40e/i40e.h         |   42 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    6 
 drivers/net/ethernet/intel/i40e/i40e_main.c    | 1365 +++++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |    2 
 include/linux/netdevice.h                      |    2 
 include/net/pkt_cls.h                          |    7 
 include/uapi/linux/pkt_sched.h                 |   13 
 net/sched/sch_mqprio.c                         |  169 +++
 8 files changed, 1449 insertions(+), 157 deletions(-)

--

^ permalink raw reply

* [Intel-wired-lan] [PATCH 1/4] [next-queue]net: mqprio: Introduce new hardware offload mode in mqprio for offloading full TC configurations
From: Amritha Nambiar @ 2017-05-20  0:58 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

This patch introduces a new hardware offload mode in mqprio
which makes full use of the mqprio options, the TCs, the
queue configurations and the bandwidth rates for the TCs.
This is achieved by setting the value 2 for the "hw" option.
This new offload mode supports new attributes for traffic
class such as minimum and maximum values for bandwidth rate limits.

Introduces a new datastructure 'tc_mqprio_qopt_offload' for offloading
mqprio queue options and use this to be shared between the kernel and
device driver. This contains a copy of the exisiting datastructure
for mqprio queue options. This new datastructure can be extended when
adding new attributes for traffic class such as bandwidth rate limits. The
existing datastructure for mqprio queue options will be shared between the
kernel and userspace.

This patch enables configuring additional attributes associated
with a traffic class such as minimum and maximum bandwidth
rates and can be offloaded to the hardware in the new offload mode.
The min and max limits for bandwidth rates are provided
by the user along with the the TCs and the queue configurations
when creating the mqprio qdisc.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
  queues 4 at 0 4 at 4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2

To dump the bandwidth rates:

# tc qdisc show dev eth0
qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
             queues:(0:3) (4:7)
             min rates:0bit 0bit
             max rates:55Mbit 60Mbit

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/netdevice.h      |    2 
 include/net/pkt_cls.h          |    7 ++
 include/uapi/linux/pkt_sched.h |   13 +++
 net/sched/sch_mqprio.c         |  169 +++++++++++++++++++++++++++++++++++++---
 4 files changed, 180 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0150b2d..17b9066 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -779,6 +779,7 @@ enum {
 	TC_SETUP_CLSFLOWER,
 	TC_SETUP_MATCHALL,
 	TC_SETUP_CLSBPF,
+	TC_SETUP_MQPRIO_EXT,
 };
 
 struct tc_cls_u32_offload;
@@ -791,6 +792,7 @@ struct tc_to_netdev {
 		struct tc_cls_matchall_offload *cls_mall;
 		struct tc_cls_bpf_offload *cls_bpf;
 		struct tc_mqprio_qopt *mqprio;
+		struct tc_mqprio_qopt_offload *mqprio_qopt;
 	};
 	bool egress_dev;
 };
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 2c213a6..5ab8052 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -549,6 +549,13 @@ struct tc_cls_bpf_offload {
 	u32 gen_flags;
 };
 
+struct tc_mqprio_qopt_offload {
+	/* struct tc_mqprio_qopt must always be the first element */
+	struct tc_mqprio_qopt qopt;
+	u32	flags;
+	u64	min_rate[TC_QOPT_MAX_QUEUE];
+	u64	max_rate[TC_QOPT_MAX_QUEUE];
+};
 
 /* This structure holds cookie structure that is passed from user
  * to the kernel for actions and classifiers
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 099bf55..cf2a146 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -620,6 +620,7 @@ struct tc_drr_stats {
 enum {
 	TC_MQPRIO_HW_OFFLOAD_NONE,	/* no offload requested */
 	TC_MQPRIO_HW_OFFLOAD_TCS,	/* offload TCs, no queue counts */
+	TC_MQPRIO_HW_OFFLOAD,		/* fully supported offload */
 	__TC_MQPRIO_HW_OFFLOAD_MAX
 };
 
@@ -633,6 +634,18 @@ struct tc_mqprio_qopt {
 	__u16	offset[TC_QOPT_MAX_QUEUE];
 };
 
+#define TC_MQPRIO_F_MIN_RATE  0x1
+#define TC_MQPRIO_F_MAX_RATE  0x2
+
+enum {
+	TCA_MQPRIO_UNSPEC,
+	TCA_MQPRIO_MIN_RATE64,
+	TCA_MQPRIO_MAX_RATE64,
+	__TCA_MQPRIO_MAX,
+};
+
+#define TCA_MQPRIO_MAX (__TCA_MQPRIO_MAX - 1)
+
 /* SFB */
 
 enum {
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 0a4cf27..6457ec9 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -18,10 +18,13 @@
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/sch_generic.h>
+#include <net/pkt_cls.h>
 
 struct mqprio_sched {
 	struct Qdisc		**qdiscs;
 	int hw_offload;
+	u32 flags;
+	u64 min_rate[TC_QOPT_MAX_QUEUE], max_rate[TC_QOPT_MAX_QUEUE];
 };
 
 static void mqprio_destroy(struct Qdisc *sch)
@@ -39,10 +42,21 @@ static void mqprio_destroy(struct Qdisc *sch)
 	}
 
 	if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc) {
-		struct tc_mqprio_qopt offload = { 0 };
-		struct tc_to_netdev tc = { .type = TC_SETUP_MQPRIO,
-					   { .mqprio = &offload } };
+		struct tc_mqprio_qopt_offload offload = { 0 };
+		struct tc_to_netdev tc = { 0 };
 
+		switch (priv->hw_offload) {
+		case TC_MQPRIO_HW_OFFLOAD_TCS:
+			tc.type = TC_SETUP_MQPRIO;
+			tc.mqprio = &offload.qopt;
+			break;
+		case TC_MQPRIO_HW_OFFLOAD:
+			tc.type = TC_SETUP_MQPRIO_EXT;
+			tc.mqprio_qopt = &offload;
+			break;
+		default:
+			return;
+		}
 		dev->netdev_ops->ndo_setup_tc(dev, sch->handle, 0, &tc);
 	} else {
 		netdev_set_num_tc(dev, 0);
@@ -99,6 +113,24 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
 	return 0;
 }
 
+static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
+	[TCA_MQPRIO_MIN_RATE64] = { .type = NLA_NESTED },
+	[TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED },
+};
+
+static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
+		      const struct nla_policy *policy, int len)
+{
+	int nested_len = nla_len(nla) - NLA_ALIGN(len);
+
+	if (nested_len >= nla_attr_size(0))
+		return nla_parse(tb, maxtype, nla_data(nla) + NLA_ALIGN(len),
+				 nested_len, policy, NULL);
+
+	memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
+	return 0;
+}
+
 static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct net_device *dev = qdisc_dev(sch);
@@ -107,6 +139,10 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 	struct Qdisc *qdisc;
 	int i, err = -EOPNOTSUPP;
 	struct tc_mqprio_qopt *qopt = NULL;
+	struct nlattr *tb[TCA_MQPRIO_MAX + 1];
+	struct nlattr *attr;
+	int rem;
+	int len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
 
 	BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE);
 	BUILD_BUG_ON(TC_BITMASK != TC_QOPT_BITMASK);
@@ -124,6 +160,51 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 	if (mqprio_parse_opt(dev, qopt))
 		return -EINVAL;
 
+	if (len > 0) {
+		err = parse_attr(tb, TCA_MQPRIO_MAX, opt, mqprio_policy,
+				 sizeof(*qopt));
+		if (err < 0)
+			return err;
+
+		if (tb[TCA_MQPRIO_MIN_RATE64]) {
+			if (qopt->hw != TC_MQPRIO_HW_OFFLOAD)
+				return -EINVAL;
+
+			i = 0;
+			nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64],
+					    rem) {
+				if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64)
+					return -EINVAL;
+
+				if (i >= qopt->num_tc)
+					return -EINVAL;
+
+				priv->min_rate[i] = *(u64 *)nla_data(attr);
+				i++;
+			}
+			priv->flags |= TC_MQPRIO_F_MIN_RATE;
+		}
+
+		if (tb[TCA_MQPRIO_MAX_RATE64]) {
+			if (qopt->hw != TC_MQPRIO_HW_OFFLOAD)
+				return -EINVAL;
+
+			i = 0;
+			nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64],
+					    rem) {
+				if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64)
+					return -EINVAL;
+
+				if (i >= qopt->num_tc)
+					return -EINVAL;
+
+				priv->max_rate[i] = *(u64 *)nla_data(attr);
+				i++;
+			}
+			priv->flags |= TC_MQPRIO_F_MAX_RATE;
+		}
+	}
+
 	/* pre-allocate qdisc, attachment can't fail */
 	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
 			       GFP_KERNEL);
@@ -148,15 +229,36 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 	 * supplied and verified mapping
 	 */
 	if (qopt->hw) {
-		struct tc_mqprio_qopt offload = *qopt;
-		struct tc_to_netdev tc = { .type = TC_SETUP_MQPRIO,
-					   { .mqprio = &offload } };
+		struct tc_mqprio_qopt_offload offload = {.qopt = *qopt};
+		struct tc_to_netdev tc = { 0 };
+
+		switch (qopt->hw) {
+		case TC_MQPRIO_HW_OFFLOAD_TCS:
+			tc.type = TC_SETUP_MQPRIO;
+			tc.mqprio = &offload.qopt;
+			break;
+		case TC_MQPRIO_HW_OFFLOAD:
+			tc.type = TC_SETUP_MQPRIO_EXT;
+			tc.mqprio_qopt = &offload;
+
+			offload.flags = priv->flags;
+			if (priv->flags & TC_MQPRIO_F_MIN_RATE)
+				for (i = 0; i < offload.qopt.num_tc; i++)
+					offload.min_rate[i] = priv->min_rate[i];
+
+			if (priv->flags & TC_MQPRIO_F_MAX_RATE)
+				for (i = 0; i < offload.qopt.num_tc; i++)
+					offload.max_rate[i] = priv->max_rate[i];
+			break;
+		default:
+			return -EINVAL;
+		}
 
 		err = dev->netdev_ops->ndo_setup_tc(dev, sch->handle, 0, &tc);
 		if (err)
 			return err;
 
-		priv->hw_offload = offload.hw;
+		priv->hw_offload = offload.qopt.hw;
 	} else {
 		netdev_set_num_tc(dev, qopt->num_tc);
 		for (i = 0; i < qopt->num_tc; i++)
@@ -226,11 +328,51 @@ static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
 	return 0;
 }
 
+static int dump_rates(struct mqprio_sched *priv,
+		      struct tc_mqprio_qopt *opt, struct sk_buff *skb)
+{
+	struct nlattr *nest;
+	int i;
+
+	if (priv->flags & TC_MQPRIO_F_MIN_RATE) {
+		nest = nla_nest_start(skb, TCA_MQPRIO_MIN_RATE64);
+		if (!nest)
+			goto nla_put_failure;
+
+		for (i = 0; i < opt->num_tc; i++) {
+			if (nla_put(skb, TCA_MQPRIO_MIN_RATE64,
+				    sizeof(priv->min_rate[i]),
+				    &priv->min_rate[i]))
+				goto nla_put_failure;
+		}
+		nla_nest_end(skb, nest);
+	}
+
+	if (priv->flags & TC_MQPRIO_F_MAX_RATE) {
+		nest = nla_nest_start(skb, TCA_MQPRIO_MAX_RATE64);
+		if (!nest)
+			goto nla_put_failure;
+
+		for (i = 0; i < opt->num_tc; i++) {
+			if (nla_put(skb, TCA_MQPRIO_MAX_RATE64,
+				    sizeof(priv->max_rate[i]),
+				    &priv->max_rate[i]))
+				goto nla_put_failure;
+		}
+		nla_nest_end(skb, nest);
+	}
+	return 0;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
 static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	struct mqprio_sched *priv = qdisc_priv(sch);
-	unsigned char *b = skb_tail_pointer(skb);
+	struct nlattr *nla = (struct nlattr *)skb_tail_pointer(skb);
 	struct tc_mqprio_qopt opt = { 0 };
 	struct Qdisc *qdisc;
 	unsigned int i;
@@ -261,12 +403,17 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 		opt.offset[i] = dev->tc_to_txq[i].offset;
 	}
 
-	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+	if (nla_put(skb, TCA_OPTIONS, NLA_ALIGN(sizeof(opt)), &opt))
 		goto nla_put_failure;
 
-	return skb->len;
+	if (priv->flags & TC_MQPRIO_F_MIN_RATE ||
+	    priv->flags & TC_MQPRIO_F_MAX_RATE)
+		if (dump_rates(priv, &opt, skb) != 0)
+			goto nla_put_failure;
+
+	return nla_nest_end(skb, nla);
 nla_put_failure:
-	nlmsg_trim(skb, b);
+	nlmsg_trim(skb, nla);
 	return -1;
 }
 


^ permalink raw reply related

* [Intel-wired-lan] [PATCH 2/4] [next-queue]net: i40e: Add infrastructure for queue channel support with the TCs and queue configurations offloaded via mqprio scheduler
From: Amritha Nambiar @ 2017-05-20  0:58 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

This patch sets up the infrastructure for offloading TCs and
queue configurations to the hardware by creating HW channels(VSI).
A new channel is created for each of the traffic class
configuration offloaded via mqprio framework except for the first TC
(TC0). TC0 for the main VSI is also reconfigured as per user provided
queue parameters. Queue counts that are not power-of-2 are handled by
reconfiguring RSS by reprogramming LUTs using the queue count value.
This patch also handles configuring the TX rings for the channels,
setting up the RX queue map for channel.

Also, the channels so created are removed and all the queue
configuration is set to default when the qdisc is detached from the
root of the device.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |   36 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |  740 +++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |    2 
 3 files changed, 771 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 395ca94..0915b02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -330,6 +330,24 @@ struct i40e_flex_pit {
 	u8 pit_index;
 };
 
+struct i40e_channel {
+	struct list_head list;
+	bool initialized;
+	u8 type;
+	u16 vsi_number;
+	u16 stat_counter_idx;
+	u16 base_queue;
+	u16 num_queue_pairs; /* Requested by user */
+	u16 allowed_queue_pairs;
+	u16 seid;
+
+	u8 enabled_tc;
+	struct i40e_aqc_vsi_properties_data info;
+
+	/* track this channel belongs to which VSI */
+	struct i40e_vsi *parent_vsi;
+};
+
 /* struct that defines the Ethernet device */
 struct i40e_pf {
 	struct pci_dev *pdev;
@@ -442,6 +460,7 @@ struct i40e_pf {
 #define I40E_FLAG_CLIENT_L2_CHANGE		BIT_ULL(56)
 #define I40E_FLAG_WOL_MC_MAGIC_PKT_WAKE		BIT_ULL(57)
 #define I40E_FLAG_LEGACY_RX			BIT_ULL(58)
+#define I40E_FLAG_TC_MQPRIO			BIT_ULL(59)
 
 	struct i40e_client_instance *cinst;
 	bool stat_offsets_loaded;
@@ -523,6 +542,9 @@ struct i40e_pf {
 	u32 ioremap_len;
 	u32 fd_inv;
 	u16 phy_led_val;
+
+#define I40E_MAX_QUEUES_PER_CH	64
+	u16 override_q_count;
 };
 
 /**
@@ -684,6 +706,16 @@ struct i40e_vsi {
 	bool current_isup;	/* Sync 'link up' logging */
 	enum i40e_aq_link_speed current_speed;	/* Sync link speed logging */
 
+	/* channel specific fields */
+	u16 cnt_q_avail; /* num of queues available for channel usage */
+	u16 orig_rss_size;
+	u16 current_rss_size;
+
+	/* keeps track of next_base_queue to be used for channel setup */
+	atomic_t next_base_queue;
+
+	struct list_head ch_list;
+
 	void *priv;	/* client driver data reference. */
 
 	/* VSI specific handlers */
@@ -716,6 +748,9 @@ struct i40e_q_vector {
 	bool arm_wb_state;
 #define ITR_COUNTDOWN_START 100
 	u8 itr_countdown;	/* when 0 should adjust ITR */
+
+	/* Following field(s) are specific to channel usage */
+	bool is_an_atq;
 } ____cacheline_internodealigned_in_smp;
 
 /* lan device */
@@ -972,4 +1007,5 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
 void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
+int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8d1d3b85..e1bea45 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2864,7 +2864,7 @@ static void i40e_config_xps_tx_ring(struct i40e_ring *ring)
 	struct i40e_vsi *vsi = ring->vsi;
 	cpumask_var_t mask;
 
-	if (!ring->q_vector || !ring->netdev)
+	if (!ring->q_vector || !ring->netdev || ring->ch)
 		return;
 
 	/* Single TC mode enable XPS */
@@ -2937,7 +2937,17 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	 * initialization. This has to be done regardless of
 	 * DCB as by default everything is mapped to TC0.
 	 */
-	tx_ctx.rdylist = le16_to_cpu(vsi->info.qs_handle[ring->dcb_tc]);
+
+	if (ring->ch) {
+		tx_ctx.rdylist =
+			le16_to_cpu(ring->ch->info.qs_handle[ring->dcb_tc]);
+
+		dev_dbg(&vsi->back->pdev->dev, "ch, pf_q %d, rdylist %d\n",
+			pf_q, tx_ctx.rdylist);
+	} else {
+		tx_ctx.rdylist = le16_to_cpu(vsi->info.qs_handle[ring->dcb_tc]);
+	}
+
 	tx_ctx.rdylist_act = 0;
 
 	/* clear the context in the HMC */
@@ -2959,12 +2969,25 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	}
 
 	/* Now associate this queue with this PCI function */
-	if (vsi->type == I40E_VSI_VMDQ2) {
-		qtx_ctl = I40E_QTX_CTL_VM_QUEUE;
-		qtx_ctl |= ((vsi->id) << I40E_QTX_CTL_VFVM_INDX_SHIFT) &
-			   I40E_QTX_CTL_VFVM_INDX_MASK;
+	if (ring->ch) {
+		if (ring->ch->type == I40E_VSI_VMDQ2)
+			qtx_ctl = I40E_QTX_CTL_VM_QUEUE;
+		else if (ring->ch->type == I40E_VSI_SRIOV)
+			qtx_ctl = I40E_QTX_CTL_VF_QUEUE;
+
+		qtx_ctl |= (ring->ch->vsi_number <<
+			    I40E_QTX_CTL_VFVM_INDX_SHIFT) &
+			    I40E_QTX_CTL_VFVM_INDX_MASK;
+		dev_dbg(&vsi->back->pdev->dev, "ch, pf_q %d, qtx_ctl 0x%x\n",
+			pf_q, qtx_ctl);
 	} else {
-		qtx_ctl = I40E_QTX_CTL_PF_QUEUE;
+		if (vsi->type == I40E_VSI_VMDQ2) {
+			qtx_ctl = I40E_QTX_CTL_VM_QUEUE;
+			qtx_ctl |= ((vsi->id) << I40E_QTX_CTL_VFVM_INDX_SHIFT) &
+				    I40E_QTX_CTL_VFVM_INDX_MASK;
+		} else {
+			qtx_ctl = I40E_QTX_CTL_PF_QUEUE;
+		}
 	}
 
 	qtx_ctl |= ((hw->pf_id << I40E_QTX_CTL_PF_INDX_SHIFT) &
@@ -5060,6 +5083,699 @@ static int i40e_vsi_config_tc(struct i40e_vsi *vsi, u8 enabled_tc)
 }
 
 /**
+ * i40e_remove_queue_channel - Remove queue channel for the TC
+ * @vsi: VSI to be configured
+ *
+ * Remove queue channel for the TC
+ **/
+static void i40e_remove_queue_channel(struct i40e_vsi *vsi)
+{
+	struct i40e_channel *ch, *ch_tmp;
+	int ret, i;
+
+	/* Reset rss size that was stored when reconfiguring rss for
+	 * channel VSIs with non-power-of-2 queue count.
+	 */
+	vsi->current_rss_size = 0;
+
+	/* perform cleanup for channels if they exist */
+	if (list_empty(&vsi->ch_list))
+		return;
+
+	list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list, list) {
+		struct i40e_vsi *p_vsi;
+
+		list_del(&ch->list);
+		p_vsi = ch->parent_vsi;
+		if (!p_vsi || !ch->initialized) {
+			kfree(ch);
+			continue;
+		}
+		/* Reset queue contexts */
+		for (i = 0; i < ch->num_queue_pairs; i++) {
+			struct i40e_ring *tx_ring, *rx_ring;
+			u16 pf_q;
+
+			pf_q = ch->base_queue + i;
+			tx_ring = vsi->tx_rings[pf_q];
+			tx_ring->ch = NULL;
+
+			rx_ring = vsi->rx_rings[pf_q];
+			rx_ring->ch = NULL;
+		}
+
+		/* delete VSI from FW */
+		ret = i40e_aq_delete_element(&vsi->back->hw, ch->seid,
+					     NULL);
+		if (ret)
+			dev_err(&vsi->back->pdev->dev,
+				"unable to remove channel (%d) for parent VSI(%d)\n",
+				ch->seid, p_vsi->seid);
+		kfree(ch);
+	}
+}
+
+/**
+ * i40e_is_any_channel - channel exist or not
+ * @vsi: ptr to VSI to which channels are associated with
+ *
+ * Returns true or false if channel(s) exist for associated VSI or not
+ **/
+static bool i40e_is_any_channel(struct i40e_vsi *vsi)
+{
+	struct i40e_channel *ch, *ch_tmp;
+
+	list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list, list) {
+		if (ch->initialized)
+			return true;
+	}
+
+	return false;
+}
+
+/**
+ * i40e_get_max_queues_for_channel
+ * @vsi: ptr to VSI to which channels are associated with
+ *
+ * Helper function which returns max_queues count ever used for any of the
+ * channel which are parent of specified VSI
+ **/
+static int i40e_get_max_queues_for_channel(struct i40e_vsi *vsi)
+{
+	struct i40e_channel *ch, *ch_tmp;
+	int max = 0;
+
+	list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list, list) {
+		if (!ch->initialized)
+			continue;
+		if (ch->allowed_queue_pairs > max)
+			max = ch->allowed_queue_pairs;
+	}
+
+	return max;
+}
+
+/**
+ * i40e_validate_num_queues - validate num_queues w.r.t channel
+ * @pf: ptr to PF device
+ * @num_queues: number of queues
+ * @vsi: the parent VSI
+ * @reconfig_rss: indicates should the RSS be reconfigured or not
+ *
+ * This function validates number of queues in the context of new channel
+ * which is being established and determines if RSS should be reconfigured
+ * or not for parent VSI.
+ **/
+static int i40e_validate_num_queues(struct i40e_pf *pf, int num_queues,
+				    struct i40e_vsi *vsi, bool *reconfig_rss)
+{
+	int max_ch_queues;
+
+	if (!reconfig_rss)
+		return -EINVAL;
+
+	*reconfig_rss = false;
+
+	if (num_queues > I40E_MAX_QUEUES_PER_CH) {
+		dev_info(&pf->pdev->dev,
+			 "Failed to create VMDq VSI. User requested num_queues (%d) > I40E_MAX_QUEUES_PER_VSI (%u)\n",
+			 num_queues, I40E_MAX_QUEUES_PER_CH);
+		return -EINVAL;
+	}
+
+	if (vsi->current_rss_size) {
+		if (num_queues > vsi->current_rss_size) {
+			dev_info(&pf->pdev->dev,
+				 "Error: num_queues (%d) > vsi's current_size(%d)\n",
+				 num_queues, vsi->current_rss_size);
+			return -EINVAL;
+		} else if ((num_queues < vsi->current_rss_size) &&
+			   (!is_power_of_2(num_queues))) {
+			dev_info(&pf->pdev->dev,
+				 "Error: num_queues (%d) < vsi's current_size(%d), but not power of 2\n",
+				 num_queues, vsi->current_rss_size);
+			return -EINVAL;
+		}
+	}
+
+	if (!is_power_of_2(num_queues)) {
+		/* Find the max num_queues configures for channel if channel
+		 * exist.
+		 * if channel exist, then enforce 'num_queues' to be more than
+		 * max ever num_queues configured for channel.
+		 */
+		max_ch_queues = i40e_get_max_queues_for_channel(vsi);
+		if (num_queues < max_ch_queues) {
+			dev_info(&pf->pdev->dev,
+				 "Error: num_queues (%d) > main_vsi's current_size(%d)\n",
+				 num_queues, vsi->current_rss_size);
+			return -EINVAL;
+		}
+		*reconfig_rss = true;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_vsi_reconfig_rss - reconfig RSS based on specified rss_size
+ * @vsi: the VSI being setup
+ * @rss_size: size of RSS, accordingly LUT gets reprogrammed
+ *
+ * This function reconfigures RSS by reprogramming LUTs using 'rss_size'
+ **/
+static int i40e_vsi_reconfig_rss(struct i40e_vsi *vsi, u16 rss_size)
+{
+	struct i40e_pf *pf = vsi->back;
+	u8 seed[I40E_HKEY_ARRAY_SIZE];
+	struct i40e_hw *hw = &pf->hw;
+	int local_rss_size;
+	u8 *lut;
+	int ret;
+
+	if (!vsi->rss_size)
+		return -EINVAL;
+
+	if (rss_size > vsi->rss_size)
+		return -EINVAL;
+
+	local_rss_size = min_t(int, vsi->rss_size, rss_size);
+	lut = kzalloc(vsi->rss_table_size, GFP_KERNEL);
+	if (!lut)
+		return -ENOMEM;
+
+	/* Ignoring user configured lut if there is one */
+	i40e_fill_rss_lut(pf, lut, vsi->rss_table_size, local_rss_size);
+
+	/* Use user configured hash key if there is one, otherwise
+	 * use default.
+	 */
+	if (vsi->rss_hkey_user)
+		memcpy(seed, vsi->rss_hkey_user, I40E_HKEY_ARRAY_SIZE);
+	else
+		netdev_rss_key_fill((void *)seed, I40E_HKEY_ARRAY_SIZE);
+
+	ret = i40e_config_rss(vsi, seed, lut, vsi->rss_table_size);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "Cannot set RSS lut, err %s aq_err %s\n",
+			 i40e_stat_str(hw, ret),
+			 i40e_aq_str(hw, hw->aq.asq_last_status));
+		kfree(lut);
+		return ret;
+	}
+	kfree(lut);
+
+	/* Do the update w.r.t. storing rss_size */
+	if (!vsi->orig_rss_size)
+		vsi->orig_rss_size = vsi->rss_size;
+	vsi->current_rss_size = local_rss_size;
+
+	return ret;
+}
+
+/**
+ * i40e_channel_setup_queue_map - Setup a channel queue map based on enabled_tc
+ * @pf: ptr to PF device
+ * @vsi: the VSI being setup
+ * @ctxt: VSI context structure
+ * @enabled_tc: Enabled TCs bitmap
+ * @ch: ptr to channel structure
+ *
+ * Setup queue map based on enabled_tc for specific channel
+ **/
+static void i40e_channel_setup_queue_map(struct i40e_pf *pf,
+					 struct i40e_vsi_context *ctxt,
+					 u8 enabled_tc, struct i40e_channel *ch)
+{
+	u16 qcount, num_tc_qps, qmap;
+	int pow, num_qps;
+	u16 sections = 0;
+	/* At least TC0 is enabled in case of non-DCB case, non-MQPRIO */
+	u16 numtc = 1;
+	u8 offset = 0;
+
+	sections = I40E_AQ_VSI_PROP_QUEUE_MAP_VALID;
+	sections |= I40E_AQ_VSI_PROP_SCHED_VALID;
+
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		qcount = min_t(int, ch->num_queue_pairs,
+			       pf->num_lan_msix);
+		ch->allowed_queue_pairs = qcount;
+	} else {
+		qcount = 1;
+	}
+
+	/* find num of qps per traffic class */
+	num_tc_qps = qcount / numtc;
+	num_tc_qps = min_t(int, num_tc_qps, i40e_pf_get_max_q_per_tc(pf));
+	num_qps = qcount;
+
+	/* find the next higher power-of-2 of num queue pairs */
+	pow = ilog2(num_qps);
+	if (!is_power_of_2(num_qps))
+		pow++;
+
+	qmap = (offset << I40E_AQ_VSI_TC_QUE_OFFSET_SHIFT) |
+		(pow << I40E_AQ_VSI_TC_QUE_NUMBER_SHIFT);
+
+	/* Setup queue TC[0].qmap for given VSI context */
+	ctxt->info.tc_mapping[0] = cpu_to_le16(qmap);
+
+	ctxt->info.up_enable_bits = enabled_tc;
+	ctxt->info.mapping_flags |= cpu_to_le16(I40E_AQ_VSI_QUE_MAP_CONTIG);
+	ctxt->info.queue_mapping[0] = cpu_to_le16(ch->base_queue);
+	ctxt->info.valid_sections |= cpu_to_le16(sections);
+}
+
+/**
+ * i40e_add_channel - add a channel by adding VSI
+ * @pf: ptr to PF device
+ * @uplink_seid: underlying HW switching element (VEB) ID
+ * @ch: ptr to channel structure
+ *
+ * Add a channel (VSI) using add_vsi and queue_map
+ **/
+static int i40e_add_channel(struct i40e_pf *pf, u16 uplink_seid,
+			    struct i40e_channel *ch)
+{
+	struct i40e_hw *hw = &pf->hw;
+	struct i40e_vsi_context ctxt;
+	u8 enabled_tc = 0x1; /* TC0 enabled */
+	int ret;
+
+	if (!(ch->type == I40E_VSI_SRIOV || ch->type == I40E_VSI_VMDQ2)) {
+		dev_info(&pf->pdev->dev,
+			 "add new vsi failed, ch->type %d\n", ch->type);
+		return -EINVAL;
+	}
+
+	memset(&ctxt, 0, sizeof(ctxt));
+	ctxt.pf_num = hw->pf_id;
+	ctxt.vf_num = 0;
+	ctxt.uplink_seid = uplink_seid;
+	ctxt.connection_type = I40E_AQ_VSI_CONN_TYPE_NORMAL;
+	if (ch->type == I40E_VSI_SRIOV)
+		ctxt.flags = I40E_AQ_VSI_TYPE_VF;
+	else if (ch->type == I40E_VSI_VMDQ2)
+		ctxt.flags = I40E_AQ_VSI_TYPE_VMDQ2;
+
+	if (pf->flags & I40E_FLAG_VEB_MODE_ENABLED) {
+		ctxt.info.valid_sections |=
+		     cpu_to_le16(I40E_AQ_VSI_PROP_SWITCH_VALID);
+		ctxt.info.switch_id =
+		   cpu_to_le16(I40E_AQ_VSI_SW_ID_FLAG_ALLOW_LB);
+	}
+
+	/* Set queue map for a given VSI context */
+	i40e_channel_setup_queue_map(pf, &ctxt, enabled_tc, ch);
+
+	/* Now time to create VSI */
+	ret = i40e_aq_add_vsi(hw, &ctxt, NULL);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "add new vsi failed, err %s aq_err %s\n",
+			 i40e_stat_str(&pf->hw, ret),
+			 i40e_aq_str(&pf->hw,
+				     pf->hw.aq.asq_last_status));
+		return -ENOENT;
+	}
+
+	/* Success, update channel */
+	ch->enabled_tc = enabled_tc;
+	ch->seid = ctxt.seid;
+	ch->vsi_number = ctxt.vsi_number;
+	ch->stat_counter_idx = cpu_to_le16(ctxt.info.stat_counter_idx);
+
+	/* copy just the sections touched not the entire info
+	 * since not all sections are valid as returned by
+	 * update vsi params
+	 */
+	ch->info.mapping_flags = ctxt.info.mapping_flags;
+	memcpy(&ch->info.queue_mapping,
+	       &ctxt.info.queue_mapping, sizeof(ctxt.info.queue_mapping));
+	memcpy(&ch->info.tc_mapping, ctxt.info.tc_mapping,
+	       sizeof(ctxt.info.tc_mapping));
+
+	/* Now it's time to update 'num_queue_pairs' if it is more than
+	 * 'allowed_queue_pairs' because RX queue map is setup based on
+	 * value of 'allowed_queue_pairs' (min of num_queue_pairs,
+	 * num_lan_msix). This update is needed so that TX rings are setup
+	 * correctly.
+	 */
+	if (ch->num_queue_pairs > ch->allowed_queue_pairs)
+		ch->num_queue_pairs = ch->allowed_queue_pairs;
+
+	return 0;
+}
+
+static int i40e_channel_config_bw(struct i40e_vsi *vsi, struct i40e_channel *ch,
+				  u8 *bw_share)
+{
+	struct i40e_aqc_configure_vsi_tc_bw_data bw_data;
+	i40e_status ret;
+	int i;
+
+	bw_data.tc_valid_bits = ch->enabled_tc;
+	for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
+		bw_data.tc_bw_credits[i] = bw_share[i];
+
+	ret = i40e_aq_config_vsi_tc_bw(&vsi->back->hw, ch->seid,
+				       &bw_data, NULL);
+	if (ret) {
+		dev_info(&vsi->back->pdev->dev,
+			 "Config VSI BW allocation per TC failed, aq_err: %d for new_vsi->seid %u\n",
+			 vsi->back->hw.aq.asq_last_status, ch->seid);
+		return -EINVAL;
+	}
+
+	for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
+		ch->info.qs_handle[i] = bw_data.qs_handles[i];
+
+	return 0;
+}
+
+/**
+ * i40e_channel_config_tx_ring - config TX ring associated with new channel
+ * @pf: ptr to PF device
+ * @vsi: the VSI being setup
+ * @ch: ptr to channel structure
+ *
+ * Configure TX rings associated with channel (VSI) since queues are being
+ * from parent VSI.
+ **/
+static int i40e_channel_config_tx_ring(struct i40e_pf *pf,
+				       struct i40e_vsi *vsi,
+				       struct i40e_channel *ch)
+{
+	i40e_status ret;
+	int i;
+	u8 bw_share[I40E_MAX_TRAFFIC_CLASS] = {0};
+
+	/* Enable ETS TCs with equal BW Share for now across all VSIs */
+	for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++) {
+		if (ch->enabled_tc & BIT(i))
+			bw_share[i] = 1;
+	}
+
+	/* configure BW for new VSi */
+	ret = i40e_channel_config_bw(vsi, ch, bw_share);
+	if (ret) {
+		dev_info(&vsi->back->pdev->dev,
+			 "Failed configuring TC map %d for channel (seid %u)\n",
+			 ch->enabled_tc, ch->seid);
+		return ret;
+	}
+
+	for (i = 0; i < ch->num_queue_pairs; i++) {
+		struct i40e_ring *tx_ring, *rx_ring;
+		u16 pf_q;
+
+		pf_q = ch->base_queue + i;
+
+		/* Get to TX ring ptr of main VSI, for re-setup TX queue
+		 * context
+		 */
+		tx_ring = vsi->tx_rings[pf_q];
+		tx_ring->ch = ch;
+
+		/* Get the RX ring ptr */
+		rx_ring = vsi->rx_rings[pf_q];
+		rx_ring->ch = ch;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_setup_hw_channel - setup new channel
+ * @pf: ptr to PF device
+ * @vsi: the VSI being setup
+ * @ch: ptr to channel structure
+ * @uplink_seid: underlying HW switching element (VEB) ID
+ * @type: type of channel to be created (VMDq2/VF)
+ *
+ * Setup new channel (VSI) based on specified type (VMDq2/VF)
+ * and configures TX rings accordingly
+ **/
+static inline int i40e_setup_hw_channel(struct i40e_pf *pf,
+					struct i40e_vsi *vsi,
+					struct i40e_channel *ch,
+					u16 uplink_seid, u8 type)
+{
+	struct i40e_q_vector *q_vector;
+	int base_queue = 0;
+	int ret, i;
+
+	ch->initialized = false;
+	ch->base_queue = atomic_read(&vsi->next_base_queue);
+	ch->type = type;
+
+	/* Proceed with creation of channel (VMDq2/VF) VSI */
+	ret = i40e_add_channel(pf, uplink_seid, ch);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "failed to add_channel using uplink_seid %u\n",
+			 uplink_seid);
+		return ret;
+	}
+
+	/* Mark the successful creation of channel */
+	ch->initialized = true;
+
+	/* Mark q_vectors indicating that they are part of newly created
+	 * channel (VSI)
+	 */
+	base_queue = ch->base_queue;
+	for (i = 0; i < ch->num_queue_pairs; i++) {
+		q_vector = vsi->tx_rings[base_queue + i]->q_vector;
+
+		if (!q_vector)
+			continue;
+
+		q_vector->is_an_atq = true;
+	}
+
+	/* Reconfigure TX queues using QTX_CTL register */
+	ret = i40e_channel_config_tx_ring(pf, vsi, ch);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "failed to configure TX rings for channel %u\n",
+			 ch->seid);
+		return ret;
+	}
+
+	/* update 'next_base_queue' */
+	atomic_set(&vsi->next_base_queue,
+		   atomic_read(&vsi->next_base_queue) + ch->num_queue_pairs);
+
+	dev_dbg(&pf->pdev->dev,
+		"Added channel: vsi_seid %u, vsi_number %u, stat_counter_idx %u, num_queue_pairs %u, pf->next_base_queue %d\n",
+		ch->seid, ch->vsi_number, ch->stat_counter_idx,
+		ch->num_queue_pairs,
+		atomic_read(&vsi->next_base_queue));
+
+	return ret;
+}
+
+/**
+ * i40e_setup_channel - setup new channel using uplink element
+ * @pf: ptr to PF device
+ * @type: type of channel to be created (VMDq2/VF)
+ * @uplink_seid: underlying HW switching element (VEB) ID
+ * @ch: ptr to channel structure
+ *
+ * Setup new channel (VSI) based on specified type (VMDq2/VF)
+ * and uplink switching element (uplink_seid)
+ **/
+static bool i40e_setup_channel(struct i40e_pf *pf, struct i40e_vsi *vsi,
+			       struct i40e_channel *ch)
+{
+	u16 seid = vsi->seid;
+	u8 vsi_type;
+	int ret;
+
+	if (vsi->type == I40E_VSI_MAIN) {
+		vsi_type = I40E_VSI_VMDQ2;
+	} else if (vsi->type == I40E_VSI_SRIOV) {
+		vsi_type = I40E_VSI_SRIOV;
+	} else {
+		dev_err(&pf->pdev->dev, "unsupported vsi type(%d) of parent vsi\n",
+			vsi->type);
+		return false;
+	}
+
+	/* underlying switching element */
+	seid = pf->vsi[pf->lan_vsi]->uplink_seid;
+
+	/* create channel (VSI), configure TX rings */
+	ret = i40e_setup_hw_channel(pf, vsi, ch, seid, vsi_type);
+	if (ret) {
+		dev_err(&pf->pdev->dev, "failed to setup hw_channel\n");
+		return false;
+	}
+
+	return ch->initialized ? true : false;
+}
+
+/**
+ * i40e_create_queue_channel - function to create channel
+ * @vsi: VSI to be configured
+ * @ch: ptr to channel (it contains channel specific params)
+ *
+ * This function creates channel (VSI) using num_queues specified by user,
+ * reconfigs RSS if needed.
+ **/
+int i40e_create_queue_channel(struct i40e_vsi *vsi,
+			      struct i40e_channel *ch)
+{
+	struct i40e_pf *pf = vsi->back;
+	bool reconfig_rss;
+	int err;
+
+	if (!ch)
+		return -EINVAL;
+
+	if (!ch->num_queue_pairs) {
+		dev_err(&pf->pdev->dev, "Invalid num_queues requested: %d\n",
+			ch->num_queue_pairs);
+		return -EINVAL;
+	}
+
+	/* validate user requested num_queues for channel */
+	err = i40e_validate_num_queues(pf, ch->num_queue_pairs, vsi,
+				       &reconfig_rss);
+	if (err) {
+		dev_info(&pf->pdev->dev, "Failed to validate num_queues (%d)\n",
+			 ch->num_queue_pairs);
+		return -EINVAL;
+	}
+
+	/* By default we are in VEPA mode, if this is the first VF/VMDq
+	 * VSI to be added switch to VEB mode.
+	 */
+	if ((!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) ||
+	    (!i40e_is_any_channel(vsi))) {
+		if (!is_power_of_2(vsi->tc_config.tc_info[0].qcount)) {
+			dev_info(&pf->pdev->dev,
+				 "Failed to create channel. Override queues (%u) not power of 2\n",
+				 vsi->tc_config.tc_info[0].qcount);
+			return -EINVAL;
+		}
+
+		if (vsi->type == I40E_VSI_SRIOV) {
+			if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) {
+				dev_info(&pf->pdev->dev,
+					 "Expected to be VEB mode by this time\n");
+				return -EINVAL;
+			}
+		}
+		if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) {
+			pf->flags |= I40E_FLAG_VEB_MODE_ENABLED;
+
+			if (vsi->type == I40E_VSI_MAIN) {
+				if (pf->flags & I40E_FLAG_TC_MQPRIO)
+					i40e_do_reset(pf,
+					BIT_ULL(__I40E_PF_RESET_REQUESTED),
+						      true);
+				else
+					i40e_do_reset_safe(pf,
+					BIT_ULL(__I40E_PF_RESET_REQUESTED));
+			}
+		}
+		/* now onwards for main VSI, number of queues will be value
+		 * of TC0's queue count
+		 */
+	}
+
+	/* By this time, vsi->cnt_q_avail shall be set to non-zero and
+	 * it should be more than num_queues
+	 */
+	if (!vsi->cnt_q_avail || (vsi->cnt_q_avail < ch->num_queue_pairs)) {
+		dev_info(&pf->pdev->dev,
+			 "Error: cnt_q_avail (%u) less than num_queues %d\n",
+			 vsi->cnt_q_avail, ch->num_queue_pairs);
+		return -EINVAL;
+	}
+
+	/* reconfig_rss only if vsi type is MAIN_VSI, in case of SR_IOV VF VSI,
+	 * reconfig_rss to be handled by VF driver, if 'reconfig_rss' flag
+	 * is set to true
+	 */
+	if (reconfig_rss && (vsi->type == I40E_VSI_MAIN)) {
+		err = i40e_vsi_reconfig_rss(vsi, ch->num_queue_pairs);
+		if (err) {
+			dev_info(&pf->pdev->dev,
+				 "Error: unable to reconfig rss for num_queues (%u)\n",
+				 ch->num_queue_pairs);
+			return -EINVAL;
+		}
+	}
+
+	if (!i40e_setup_channel(pf, vsi, ch)) {
+		dev_info(&pf->pdev->dev, "Failed to setup channel\n");
+		return -EINVAL;
+	}
+
+	dev_info(&pf->pdev->dev,
+		 "Setup channel (id:%u) utilizing num_queues %d\n",
+		 ch->seid, ch->num_queue_pairs);
+
+	/* in case of VF, this will be main SRIOV VSI */
+	ch->parent_vsi = vsi;
+
+	/* and update main_vsi's count for queue_available to use */
+	vsi->cnt_q_avail -= ch->num_queue_pairs;
+
+	return 0;
+}
+
+/**
+ * i40e_configure_queue_channel - Add queue channel for the given TCs
+ * @vsi: VSI to be configured
+ *
+ * Configures queue channel mapping to the given TCs
+ **/
+static int i40e_configure_queue_channel(struct i40e_vsi *vsi)
+{
+	struct i40e_channel *ch;
+	int ret = 0, i;
+
+	INIT_LIST_HEAD(&vsi->ch_list);
+
+	/* Create app vsi with the TCs. Main VSI with TC0 is already set up */
+	for (i = 1; i < I40E_MAX_TRAFFIC_CLASS; i++)
+		if (vsi->tc_config.enabled_tc & BIT(i)) {
+			ch = kzalloc(sizeof(*ch), GFP_KERNEL);
+			if (!ch) {
+				ret = -ENOMEM;
+				goto err_free;
+			}
+
+			INIT_LIST_HEAD(&ch->list);
+			ch->num_queue_pairs =
+				vsi->tc_config.tc_info[i].qcount;
+			ch->base_queue =
+				vsi->tc_config.tc_info[i].qoffset;
+
+			list_add_tail(&ch->list, &vsi->ch_list);
+
+			ret = i40e_create_queue_channel(vsi, ch);
+			if (ret) {
+				dev_err(&vsi->back->pdev->dev,
+					"Failed creating queue channel with TC%d: queues %d\n",
+					i, ch->num_queue_pairs);
+				goto err_free;
+			}
+		}
+	return ret;
+
+err_free:
+	i40e_remove_queue_channel(vsi);
+	return ret;
+}
+
+/**
  * i40e_veb_config_tc - Configure TCs for given VEB
  * @veb: given VEB
  * @enabled_tc: TC bitmap
@@ -5502,10 +6218,20 @@ static int i40e_setup_tc(struct net_device *netdev, u8 tc)
 		goto exit;
 	}
 
+	if (pf->flags & I40E_FLAG_TC_MQPRIO) {
+		ret = i40e_configure_queue_channel(vsi);
+		if (ret) {
+			netdev_info(netdev,
+				    "Failed configuring queue channels\n");
+			goto exit;
+		}
+	}
+
 	/* Unquiesce VSI */
 	i40e_unquiesce_vsi(vsi);
 
 exit:
+	i40e_unquiesce_vsi(vsi);
 	return ret;
 }
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index f5de511..02e1e84 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -420,6 +420,8 @@ struct i40e_ring {
 					 * i40e_clean_rx_ring_irq() is called
 					 * for this ring.
 					 */
+
+	struct i40e_channel *ch;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)


^ permalink raw reply related

* [Intel-wired-lan] [PATCH 3/4] [next-queue]net: i40e: Enable mqprio full offload mode in the i40e driver for configuring TCs and queue mapping
From: Amritha Nambiar @ 2017-05-20  0:58 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

The i40e driver is modified to enable the new mqprio hardware
offload mode and factor the TCs and queue configuration by
creating channel VSIs. In this mode, the priority to traffic
class mapping and the user specified queue ranges are used
to configure the traffic classes when the 'hw' option is set
to 2.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 4\
  map 0 0 0 0 1 2 2 3 queues 2 at 0 2 at 2 1 at 4 1 at 5 hw 2

# tc qdisc show dev eth0
qdisc mqprio 8038: root  tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0
             queues:(0:1) (2:3) (4:4) (5:5)

The HW channels created are removed and all the queue configuration
is set to default when the qdisc is detached from the root of the
device.

#tc qdisc del dev eth0 root

This patch also disables setting up channels via ethtool (ethtool -L)
when the TCs are confgured using mqprio scheduler.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h         |    4 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    6 
 drivers/net/ethernet/intel/i40e/i40e_main.c    |  311 ++++++++++++++++++++++--
 3 files changed, 292 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 0915b02..a62f65a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -54,6 +54,8 @@
 #include <linux/clocksource.h>
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
+#include <net/pkt_cls.h>
+
 #include "i40e_type.h"
 #include "i40e_prototype.h"
 #include "i40e_client.h"
@@ -685,6 +687,7 @@ struct i40e_vsi {
 	enum i40e_vsi_type type;  /* VSI type, e.g., LAN, FCoE, etc */
 	s16 vf_id;		/* Virtual function ID for SRIOV VSIs */
 
+	struct tc_mqprio_qopt_offload mqprio_qopt; /* queue parameters */
 	struct i40e_tc_configuration tc_config;
 	struct i40e_aqc_vsi_properties_data info;
 
@@ -710,6 +713,7 @@ struct i40e_vsi {
 	u16 cnt_q_avail; /* num of queues available for channel usage */
 	u16 orig_rss_size;
 	u16 current_rss_size;
+	bool reconfig_rss;
 
 	/* keeps track of next_base_queue to be used for channel setup */
 	atomic_t next_base_queue;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 3d58762..ab52979 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -3841,6 +3841,12 @@ static int i40e_set_channels(struct net_device *dev,
 	if (vsi->type != I40E_VSI_MAIN)
 		return -EINVAL;
 
+	/* We do not support setting channels via ethtool when TCs are
+	 * configured through mqprio
+	 */
+	if (pf->flags & I40E_FLAG_TC_MQPRIO)
+		return -EINVAL;
+
 	/* verify they are not requesting separate vectors */
 	if (!count || ch->rx_count || ch->tx_count)
 		return -EINVAL;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e1bea45..7f61d4f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -68,6 +68,7 @@ static int i40e_reset(struct i40e_pf *pf);
 static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
 static void i40e_fdir_sb_setup(struct i40e_pf *pf);
 static int i40e_veb_get_bw_info(struct i40e_veb *veb);
+static int i40e_vsi_config_rss(struct i40e_vsi *vsi);
 
 /* i40e_pci_tbl - PCI Device ID Table
  *
@@ -1560,6 +1561,105 @@ static int i40e_set_mac(struct net_device *netdev, void *p)
 }
 
 /**
+ * i40e_vsi_setup_queue_map_mqprio - Prepares VSI tc_config to have queue
+ * configurations based on MQPRIO options.
+ * @vsi: the VSI being configured,
+ * @ctxt: VSI context structure
+ * @enabled_tc: number of traffic classes to enable
+ **/
+static int i40e_vsi_setup_queue_map_mqprio(struct i40e_vsi *vsi,
+					   struct i40e_vsi_context *ctxt,
+					   u8 enabled_tc)
+{
+	u8 netdev_tc = 0, offset = 0;
+	u16 qcount = 0, max_qcount, qmap, sections = 0;
+	int i, override_q, pow, num_qps, ret;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	sections = I40E_AQ_VSI_PROP_QUEUE_MAP_VALID;
+	sections |= I40E_AQ_VSI_PROP_SCHED_VALID;
+
+	vsi->tc_config.numtc = vsi->mqprio_qopt.qopt.num_tc;
+	vsi->tc_config.enabled_tc = enabled_tc ? enabled_tc : 1;
+
+	num_qps = vsi->mqprio_qopt.qopt.count[0];
+
+	/* find the next higher power-of-2 of num queue pairs */
+	pow = ilog2(num_qps);
+	if (!is_power_of_2(num_qps))
+		pow++;
+
+	qmap = (offset << I40E_AQ_VSI_TC_QUE_OFFSET_SHIFT) |
+		(pow << I40E_AQ_VSI_TC_QUE_NUMBER_SHIFT);
+
+	/* Setup queue offset/count for all TCs for given VSI */
+	max_qcount = vsi->mqprio_qopt.qopt.count[0];
+
+	for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++) {
+		/* See if the given TC is enabled for the given VSI */
+		if (vsi->tc_config.enabled_tc & BIT(i)) {
+			offset = vsi->mqprio_qopt.qopt.offset[i];
+			qcount = vsi->mqprio_qopt.qopt.count[i];
+
+			if (qcount > max_qcount)
+				max_qcount = qcount;
+
+			vsi->tc_config.tc_info[i].qoffset = offset;
+			vsi->tc_config.tc_info[i].qcount = qcount;
+			vsi->tc_config.tc_info[i].netdev_tc = netdev_tc++;
+
+		} else {
+			/* TC is not enabled so set the offset to
+			 * default queue and allocate one queue
+			 * for the given TC.
+			 */
+			vsi->tc_config.tc_info[i].qoffset = 0;
+			vsi->tc_config.tc_info[i].qcount = 1;
+			vsi->tc_config.tc_info[i].netdev_tc = 0;
+		}
+	}
+
+	/* Set actual Tx/Rx queue pairs */
+	vsi->num_queue_pairs = offset + qcount;
+
+	/* Setup queue TC[0].qmap for given VSI context */
+	ctxt->info.tc_mapping[0] = cpu_to_le16(qmap);
+
+	ctxt->info.mapping_flags |=
+					cpu_to_le16(I40E_AQ_VSI_QUE_MAP_CONTIG);
+
+	ctxt->info.queue_mapping[0] = cpu_to_le16(vsi->base_queue);
+	ctxt->info.valid_sections |= cpu_to_le16(sections);
+
+	/* Reconfigure RSS for main VSI with max queue count */
+	vsi->rss_size = max_qcount;
+
+	ret = i40e_vsi_config_rss(vsi);
+	if (ret) {
+		dev_info(&vsi->back->pdev->dev,
+			 "Failed to reconfig rss for num_queues (%u)\n",
+			 max_qcount);
+		return ret;
+	}
+	vsi->reconfig_rss = true;
+	dev_dbg(&vsi->back->pdev->dev,
+		"Reconfigured rss with num_queues (%u)\n", max_qcount);
+
+	/* Find queue count available for channel VSIs and starting offset
+	 * for channel VSIs
+	 */
+	override_q = vsi->mqprio_qopt.qopt.count[0];
+	if (override_q && (override_q < vsi->num_queue_pairs)) {
+		vsi->cnt_q_avail = vsi->num_queue_pairs - override_q;
+		atomic_set(&vsi->next_base_queue, override_q);
+	}
+
+	return 0;
+}
+
+/**
  * i40e_vsi_setup_queue_map - Setup a VSI queue map based on enabled_tc
  * @vsi: the VSI being setup
  * @ctxt: VSI context structure
@@ -1597,7 +1697,7 @@ static void i40e_vsi_setup_queue_map(struct i40e_vsi *vsi,
 			numtc = 1;
 		}
 	} else {
-		/* At least TC0 is enabled in case of non-DCB case */
+		/* At least TC0 is enabled in non-DCB, non-MQPRIO case */
 		numtc = 1;
 	}
 
@@ -3150,6 +3250,7 @@ static void i40e_vsi_config_dcb_rings(struct i40e_vsi *vsi)
 			rx_ring->dcb_tc = 0;
 			tx_ring->dcb_tc = 0;
 		}
+		return;
 	}
 
 	for (n = 0; n < I40E_MAX_TRAFFIC_CLASS; n++) {
@@ -4777,6 +4878,25 @@ static u8 i40e_dcb_get_enabled_tc(struct i40e_dcbx_config *dcbcfg)
 }
 
 /**
+ * i40e_mqprio_get_enabled_tc - Get enabled traffic classes
+ * @pf: PF being queried
+ *
+ * Query the current MQPRIO configuration and return the number of
+ * traffic classes enabled.
+ **/
+static u8 i40e_mqprio_get_enabled_tc(struct i40e_pf *pf)
+{
+	struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
+	u8 num_tc = vsi->mqprio_qopt.qopt.num_tc;
+	u8 enabled_tc = 1, i;
+
+	for (i = 1; i < num_tc; i++)
+		enabled_tc |= BIT(i);
+
+	return enabled_tc;
+}
+
+/**
  * i40e_pf_get_num_tc - Get enabled traffic classes for PF
  * @pf: PF being queried
  *
@@ -4789,7 +4909,10 @@ static u8 i40e_pf_get_num_tc(struct i40e_pf *pf)
 	u8 num_tc = 0;
 	struct i40e_dcbx_config *dcbcfg = &hw->local_dcbx_config;
 
-	/* If DCB is not enabled then always in single TC */
+	if (pf->flags & I40E_FLAG_TC_MQPRIO)
+		return pf->vsi[pf->lan_vsi]->mqprio_qopt.qopt.num_tc;
+
+	/* If neither MQPRIO nor DCB is enabled, then always in single TC */
 	if (!(pf->flags & I40E_FLAG_DCB_ENABLED))
 		return 1;
 
@@ -4818,7 +4941,12 @@ static u8 i40e_pf_get_num_tc(struct i40e_pf *pf)
  **/
 static u8 i40e_pf_get_tc_map(struct i40e_pf *pf)
 {
-	/* If DCB is not enabled for this PF then just return default TC */
+	if (pf->flags & I40E_FLAG_TC_MQPRIO)
+		return i40e_mqprio_get_enabled_tc(pf);
+
+	/* If neither MQPRIO nor DCB is enabled for this PF then just return
+	 * default TC
+	 */
 	if (!(pf->flags & I40E_FLAG_DCB_ENABLED))
 		return I40E_DEFAULT_TRAFFIC_CLASS;
 
@@ -4912,6 +5040,10 @@ static int i40e_vsi_configure_bw_alloc(struct i40e_vsi *vsi, u8 enabled_tc,
 	for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
 		bw_data.tc_bw_credits[i] = bw_share[i];
 
+	if ((vsi->back->flags & I40E_FLAG_TC_MQPRIO) ||
+	    !vsi->mqprio_qopt.qopt.hw)
+		return 0;
+
 	ret = i40e_aq_config_vsi_tc_bw(&vsi->back->hw, vsi->seid, &bw_data,
 				       NULL);
 	if (ret) {
@@ -4970,6 +5102,9 @@ static void i40e_vsi_config_netdev_tc(struct i40e_vsi *vsi, u8 enabled_tc)
 					vsi->tc_config.tc_info[i].qoffset);
 	}
 
+	if (pf->flags & I40E_FLAG_TC_MQPRIO)
+		return;
+
 	/* Assign UP2TC map for the VSI */
 	for (i = 0; i < I40E_MAX_USER_PRIORITY; i++) {
 		/* Get the actual TC# for the UP */
@@ -5020,7 +5155,8 @@ static int i40e_vsi_config_tc(struct i40e_vsi *vsi, u8 enabled_tc)
 	int i;
 
 	/* Check if enabled_tc is same as existing or new TCs */
-	if (vsi->tc_config.enabled_tc == enabled_tc)
+	if (vsi->tc_config.enabled_tc == enabled_tc &&
+	    vsi->mqprio_qopt.qopt.hw != TC_MQPRIO_HW_OFFLOAD)
 		return ret;
 
 	/* Enable ETS TCs with equal BW Share for now across all VSIs */
@@ -5043,7 +5179,30 @@ static int i40e_vsi_config_tc(struct i40e_vsi *vsi, u8 enabled_tc)
 	ctxt.vf_num = 0;
 	ctxt.uplink_seid = vsi->uplink_seid;
 	ctxt.info = vsi->info;
-	i40e_vsi_setup_queue_map(vsi, &ctxt, enabled_tc, false);
+
+	if (vsi->back->flags & I40E_FLAG_TC_MQPRIO) {
+		ret = i40e_vsi_setup_queue_map_mqprio(vsi, &ctxt, enabled_tc);
+		if (ret)
+			goto out;
+
+	} else {
+		i40e_vsi_setup_queue_map(vsi, &ctxt, enabled_tc, false);
+	}
+
+	/* On destroying the qdisc, reset vsi->rss_size, as number of enabled
+	 * queues changed.
+	 */
+	if (!vsi->mqprio_qopt.qopt.hw && vsi->reconfig_rss) {
+		vsi->rss_size = min_t(int, vsi->back->alloc_rss_size,
+				      vsi->num_queue_pairs);
+		ret = i40e_vsi_config_rss(vsi);
+		if (ret) {
+			dev_info(&vsi->back->pdev->dev,
+				 "Failed to reconfig rss for num_queues\n");
+			return ret;
+		}
+		vsi->reconfig_rss = false;
+	}
 
 	if (vsi->back->flags & I40E_FLAG_IWARP_ENABLED) {
 		ctxt.info.valid_sections |=
@@ -5051,7 +5210,9 @@ static int i40e_vsi_config_tc(struct i40e_vsi *vsi, u8 enabled_tc)
 		ctxt.info.queueing_opt_flags |= I40E_AQ_VSI_QUE_OPT_TCP_ENA;
 	}
 
-	/* Update the VSI after updating the VSI queue-mapping information */
+	/* Update the VSI after updating the VSI queue-mapping
+	 * information
+	 */
 	ret = i40e_aq_update_vsi_params(&vsi->back->hw, &ctxt, NULL);
 	if (ret) {
 		dev_info(&vsi->back->pdev->dev,
@@ -6168,48 +6329,142 @@ void i40e_down(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_validate_mqprio_queue_mapping - validate queue mapping info
+ * @vsi: the VSI being configured
+ * @mqprio_qopt: queue parametrs
+ **/
+int i40e_validate_mqprio_queue_mapping(struct i40e_vsi *vsi,
+				struct tc_mqprio_qopt_offload *mqprio_qopt)
+{
+	int i;
+
+	if ((mqprio_qopt->qopt.offset[0] != 0) ||
+	    (mqprio_qopt->qopt.num_tc < 1))
+		return -EINVAL;
+
+	for (i = 0; ; i++) {
+		if (!mqprio_qopt->qopt.count[i])
+			return -EINVAL;
+
+		if (mqprio_qopt->min_rate[i] || mqprio_qopt->max_rate[i])
+			return -EINVAL;
+
+		if (i >= mqprio_qopt->qopt.num_tc - 1)
+			break;
+
+		if (mqprio_qopt->qopt.offset[i + 1] !=
+		    (mqprio_qopt->qopt.offset[i] + mqprio_qopt->qopt.count[i]))
+			return -EINVAL;
+	}
+
+	if (vsi->num_queue_pairs <
+	    (mqprio_qopt->qopt.offset[i] + mqprio_qopt->qopt.count[i])) {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
  * i40e_setup_tc - configure multiple traffic classes
  * @netdev: net device to configure
- * @tc: number of traffic classes to enable
+ * @tc: pointer to struct tc_to_netdev
  **/
-static int i40e_setup_tc(struct net_device *netdev, u8 tc)
+static int i40e_setup_tc(struct net_device *netdev, struct tc_to_netdev *tc)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
 	struct i40e_pf *pf = vsi->back;
-	u8 enabled_tc = 0;
+	u8 enabled_tc = 0, num_tc = 0, hw = 0;
 	int ret = -EINVAL;
 	int i;
 
-	/* Check if DCB enabled to continue */
-	if (!(pf->flags & I40E_FLAG_DCB_ENABLED)) {
-		netdev_info(netdev, "DCB is not enabled for adapter\n");
-		goto exit;
+	if (tc->type == TC_SETUP_MQPRIO) {
+		hw = tc->mqprio->hw;
+		num_tc = tc->mqprio->num_tc;
+	} else if (tc->type == TC_SETUP_MQPRIO_EXT) {
+		hw = tc->mqprio_qopt->qopt.hw;
+		num_tc = tc->mqprio_qopt->qopt.num_tc;
 	}
 
-	/* Check if MFP enabled */
-	if (pf->flags & I40E_FLAG_MFP_ENABLED) {
-		netdev_info(netdev, "Configuring TC not supported in MFP mode\n");
-		goto exit;
+	if (!hw) {
+		pf->flags &= ~I40E_FLAG_TC_MQPRIO;
+		if (tc->type == TC_SETUP_MQPRIO_EXT)
+			memcpy(&vsi->mqprio_qopt, tc->mqprio_qopt,
+			       sizeof(*tc->mqprio_qopt));
+		goto config_tc;
 	}
 
-	/* Check whether tc count is within enabled limit */
-	if (tc > i40e_pf_get_num_tc(pf)) {
-		netdev_info(netdev, "TC count greater than enabled on link for adapter\n");
-		goto exit;
+	switch (hw) {
+	case TC_MQPRIO_HW_OFFLOAD_TCS:
+		pf->flags &= ~I40E_FLAG_TC_MQPRIO;
+		/* Check if DCB enabled to continue */
+		if (!(pf->flags & I40E_FLAG_DCB_ENABLED)) {
+			netdev_info(netdev,
+				    "DCB is not enabled for adapter\n");
+			goto exit;
+		}
+
+		/* Check if MFP enabled */
+		if (pf->flags & I40E_FLAG_MFP_ENABLED) {
+			netdev_info(netdev,
+				    "Configuring TC not supported in MFP mode\n");
+			goto exit;
+		}
+
+		/* Check whether tc count is within enabled limit */
+		if (num_tc > i40e_pf_get_num_tc(pf)) {
+			netdev_info(netdev,
+				    "TC count greater than enabled on link for adapter\n");
+			goto exit;
+		}
+		break;
+	case TC_MQPRIO_HW_OFFLOAD:
+		if (pf->flags & I40E_FLAG_DCB_ENABLED) {
+			netdev_info(netdev,
+				    "Full offload of TC Mqprio options is not supported when DCB is enabled\n");
+			goto exit;
+		}
+
+		/* Check if MFP enabled */
+		if (pf->flags & I40E_FLAG_MFP_ENABLED) {
+			netdev_info(netdev,
+				    "Configuring TC not supported in MFP mode\n");
+			goto exit;
+		}
+
+		ret = i40e_validate_mqprio_queue_mapping(vsi,
+							 tc->mqprio_qopt);
+		if (ret)
+			goto exit;
+
+		memcpy(&vsi->mqprio_qopt, tc->mqprio_qopt,
+		       sizeof(*tc->mqprio_qopt));
+
+		pf->flags |= I40E_FLAG_TC_MQPRIO;
+		pf->flags &= ~I40E_FLAG_DCB_ENABLED;
+
+		break;
+	default:
+		return -EINVAL;
 	}
 
+config_tc:
 	/* Generate TC map for number of tc requested */
-	for (i = 0; i < tc; i++)
+	for (i = 0; i < num_tc; i++)
 		enabled_tc |= BIT(i);
 
 	/* Requesting same TC configuration as already enabled */
-	if (enabled_tc == vsi->tc_config.enabled_tc)
+	if (enabled_tc == vsi->tc_config.enabled_tc &&
+	    hw != TC_MQPRIO_HW_OFFLOAD)
 		return 0;
 
 	/* Quiesce VSI queues */
 	i40e_quiesce_vsi(vsi);
 
+	if (!hw && !(pf->flags & I40E_FLAG_TC_MQPRIO))
+		i40e_remove_queue_channel(vsi);
+
 	/* Configure VSI for enabled TCs */
 	ret = i40e_vsi_config_tc(vsi, enabled_tc);
 	if (ret) {
@@ -6229,8 +6484,11 @@ static int i40e_setup_tc(struct net_device *netdev, u8 tc)
 
 	/* Unquiesce VSI */
 	i40e_unquiesce_vsi(vsi);
+	return ret;
 
 exit:
+	/* Reset the configuration data */
+	memset(&vsi->tc_config, 0, sizeof(vsi->tc_config));
 	i40e_unquiesce_vsi(vsi);
 	return ret;
 }
@@ -6238,12 +6496,7 @@ static int i40e_setup_tc(struct net_device *netdev, u8 tc)
 static int __i40e_setup_tc(struct net_device *netdev, u32 handle, __be16 proto,
 			   struct tc_to_netdev *tc)
 {
-	if (tc->type != TC_SETUP_MQPRIO)
-		return -EINVAL;
-
-	tc->mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
-
-	return i40e_setup_tc(netdev, tc->mqprio->num_tc);
+	return i40e_setup_tc(netdev, tc);
 }
 
 /**


^ permalink raw reply related

* [Intel-wired-lan] [PATCH 4/4] [next-queue]net: i40e: Add support to set max bandwidth rates for TCs offloaded via tc/mqprio
From: Amritha Nambiar @ 2017-05-20  0:58 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

This patch enables setting up maximum Tx rates for the traffic
classes in i40e. The maximum rate offloaded to the hardware through
the mqprio framework is configured for the VSI. Configuring
minimum Tx rate limit is not supported in the device. The minimum
usable value for Tx rate is 50Mbps.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
  queues 4 at 0 4 at 4 min_rate 0Mbit 0Mbit max_rate 55Mbit 60Mbit hw 2

To dump the bandwidth rates:

# tc qdisc show dev eth0
qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
             queues:(0:3) (4:7)
             min rates:0bit 0bit
             max rates:55Mbit 60Mbit

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |    2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |  102 ++++++++++++++++++++++++++-
 2 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index a62f65a..83a060d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -346,6 +346,8 @@ struct i40e_channel {
 	u8 enabled_tc;
 	struct i40e_aqc_vsi_properties_data info;
 
+	u32 max_tx_rate;
+
 	/* track this channel belongs to which VSI */
 	struct i40e_vsi *parent_vsi;
 };
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7f61d4f..3261dab 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -69,6 +69,8 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
 static void i40e_fdir_sb_setup(struct i40e_pf *pf);
 static int i40e_veb_get_bw_info(struct i40e_veb *veb);
 static int i40e_vsi_config_rss(struct i40e_vsi *vsi);
+static int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 ch_seid,
+			     u32 max_tx_rate);
 
 /* i40e_pci_tbl - PCI Device ID Table
  *
@@ -5033,7 +5035,7 @@ static int i40e_vsi_configure_bw_alloc(struct i40e_vsi *vsi, u8 enabled_tc,
 				       u8 *bw_share)
 {
 	struct i40e_aqc_configure_vsi_tc_bw_data bw_data;
-	i40e_status ret;
+	i40e_status ret = 0;
 	int i;
 
 	bw_data.tc_valid_bits = enabled_tc;
@@ -5041,8 +5043,20 @@ static int i40e_vsi_configure_bw_alloc(struct i40e_vsi *vsi, u8 enabled_tc,
 		bw_data.tc_bw_credits[i] = bw_share[i];
 
 	if ((vsi->back->flags & I40E_FLAG_TC_MQPRIO) ||
-	    !vsi->mqprio_qopt.qopt.hw)
-		return 0;
+	    !vsi->mqprio_qopt.qopt.hw) {
+		if (vsi->mqprio_qopt.max_rate[0]) {
+			u32 max_tx_rate = vsi->mqprio_qopt.max_rate[0];
+
+			max_tx_rate = (max_tx_rate * 8) / 1000000;
+
+			ret = i40e_set_bw_limit(vsi, vsi->seid, max_tx_rate);
+			if (ret)
+				dev_err(&vsi->back->pdev->dev,
+					"Failed to set tx rate (%u Mbps) for vsi->seid %u, error code %d.\n",
+					max_tx_rate, vsi->seid, ret);
+		}
+		return ret;
+	}
 
 	ret = i40e_aq_config_vsi_tc_bw(&vsi->back->hw, vsi->seid, &bw_data,
 				       NULL);
@@ -5297,6 +5311,71 @@ static void i40e_remove_queue_channel(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_set_bw_limit - setup BW limit based on max_tx_rate
+ * @vsi: the VSI being setup
+ * @ch_seid: seid of the channel (VSI)
+ * @max_tx_rate: max TX rate to be configured as BW limit
+ *
+ * This function sets up BW limit for a given channel (ch_seid)
+ * based on max TX rate specified.
+ **/
+static int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 ch_seid, u32 max_tx_rate)
+{
+	struct i40e_pf *pf = vsi->back;
+	int speed = 0;
+	int ret = 0;
+
+	switch (pf->hw.phy.link_info.link_speed) {
+	case I40E_LINK_SPEED_40GB:
+		speed = 40000;
+		break;
+	case I40E_LINK_SPEED_20GB:
+		speed = 20000;
+		break;
+	case I40E_LINK_SPEED_10GB:
+		speed = 10000;
+		break;
+	case I40E_LINK_SPEED_1GB:
+		speed = 1000;
+		break;
+	default:
+		break;
+	}
+
+	if (max_tx_rate > speed) {
+		dev_err(&pf->pdev->dev,
+			"Invalid tx rate %d specified for channel seid %d.",
+			max_tx_rate, ch_seid);
+		return -EINVAL;
+	}
+
+	if ((max_tx_rate < 50) && (max_tx_rate > 0)) {
+		dev_warn(&pf->pdev->dev,
+			 "Setting tx rate to minimum usable value of 50Mbps.\n");
+		max_tx_rate = 50;
+	}
+
+#define I40E_BW_CREDIT_DIVISOR 50     /* 50Mbps per BW credit */
+#define I40E_MAX_BW_INACTIVE_ACCUM 1
+
+	/* TX rate credits are in values of 50Mbps, 0 is disabled*/
+	ret = i40e_aq_config_vsi_bw_limit(&pf->hw, ch_seid,
+					  max_tx_rate / I40E_BW_CREDIT_DIVISOR,
+					  I40E_MAX_BW_INACTIVE_ACCUM,
+					  NULL);
+	if (ret)
+		dev_err(&pf->pdev->dev,
+			"Failed set tx rate (%u Mbps) for vsi->seid %u, error code %d.\n",
+			max_tx_rate, ch_seid, ret);
+	else
+		dev_info(&pf->pdev->dev,
+			 "Set tx rate of %u Mbps (count of 50Mbps %u) for vsi->seid %u\n",
+			 max_tx_rate, max_tx_rate / I40E_BW_CREDIT_DIVISOR,
+			 ch_seid);
+	return ret;
+}
+
+/**
  * i40e_is_any_channel - channel exist or not
  * @vsi: ptr to VSI to which channels are associated with
  *
@@ -5882,6 +5961,11 @@ int i40e_create_queue_channel(struct i40e_vsi *vsi,
 		 "Setup channel (id:%u) utilizing num_queues %d\n",
 		 ch->seid, ch->num_queue_pairs);
 
+	/* configure VSI for BW limit */
+	if (ch->max_tx_rate)
+		if (i40e_set_bw_limit(vsi, ch->seid, ch->max_tx_rate))
+			return -EINVAL;
+
 	/* in case of VF, this will be main SRIOV VSI */
 	ch->parent_vsi = vsi;
 
@@ -5918,6 +6002,13 @@ static int i40e_configure_queue_channel(struct i40e_vsi *vsi)
 				vsi->tc_config.tc_info[i].qcount;
 			ch->base_queue =
 				vsi->tc_config.tc_info[i].qoffset;
+			ch->max_tx_rate =
+				vsi->mqprio_qopt.max_rate[i];
+
+			/* Bandwidth limit through tc interface is in bytes/s,
+			 * change to Mbit/s
+			 */
+			ch->max_tx_rate = (ch->max_tx_rate * 8) / 1000000;
 
 			list_add_tail(&ch->list, &vsi->ch_list);
 
@@ -6346,8 +6437,11 @@ int i40e_validate_mqprio_queue_mapping(struct i40e_vsi *vsi,
 		if (!mqprio_qopt->qopt.count[i])
 			return -EINVAL;
 
-		if (mqprio_qopt->min_rate[i] || mqprio_qopt->max_rate[i])
+		if (mqprio_qopt->min_rate[i]) {
+			dev_err(&vsi->back->pdev->dev,
+				"Invalid min tx rate (greater than 0) specified\n");
 			return -EINVAL;
+		}
 
 		if (i >= mqprio_qopt->qopt.num_tc - 1)
 			break;


^ permalink raw reply related

* [Intel-wired-lan] [PATCH] i40e: Fix incorrect pf->flags
From: Tushar Dave @ 2017-05-20  1:01 UTC (permalink / raw)
  To: intel-wired-lan

Fix bug introduced by 'commit 47994c119a36e ("i40e: remove
hw_disabled_flags in favor of using separate flag bits")' that
mistakenly wipes out pf->flags.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d5c9c9e..6b98d34 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8821,9 +8821,9 @@ static int i40e_sw_init(struct i40e_pf *pf)
 		    (pf->hw.aq.api_min_ver > 4))) {
 		/* Supported in FW API version higher than 1.4 */
 		pf->flags |= I40E_FLAG_GENEVE_OFFLOAD_CAPABLE;
-		pf->flags = I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+		pf->flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
 	} else {
-		pf->flags = I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+		pf->flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
 	}
 
 	pf->eeprom_version = 0xDEAD;
-- 
1.9.1


^ permalink raw reply related

* [Intel-wired-lan] [PATCH] i40evf: Use le32_to_cpu before evaluating HW desc fields.
From: tndave @ 2017-05-20  1:12 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <1494976037.40313.1.camel@intel.com>



On 05/16/2017 04:07 PM, Jeff Kirsher wrote:
> On Mon, 2017-05-01 at 16:15 -0700, Tushar Dave wrote:
>> i40e hardware descriptor fields are in little-endian format. Driver
>> must use le32_to_cpu while evaluating these fields otherwise on
>> big-endian arch we end up evaluating incorrect values, cause errors
>> like:
>>
>> i40evf 0001:04:02.0: Expected response 0 from PF, received 285212672
>> i40evf 0001:04:02.1: Expected response 0 from PF, received 285212672
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> ---
>>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> Due to recent patches already applied to my tree, this patch does not
> apply cleanly.  Feel free to send an updated patch, if the change is
> still needed.
Hi Jeff,

I just tried the same patch on Dave's latest linux-net and it applies
cleanly. Wondering if I should clone your tree and make sure it applies?
If so, I may need to do same for all my future patches as well, thats a
extra bit of work!

Most of the time, mine are bug fixes and I always use linux-net when
upstreaming my patches for Intel Ethernet drivers.

-Tushar
>

^ permalink raw reply

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: Or Gerlitz @ 2017-05-20 21:15 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <149524122523.11022.4541073724650541658.stgit@anamdev.jf.intel.com>

On Sat, May 20, 2017 at 3:58 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> The following series introduces a new harware offload mode in tc/mqprio

Wait, we have already a HW QoS model introduced by John F and Co
couple of years ago,  right?

Please elaborate in few sentence if you are extending it/replacing it,
why we want to do that and what are the implications on existing
applications UAPI wise.

Below you just say in the new mode Qos is configured with knobs XYZ --
this is way not enough

> where the TCs, the queue configurations and bandwidth rate limits
> are offloaded to the hardware.


> The i40e driver enables the new mqprio hardware offload mechanism factoring the TCs, queue configuration and bandwidth rates by creating HW channel VSIs.
>
> In this mode, the priority to traffic class mapping and the user specified queue ranges are used to configure the traffic class when the 'hw' option is set to 2. This is achieved by creating HW channels(VSI). A new channel is created for each of the traffic class configuration offloaded via mqprio framework except for the first TC (TC0) which is for the main VSI. TC0 for the main VSI is also reconfigured as per user provided queue parameters. Finally, bandwidth rate limits are set on these traffic classes through the mqprio offload framework by sending these rates in addition to the number of TCs and the queue configurations.

^ permalink raw reply

* [Intel-wired-lan] [jkirsher-next-queue:dev-queue] BUILD INCOMPLETE 6c76cd8d8a7a6dc38bb19f29294982ab0fb6be16
From: kbuild test robot @ 2017-05-21  3:08 UTC (permalink / raw)
  To: intel-wired-lan

https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git  dev-queue
6c76cd8d8a7a6dc38bb19f29294982ab0fb6be16  ixgbe: fix incorrect status check

TIMEOUT after 2510m


Sorry we cannot finish the testset for your branch within a reasonable time.
It's our fault -- either some build server is down or some build worker is busy
doing bisects for _other_ trees. The branch will get more complete coverage and
possible error reports when our build infrastructure is restored or catches up.
There will be no more build success notification for this branch head, but you
can expect reasonably good test coverage after waiting for 1 day.

configs timed out: 141

alpha                            allmodconfig
alpha                            allyesconfig
alpha                               defconfig
arm64                             allnoconfig
arm64                               defconfig
arm                               allnoconfig
arm                         at91_dt_defconfig
arm                           efm32_defconfig
arm                          exynos_defconfig
arm                        multi_v5_defconfig
arm                        multi_v7_defconfig
arm                        shmobile_defconfig
arm                           sunxi_defconfig
blackfin                         allmodconfig
blackfin                         allyesconfig
blackfin                BF526-EZBRD_defconfig
blackfin                BF533-EZKIT_defconfig
blackfin            BF561-EZKIT-SMP_defconfig
blackfin                  TCM-BF537_defconfig
c6x                        evmc6678_defconfig
cris                             allmodconfig
cris                             allyesconfig
cris                 etrax-100lx_v2_defconfig
customconfig                                0
customconfig                                1
customconfig                                2
customconfig                                3
customconfig                                4
customconfig                                5
customconfig                                6
customconfig                                7
customconfig                                8
customconfig                                9
frv                                 defconfig
h8300                    h8300h-sim_defconfig
i386                             alldefconfig
i386                             allmodconfig
i386                              allnoconfig
i386                                defconfig
i386                            randconfig-a0
i386                            randconfig-i0
i386                            randconfig-i1
i386                            randconfig-n0
i386                            randconfig-r0
i386                            randconfig-s0
i386                            randconfig-s1
i386                          randconfig-x000
i386                          randconfig-x001
i386                          randconfig-x002
i386                          randconfig-x003
i386                          randconfig-x004
i386                          randconfig-x005
i386                          randconfig-x006
i386                          randconfig-x007
i386                          randconfig-x008
i386                          randconfig-x009
i386                          randconfig-x010
i386                          randconfig-x011
i386                          randconfig-x012
i386                          randconfig-x013
i386                          randconfig-x014
i386                          randconfig-x015
i386                          randconfig-x016
i386                          randconfig-x017
i386                          randconfig-x018
i386                          randconfig-x019
ia64                             alldefconfig
ia64                             allmodconfig
ia64                              allnoconfig
ia64                             allyesconfig
ia64                                defconfig
m32r                       m32104ut_defconfig
m32r                     mappi3.smp_defconfig
m32r                         opsput_defconfig
m32r                           usrv_defconfig
m68k                             allmodconfig
m68k                             allyesconfig
m68k                       m5475evb_defconfig
m68k                          multi_defconfig
m68k                           sun3_defconfig
microblaze                      mmu_defconfig
microblaze                    nommu_defconfig
mips                           32r2_defconfig
mips                         64r6el_defconfig
mips                             allmodconfig
mips                              allnoconfig
mips                             allyesconfig
mips                      fuloong2e_defconfig
mips                                   jz4740
mips                      malta_kvm_defconfig
mips                                     txx9
mn10300                     asb2364_defconfig
nios2                         10m50_defconfig
openrisc                    or1ksim_defconfig
parisc                           allmodconfig
parisc                            allnoconfig
parisc                           allyesconfig
parisc                         b180_defconfig
parisc                        c3000_defconfig
parisc                              defconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
powerpc                          allyesconfig
powerpc                             defconfig
powerpc                       ppc64_defconfig
s390                             allmodconfig
s390                             allyesconfig
s390                        default_defconfig
score                      spct6600_defconfig
sh                               allmodconfig
sh                                allnoconfig
sh                               allyesconfig
sh                          rsk7269_defconfig
sh                  sh7785lcr_32bit_defconfig
sh                            titan_defconfig
sparc64                          allmodconfig
sparc64                           allnoconfig
sparc64                          allyesconfig
sparc64                             defconfig
sparc                            allmodconfig
sparc                            allyesconfig
sparc                               defconfig
tile                             allmodconfig
tile                             allyesconfig
tile                         tilegx_defconfig
um                             i386_defconfig
um                           x86_64_defconfig
x86_64                             acpi-redef
x86_64                           allmodconfig
x86_64                           allyesconfig
x86_64                           allyesdebian
x86_64                                  kexec
x86_64                                    lkp
x86_64                                nfsroot
x86_64                          randconfig-i0
x86_64                                   rhel
x86_64                               rhel-7.2
xtensa                           allmodconfig
xtensa                           allyesconfig
xtensa                       common_defconfig
xtensa                          iss_defconfig

configs tested: 31

i386                               tinyconfig
x86_64                 randconfig-x019-201720
x86_64                 randconfig-x010-201720
x86_64                 randconfig-x015-201720
x86_64                 randconfig-x014-201720
x86_64                 randconfig-x012-201720
x86_64                 randconfig-x018-201720
x86_64                 randconfig-x017-201720
x86_64                 randconfig-x016-201720
x86_64                 randconfig-x013-201720
x86_64                 randconfig-x011-201720
x86_64                 randconfig-x006-201720
x86_64                 randconfig-x007-201720
x86_64                 randconfig-x001-201720
x86_64                 randconfig-x004-201720
x86_64                 randconfig-x005-201720
x86_64                 randconfig-x000-201720
x86_64                 randconfig-x008-201720
x86_64                 randconfig-x002-201720
x86_64                 randconfig-x003-201720
x86_64                 randconfig-x009-201720
i386                 randconfig-x074-05150639
i386                 randconfig-x076-05150639
i386                 randconfig-x070-05150639
i386                 randconfig-x077-05150639
i386                 randconfig-x078-05150639
i386                 randconfig-x075-05150639
i386                 randconfig-x072-05150639
i386                 randconfig-x071-05150639
i386                 randconfig-x079-05150639
i386                 randconfig-x073-05150639

Thanks,
Fengguang

^ permalink raw reply

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: Alexander Duyck @ 2017-05-21 22:35 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <CAJ3xEMjTDW7sQ5CFmTyR5xMpin_WFFQg-S8imWMBpXWuS1zvVw@mail.gmail.com>

On Sat, May 20, 2017 at 2:15 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sat, May 20, 2017 at 3:58 AM, Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>> The following series introduces a new harware offload mode in tc/mqprio
>
> Wait, we have already a HW QoS model introduced by John F and Co
> couple of years ago,  right?

I assume you are referring to the ETS portion of DCBX? If so then yes
we have something there, but there is a fairly high level of
complexity and dependencies in order to enable that. What we have in
mind doesn't really fit with DCBX and the problem is the spec calls
out that you really have to have support for DCBX in order to make use
of ETS. What we are trying to do here is provide a lightweight way of
doing this configuration similar to how mqprio was already providing a
lightweight way of enabling multiple traffic classes.

> Please elaborate in few sentence if you are extending it/replacing it,
> why we want to do that and what are the implications on existing
> applications UAPI wise.

This is meant to be an extension of the existing structure. It can be
ignored by the driver and instead only have the basic hw offload
supported. In such a case the command will still return success but
any rate limits and queue configuration can be ignored. In the case of
the current implementation the queue configuration was already being
ignored so we opted to re-purpose the "hw" flag so that if you pass a
value greater than 1 and the driver doesn't recognize the value it has
the option to just ignore the extra bits it doesn't recognize and
return 1 as it did before for the hw flag in mqprio.

> Below you just say in the new mode Qos is configured with knobs XYZ --
> this is way not enough

Can you clarify what you are asking for here? Amritha included an
example in patch 1 of a command line that enabled 2 traffic classes
with rate limits. Is there something more specific you are looking
for?

Thanks.

- Alex

^ permalink raw reply

* [Intel-wired-lan] [PATCH] ixgbe: prevent driver configuration changes while XDP is loaded
From: Alexander Duyck @ 2017-05-21 23:02 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <20170502050634.7459.50968.stgit@john-Precision-Tower-5810>

On Mon, May 1, 2017 at 10:06 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> XDP checks to ensure the MTU is valid and LRO is disabled when it is
> loaded. But user configuration after XDP is loaded could change these
> and cause a misconfiguration.
>
> This patch adds checks to ensure config changes are valid.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index ee20a2b..156357e 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -6437,6 +6437,19 @@ static void ixgbe_free_all_rx_resources(struct ixgbe_adapter *adapter)
>  static int ixgbe_change_mtu(struct net_device *netdev, int new_mtu)
>  {
>         struct ixgbe_adapter *adapter = netdev_priv(netdev);
> +       int i, frame_size = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> +
> +       /* If XDP is running enforce MTU limitations */
> +       for (i = 0; i < adapter->num_rx_queues; i++) {
> +               struct ixgbe_ring *ring = adapter->rx_ring[i];
> +
> +               if (frame_size > ixgbe_rx_bufsz(ring)) {
> +                       e_warn(probe,
> +                              "Setting MTU > %i with XDP is not supported\n",
> +                              ixgbe_rx_bufsz(ring));
> +                       return -EINVAL;
> +               }
> +       }
>

There are a few issues with this piece.

First the Rx buffer size gets updated depending on if we are using
jumbo frames or not. We either use 1.5K or 3K to hold the contents of
the frame. Instead of using ixgbe_rx_bufsz you might just make it so
that you use IXGBE_MAX_FRAME_BUILD_SKB if the adapter flag for
LEGACY_RX is not set, otherwise you can probably just use
IXGBE_RXBUFFER_2K.

Also you should only be performing this check if xdp_prog is set and
it doesn't look like you are checking for that.

>         /*
>          * For 82599EB we cannot allow legacy VFs to enable their receive
> @@ -9291,6 +9304,10 @@ static netdev_features_t ixgbe_fix_features(struct net_device *netdev,
>         if (!(adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE))
>                 features &= ~NETIF_F_LRO;
>
> +       /* If XDP is enabled we can not enable LRO */
> +       if (adapter->xdp_prog)
> +               features &= ~NETIF_F_LRO;
> +
>         return features;
>  }
>
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan at lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* [Intel-wired-lan] [PATCH] i40evf: Use le32_to_cpu before evaluating HW desc fields.
From: Alexander Duyck @ 2017-05-21 23:08 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <8607da8f-51fa-b3e5-2b71-a28863f1fe84@oracle.com>

On Fri, May 19, 2017 at 6:12 PM, tndave <tushar.n.dave@oracle.com> wrote:
>
>
> On 05/16/2017 04:07 PM, Jeff Kirsher wrote:
>>
>> On Mon, 2017-05-01 at 16:15 -0700, Tushar Dave wrote:
>>>
>>> i40e hardware descriptor fields are in little-endian format. Driver
>>> must use le32_to_cpu while evaluating these fields otherwise on
>>> big-endian arch we end up evaluating incorrect values, cause errors
>>> like:
>>>
>>> i40evf 0001:04:02.0: Expected response 0 from PF, received 285212672
>>> i40evf 0001:04:02.1: Expected response 0 from PF, received 285212672
>>>
>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>> ---
>>>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>>
>> Due to recent patches already applied to my tree, this patch does not
>> apply cleanly.  Feel free to send an updated patch, if the change is
>> still needed.
>
> Hi Jeff,
>
> I just tried the same patch on Dave's latest linux-net and it applies
> cleanly. Wondering if I should clone your tree and make sure it applies?
> If so, I may need to do same for all my future patches as well, thats a
> extra bit of work!
>
> Most of the time, mine are bug fixes and I always use linux-net when
> upstreaming my patches for Intel Ethernet drivers.
>

You probably should just be cloning the dev-queue branch of Jeff's
tree. The issue is I believe he has some virtchnl changes from Jesse
and they are conflicting with your change.

- Alex

^ permalink raw reply

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: Or Gerlitz @ 2017-05-22  3:25 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <CAKgT0Ue6tDOE8L4thktyje-erGkVMmtO7zyp0FRgmRPYOppjQA@mail.gmail.com>

On Mon, May 22, 2017 at 1:35 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Sat, May 20, 2017 at 2:15 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Sat, May 20, 2017 at 3:58 AM, Amritha Nambiar
>> <amritha.nambiar@intel.com> wrote:
>>> The following series introduces a new harware offload mode in tc/mqprio
>>
>> Wait, we have already a HW QoS model introduced by John F and Co
>> couple of years ago,  right?
>
> I assume you are referring to the ETS portion of DCBX? If so then yes
> we have something there, but there is a fairly high level of
> complexity and dependencies in order to enable that. What we have in
> mind doesn't really fit with DCBX and the problem is the spec calls
> out that you really have to have support for DCBX in order to make use
> of ETS. What we are trying to do here is provide a lightweight way of
> doing this configuration similar to how mqprio was already providing a
> lightweight way of enabling multiple traffic classes.

UAPI wise, we are talking on DCB, not DCBX, right? the X-portion comes
into play if some user-space entity run LLDP traffic and calls into
the kernel to configure (say) ETS. If there are some issues to use the
existing user-space tool (lldpad tool/daemon) to do DCB without X, one
can/should fix them or introduce another/simpler tool that in a
lightweight manner only configures things w.o exchanging.

So to clarify, the essence of this series is introducing a 2nd way to
configure ETS and rate limit?

>> Please elaborate in few sentence if you are extending it/replacing it,
>> why we want to do that and what are the implications on existing
>> applications UAPI wise.

> This is meant to be an extension of the existing structure. It can be
> ignored by the driver and instead only have the basic hw offload
> supported. In such a case the command will still return success but
> any rate limits and queue configuration can be ignored. In the case of
> the current implementation the queue configuration was already being
> ignored so we opted to re-purpose the "hw" flag so that if you pass a
> value greater than 1 and the driver doesn't recognize the value it has
> the option to just ignore the extra bits it doesn't recognize and
> return 1 as it did before for the hw flag in mqprio.

So the user asked to configure something and the kernel returned
success but the config was not plugged to the hw? sounds to me like
possible (probable) source of troubles and complaints.

>> Below you just say in the new mode Qos is configured with knobs XYZ --
>> this is way not enough

> Can you clarify what you are asking for here? Amritha included an
> example in patch 1 of a command line that enabled 2 traffic classes
> with rate limits. Is there something more specific you are looking for?

you were referring to the questions I posted, so we can continue the
discussion from here.

^ permalink raw reply

* [Intel-wired-lan] [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
From: Duyck, Alexander H @ 2017-05-22 16:40 UTC (permalink / raw)
  To: intel-wired-lan
In-Reply-To: <CAJ3xEMhtgt4==-GbAWjT8fGMQhoRAzHhFbaVgrfWgJhSaxXr9g@mail.gmail.com>

On Mon, 2017-05-22 at 06:25 +0300, Or Gerlitz wrote:
> On Mon, May 22, 2017 at 1:35 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > 
> > On Sat, May 20, 2017 at 2:15 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> > > 
> > > On Sat, May 20, 2017 at 3:58 AM, Amritha Nambiar
> > > <amritha.nambiar@intel.com> wrote:
> > > > 
> > > > The following series introduces a new harware offload mode in tc/mqprio
> > > 
> > > Wait, we have already a HW QoS model introduced by John F and Co
> > > couple of years ago,  right?
> > 
> > I assume you are referring to the ETS portion of DCBX? If so then yes
> > we have something there, but there is a fairly high level of
> > complexity and dependencies in order to enable that. What we have in
> > mind doesn't really fit with DCBX and the problem is the spec calls
> > out that you really have to have support for DCBX in order to make use
> > of ETS. What we are trying to do here is provide a lightweight way of
> > doing this configuration similar to how mqprio was already providing a
> > lightweight way of enabling multiple traffic classes.
> 
> UAPI wise, we are talking on DCB, not DCBX, right? the X-portion comes
> into play if some user-space entity run LLDP traffic and calls into
> the kernel to configure (say) ETS. If there are some issues to use the
> existing user-space tool (lldpad tool/daemon) to do DCB without X, one
> can/should fix them or introduce another/simpler tool that in a
> lightweight manner only configures things w.o exchanging.
> 
> So to clarify, the essence of this series is introducing a 2nd way to
> configure ETS and rate limit?

Sort of. Basically the idea is we can use several different approaches
to enable queue configuration and rate limits. So we are adding two
pieces of functionality.

The first block allows for configuring queue counts and layout.
Historically DCB/DCBX hasn't allowed us to specify that.

The second bit is that we added support for rate limiting. I am
actually basing it on what we had for SR-IOV rate limiting as that is
how this is working in i40e. However the basic attributes we are adding
should work for most ETS type use cases though it might need to use the
min rates instead of the max rates as we do in i40e.

> > 
> > > 
> > > Please elaborate in few sentence if you are extending it/replacing it,
> > > why we want to do that and what are the implications on existing
> > > applications UAPI wise.
> 
> > 
> > This is meant to be an extension of the existing structure. It can be
> > ignored by the driver and instead only have the basic hw offload
> > supported. In such a case the command will still return success but
> > any rate limits and queue configuration can be ignored. In the case of
> > the current implementation the queue configuration was already being
> > ignored so we opted to re-purpose the "hw" flag so that if you pass a
> > value greater than 1 and the driver doesn't recognize the value it has
> > the option to just ignore the extra bits it doesn't recognize and
> > return 1 as it did before for the hw flag in mqprio.
> 
> So the user asked to configure something and the kernel returned
> success but the config was not plugged to the hw? sounds to me like
> possible (probable) source of troubles and complaints.

That is possible, however the issue already existed. The queue
configuration could be specified with the mqprio configuration and be
totally ignored. I opted for just maintaining the existing behavior and
moving forward and providing some input via the ability to report what
"version" of the hardware offload we are supporting.

The plan is that legacy devices can fall back into the setup where they
support mode 1, however if they support hw mode 2 then we will fail to
initialize if we don't support something that is being requested.

> > 
> > > 
> > > Below you just say in the new mode Qos is configured with knobs XYZ --
> > > this is way not enough
> 
> > 
> > Can you clarify what you are asking for here? Amritha included an
> > example in patch 1 of a command line that enabled 2 traffic classes
> > with rate limits. Is there something more specific you are looking for?
> 
> you were referring to the questions I posted, so we can continue the
> discussion from here.
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan at osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox