Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: Ben Hutchings @ 2011-11-14 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, eilong, pstaszewski, netdev, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher
In-Reply-To: <1321286614.2272.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 17:03 +0100, Eric Dumazet wrote:
> One of the thing we discussed during netdev 2011 conference was the idea
> to change some network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to network stack.
> 
> In old days, we allocated skbs when populating the RX ring.
> 
> This means bringing into cpu cache sk_buff and skb_shared_info cache
> lines (since we clear/initialize them), then 'queue' skb->data to NIC.
> 
> By the time NIC fills a frame in skb->data buffer and host can process
> it, cpu probably threw away the cache lines from its caches, because lot
> of things happened between the allocation and final use.

This is definitely a win so long as skbs are getting merged by GRO.
However, those cache misses take less time than skb allocation, so
preallocation of skbs normally reduces latency.  This is why sfc has an
adaptive buffer allocation behaviour.

> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer. And use build_skb() at RX completion to
> attach a data buffer (now filled with an ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to network
> stack.
> 
> build_skb() is the function to allocate an skb, caller providing the
> data buffer that should be attached to it. Drivers are expected to call 
> skb_reserve() right after build_skb() to adjust skb->data to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
> drivers might add a hardware provided alignment)
[...]

The option to attach a heap-allocated buffer rather than allocating a
header buffer and attaching a page might shift the balance.  I'll have
to test this (but don't hold your breath).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eilon Greenstein @ 2011-11-14 16:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev@vger.kernel.org, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher, pstaszewski@itcare.pl
In-Reply-To: <1321286734.2272.48.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 08:05 -0800, Eric Dumazet wrote:
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to reduce the need by 64 bytes
> (2112 -> 2048 bytes)
> 
> Instead of allocating a full cache line at the end of packet for
> alignment, we can use the fact that skb_shared_info sits at the end of
> skb->head, and we can use this room, if we convert bnx2x to new
> build_skb() infrastructure.
> 
> skb_shared_info will be initialized after hardware finished its
> transfert, so we can eventually overwrite the final padding.
> 
> Using build_skb() also reduces cache line misses in the driver, since we
> use cache hot skb instead of cold ones. Number of in-flight sk_buff
> structures is lower, they are recycled while still hot.
> 
> Performance results :
> 
> (820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)

Wow! This is very impressive. The only problem will be to back port this
driver now... But it is definitely worth the effort :)

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>

> CC: Eilon Greenstein <eilong@broadcom.com>
> CC: Ben Hutchings <bhutchings@solarflare.com>
> CC: Tom Herbert <therbert@google.com>
> CC: Jamal Hadi Salim <hadi@mojatatu.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>
> CC: Thomas Graf <tgraf@infradead.org>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h         |   30 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c     |  265 ++++------
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h     |   33 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |    6
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |    4
>  5 files changed, 170 insertions(+), 168 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> index 4b90b51..0f7b7a4 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> @@ -293,8 +293,13 @@ enum {
>  #define FCOE_TXQ_IDX(bp)       (MAX_ETH_TXQ_IDX(bp))
> 
>  /* fast path */
> +/*
> + * This driver uses new build_skb() API :
> + * RX ring buffer contains pointer to kmalloc() data only,
> + * skb are built only after Hardware filled the frame.
> + */
>  struct sw_rx_bd {
> -       struct sk_buff  *skb;
> +       u8              *data;
>         DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
> 
> @@ -424,8 +429,8 @@ union host_hc_status_block {
> 
>  struct bnx2x_agg_info {
>         /*
> -        * First aggregation buffer is an skb, the following - are pages.
> -        * We will preallocate the skbs for each aggregation when
> +        * First aggregation buffer is a data buffer, the following - are pages.
> +        * We will preallocate the data buffer for each aggregation when
>          * we open the interface and will replace the BD at the consumer
>          * with this one when we receive the TPA_START CQE in order to
>          * keep the Rx BD ring consistent.
> @@ -439,6 +444,7 @@ struct bnx2x_agg_info {
>         u16                     parsing_flags;
>         u16                     vlan_tag;
>         u16                     len_on_bd;
> +       u32                     rxhash;
>  };
> 
>  #define Q_STATS_OFFSET32(stat_name) \
> @@ -1187,10 +1193,20 @@ struct bnx2x {
>  #define ETH_MAX_JUMBO_PACKET_SIZE      9600
> 
>         /* Max supported alignment is 256 (8 shift) */
> -#define BNX2X_RX_ALIGN_SHIFT           ((L1_CACHE_SHIFT < 8) ? \
> -                                        L1_CACHE_SHIFT : 8)
> -       /* FW use 2 Cache lines Alignment for start packet and size  */
> -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> +#define BNX2X_RX_ALIGN_SHIFT           min(8, L1_CACHE_SHIFT)
> +
> +       /* FW uses 2 Cache lines Alignment for start packet and size
> +        *
> +        * We assume skb_build() uses sizeof(struct skb_shared_info) bytes
> +        * at the end of skb->data, to avoid wasting a full cache line.
> +        * This reduces memory use (skb->truesize).
> +        */
> +#define BNX2X_FW_RX_ALIGN_START        (1UL << BNX2X_RX_ALIGN_SHIFT)
> +
> +#define BNX2X_FW_RX_ALIGN_END                                  \
> +       max(1UL << BNX2X_RX_ALIGN_SHIFT,                        \
> +           SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>  #define BNX2X_PXP_DRAM_ALIGN           (BNX2X_RX_ALIGN_SHIFT - 5)
> 
>         struct host_sp_status_block *def_status_blk;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 13dad92..0d60b9e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -294,8 +294,21 @@ static void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
>            fp->last_max_sge, fp->rx_sge_prod);
>  }
> 
> +/* Set Toeplitz hash value in the skb using the value from the
> + * CQE (calculated by HW).
> + */
> +static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
> +                           const struct eth_fast_path_rx_cqe *cqe)
> +{
> +       /* Set Toeplitz hash from CQE */
> +       if ((bp->dev->features & NETIF_F_RXHASH) &&
> +           (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> +               return le32_to_cpu(cqe->rss_hash_result);
> +       return 0;
> +}
> +
>  static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> -                           struct sk_buff *skb, u16 cons, u16 prod,
> +                           u16 cons, u16 prod,
>                             struct eth_fast_path_rx_cqe *cqe)
>  {
>         struct bnx2x *bp = fp->bp;
> @@ -310,9 +323,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         if (tpa_info->tpa_state != BNX2X_TPA_STOP)
>                 BNX2X_ERR("start of bin not in stop [%d]\n", queue);
> 
> -       /* Try to map an empty skb from the aggregation info  */
> +       /* Try to map an empty data buffer from the aggregation info  */
>         mapping = dma_map_single(&bp->pdev->dev,
> -                                first_buf->skb->data,
> +                                first_buf->data + NET_SKB_PAD,
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
>         /*
>          *  ...if it fails - move the skb from the consumer to the producer
> @@ -322,15 +335,15 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> 
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>                 /* Move the BD from the consumer to the producer */
> -               bnx2x_reuse_rx_skb(fp, cons, prod);
> +               bnx2x_reuse_rx_data(fp, cons, prod);
>                 tpa_info->tpa_state = BNX2X_TPA_ERROR;
>                 return;
>         }
> 
> -       /* move empty skb from pool to prod */
> -       prod_rx_buf->skb = first_buf->skb;
> +       /* move empty data from pool to prod */
> +       prod_rx_buf->data = first_buf->data;
>         dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
> -       /* point prod_bd to new skb */
> +       /* point prod_bd to new data */
>         prod_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>         prod_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> 
> @@ -344,6 +357,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         tpa_info->tpa_state = BNX2X_TPA_START;
>         tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
>         tpa_info->placement_offset = cqe->placement_offset;
> +       tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>         fp->tpa_queue_used |= (1 << queue);
> @@ -471,11 +485,12 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>  {
>         struct bnx2x_agg_info *tpa_info = &fp->tpa_info[queue];
>         struct sw_rx_bd *rx_buf = &tpa_info->first_buf;
> -       u8 pad = tpa_info->placement_offset;
> +       u32 pad = tpa_info->placement_offset;
>         u16 len = tpa_info->len_on_bd;
> -       struct sk_buff *skb = rx_buf->skb;
> +       struct sk_buff *skb = NULL;
> +       u8 *data = rx_buf->data;
>         /* alloc new skb */
> -       struct sk_buff *new_skb;
> +       u8 *new_data;
>         u8 old_tpa_state = tpa_info->tpa_state;
> 
>         tpa_info->tpa_state = BNX2X_TPA_STOP;
> @@ -486,18 +501,18 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>         if (old_tpa_state == BNX2X_TPA_ERROR)
>                 goto drop;
> 
> -       /* Try to allocate the new skb */
> -       new_skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> +       /* Try to allocate the new data */
> +       new_data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> 
>         /* Unmap skb in the pool anyway, as we are going to change
>            pool entry status to BNX2X_TPA_STOP even if new skb allocation
>            fails. */
>         dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
>                          fp->rx_buf_size, DMA_FROM_DEVICE);
> +       if (likely(new_data))
> +               skb = build_skb(data);
> 
> -       if (likely(new_skb)) {
> -               prefetch(skb);
> -               prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +       if (likely(skb)) {
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (pad + len > fp->rx_buf_size) {
> @@ -509,8 +524,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
>  #endif
> 
> -               skb_reserve(skb, pad);
> +               skb_reserve(skb, pad + NET_SKB_PAD);
>                 skb_put(skb, len);
> +               skb->rxhash = tpa_info->rxhash;
> 
>                 skb->protocol = eth_type_trans(skb, bp->dev);
>                 skb->ip_summed = CHECKSUM_UNNECESSARY;
> @@ -526,8 +542,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
> 
> 
> -               /* put new skb in bin */
> -               rx_buf->skb = new_skb;
> +               /* put new data in bin */
> +               rx_buf->data = new_data;
> 
>                 return;
>         }
> @@ -539,19 +555,6 @@ drop:
>         fp->eth_q_stats.rx_skb_alloc_failed++;
>  }
> 
> -/* Set Toeplitz hash value in the skb using the value from the
> - * CQE (calculated by HW).
> - */
> -static inline void bnx2x_set_skb_rxhash(struct bnx2x *bp, union eth_rx_cqe *cqe,
> -                                       struct sk_buff *skb)
> -{
> -       /* Set Toeplitz hash from CQE */
> -       if ((bp->dev->features & NETIF_F_RXHASH) &&
> -           (cqe->fast_path_cqe.status_flags &
> -            ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> -               skb->rxhash =
> -               le32_to_cpu(cqe->fast_path_cqe.rss_hash_result);
> -}
> 
>  int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>  {
> @@ -594,6 +597,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 u8 cqe_fp_flags;
>                 enum eth_rx_cqe_type cqe_fp_type;
>                 u16 len, pad;
> +               u8 *data;
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (unlikely(bp->panic))
> @@ -604,13 +608,6 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 bd_prod = RX_BD(bd_prod);
>                 bd_cons = RX_BD(bd_cons);
> 
> -               /* Prefetch the page containing the BD descriptor
> -                  at producer's index. It will be needed when new skb is
> -                  allocated */
> -               prefetch((void *)(PAGE_ALIGN((unsigned long)
> -                                            (&fp->rx_desc_ring[bd_prod])) -
> -                                 PAGE_SIZE + 1));
> -
>                 cqe = &fp->rx_comp_ring[comp_ring_cons];
>                 cqe_fp = &cqe->fast_path_cqe;
>                 cqe_fp_flags = cqe_fp->type_error_flags;
> @@ -626,125 +623,110 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 if (unlikely(CQE_TYPE_SLOW(cqe_fp_type))) {
>                         bnx2x_sp_event(fp, cqe);
>                         goto next_cqe;
> +               }
> +               rx_buf = &fp->rx_buf_ring[bd_cons];
> +               data = rx_buf->data;
> 
> -               /* this is an rx packet */
> -               } else {
> -                       rx_buf = &fp->rx_buf_ring[bd_cons];
> -                       skb = rx_buf->skb;
> -                       prefetch(skb);
> -
> -                       if (!CQE_TYPE_FAST(cqe_fp_type)) {
> +               if (!CQE_TYPE_FAST(cqe_fp_type)) {
>  #ifdef BNX2X_STOP_ON_ERROR
> -                               /* sanity check */
> -                               if (fp->disable_tpa &&
> -                                   (CQE_TYPE_START(cqe_fp_type) ||
> -                                    CQE_TYPE_STOP(cqe_fp_type)))
> -                                       BNX2X_ERR("START/STOP packet while "
> -                                                 "disable_tpa type %x\n",
> -                                                 CQE_TYPE(cqe_fp_type));
> +                       /* sanity check */
> +                       if (fp->disable_tpa &&
> +                           (CQE_TYPE_START(cqe_fp_type) ||
> +                            CQE_TYPE_STOP(cqe_fp_type)))
> +                               BNX2X_ERR("START/STOP packet while "
> +                                         "disable_tpa type %x\n",
> +                                         CQE_TYPE(cqe_fp_type));
>  #endif
> 
> -                               if (CQE_TYPE_START(cqe_fp_type)) {
> -                                       u16 queue = cqe_fp->queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_start on queue %d\n",
> -                                          queue);
> -
> -                                       bnx2x_tpa_start(fp, queue, skb,
> -                                                       bd_cons, bd_prod,
> -                                                       cqe_fp);
> -
> -                                       /* Set Toeplitz hash for LRO skb */
> -                                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> -
> -                                       goto next_rx;
> +                       if (CQE_TYPE_START(cqe_fp_type)) {
> +                               u16 queue = cqe_fp->queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_start on queue %d\n",
> +                                  queue);
> 
> -                               } else {
> -                                       u16 queue =
> -                                               cqe->end_agg_cqe.queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_stop on queue %d\n",
> -                                          queue);
> +                               bnx2x_tpa_start(fp, queue,
> +                                               bd_cons, bd_prod,
> +                                               cqe_fp);
> +                               goto next_rx;
> +                       } else {
> +                               u16 queue =
> +                                       cqe->end_agg_cqe.queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_stop on queue %d\n",
> +                                  queue);
> 
> -                                       bnx2x_tpa_stop(bp, fp, queue,
> -                                                      &cqe->end_agg_cqe,
> -                                                      comp_ring_cons);
> +                               bnx2x_tpa_stop(bp, fp, queue,
> +                                              &cqe->end_agg_cqe,
> +                                              comp_ring_cons);
>  #ifdef BNX2X_STOP_ON_ERROR
> -                                       if (bp->panic)
> -                                               return 0;
> +                               if (bp->panic)
> +                                       return 0;
>  #endif
> 
> -                                       bnx2x_update_sge_prod(fp, cqe_fp);
> -                                       goto next_cqe;
> -                               }
> +                               bnx2x_update_sge_prod(fp, cqe_fp);
> +                               goto next_cqe;
>                         }
> -                       /* non TPA */
> -                       len = le16_to_cpu(cqe_fp->pkt_len);
> -                       pad = cqe_fp->placement_offset;
> -                       dma_sync_single_for_cpu(&bp->pdev->dev,
> +               }
> +               /* non TPA */
> +               len = le16_to_cpu(cqe_fp->pkt_len);
> +               pad = cqe_fp->placement_offset;
> +               dma_sync_single_for_cpu(&bp->pdev->dev,
>                                         dma_unmap_addr(rx_buf, mapping),
> -                                                      pad + RX_COPY_THRESH,
> -                                                      DMA_FROM_DEVICE);
> -                       prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +                                       pad + RX_COPY_THRESH,
> +                                       DMA_FROM_DEVICE);
> +               pad += NET_SKB_PAD;
> +               prefetch(data + pad); /* speedup eth_type_trans() */
> +               /* is this an error packet? */
> +               if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +                       DP(NETIF_MSG_RX_ERR,
> +                          "ERROR  flags %x  rx packet %u\n",
> +                          cqe_fp_flags, sw_comp_cons);
> +                       fp->eth_q_stats.rx_err_discard_pkt++;
> +                       goto reuse_rx;
> +               }
> 
> -                       /* is this an error packet? */
> -                       if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +               /* Since we don't have a jumbo ring
> +                * copy small packets if mtu > 1500
> +                */
> +               if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> +                   (len <= RX_COPY_THRESH)) {
> +                       skb = netdev_alloc_skb_ip_align(bp->dev, len);
> +                       if (skb == NULL) {
>                                 DP(NETIF_MSG_RX_ERR,
> -                                  "ERROR  flags %x  rx packet %u\n",
> -                                  cqe_fp_flags, sw_comp_cons);
> -                               fp->eth_q_stats.rx_err_discard_pkt++;
> +                                  "ERROR  packet dropped because of alloc failure\n");
> +                               fp->eth_q_stats.rx_skb_alloc_failed++;
>                                 goto reuse_rx;
>                         }
> -
> -                       /* Since we don't have a jumbo ring
> -                        * copy small packets if mtu > 1500
> -                        */
> -                       if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> -                           (len <= RX_COPY_THRESH)) {
> -                               struct sk_buff *new_skb;
> -
> -                               new_skb = netdev_alloc_skb(bp->dev, len + pad);
> -                               if (new_skb == NULL) {
> -                                       DP(NETIF_MSG_RX_ERR,
> -                                          "ERROR  packet dropped "
> -                                          "because of alloc failure\n");
> -                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> -                                       goto reuse_rx;
> -                               }
> -
> -                               /* aligned copy */
> -                               skb_copy_from_linear_data_offset(skb, pad,
> -                                                   new_skb->data + pad, len);
> -                               skb_reserve(new_skb, pad);
> -                               skb_put(new_skb, len);
> -
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> -
> -                               skb = new_skb;
> -
> -                       } else
> -                       if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> +                       memcpy(skb->data, data + pad, len);
> +                       bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
> +               } else {
> +                       if (likely(bnx2x_alloc_rx_data(bp, fp, bd_prod) == 0)) {
>                                 dma_unmap_single(&bp->pdev->dev,
> -                                       dma_unmap_addr(rx_buf, mapping),
> +                                                dma_unmap_addr(rx_buf, mapping),
>                                                  fp->rx_buf_size,
>                                                  DMA_FROM_DEVICE);
> +                               skb = build_skb(data);
> +                               if (unlikely(!skb)) {
> +                                       kfree(data);
> +                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> +                                       goto next_rx;
> +                               }
>                                 skb_reserve(skb, pad);
> -                               skb_put(skb, len);
> -
>                         } else {
>                                 DP(NETIF_MSG_RX_ERR,
>                                    "ERROR  packet dropped because "
>                                    "of alloc failure\n");
>                                 fp->eth_q_stats.rx_skb_alloc_failed++;
>  reuse_rx:
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> +                               bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
>                                 goto next_rx;
>                         }
> 
> +                       skb_put(skb, len);
>                         skb->protocol = eth_type_trans(skb, bp->dev);
> 
>                         /* Set Toeplitz hash for a none-LRO skb */
> -                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> +                       skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
> 
>                         skb_checksum_none_assert(skb);
> 
> @@ -767,7 +749,7 @@ reuse_rx:
> 
> 
>  next_rx:
> -               rx_buf->skb = NULL;
> +               rx_buf->data = NULL;
> 
>                 bd_cons = NEXT_RX_IDX(bd_cons);
>                 bd_prod = NEXT_RX_IDX(bd_prod);
> @@ -1013,9 +995,9 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
>                                 struct sw_rx_bd *first_buf =
>                                         &tpa_info->first_buf;
> 
> -                               first_buf->skb = netdev_alloc_skb(bp->dev,
> -                                                      fp->rx_buf_size);
> -                               if (!first_buf->skb) {
> +                               first_buf->data = kmalloc(fp->rx_buf_size + NET_SKB_PAD,
> +                                                         GFP_ATOMIC);
> +                               if (!first_buf->data) {
>                                         BNX2X_ERR("Failed to allocate TPA "
>                                                   "skb pool for queue[%d] - "
>                                                   "disabling TPA on this "
> @@ -1118,16 +1100,16 @@ static void bnx2x_free_rx_bds(struct bnx2x_fastpath *fp)
> 
>         for (i = 0; i < NUM_RX_BD; i++) {
>                 struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[i];
> -               struct sk_buff *skb = rx_buf->skb;
> +               u8 *data = rx_buf->data;
> 
> -               if (skb == NULL)
> +               if (data == NULL)
>                         continue;
>                 dma_unmap_single(&bp->pdev->dev,
>                                  dma_unmap_addr(rx_buf, mapping),
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
> 
> -               rx_buf->skb = NULL;
> -               dev_kfree_skb(skb);
> +               rx_buf->data = NULL;
> +               kfree(data);
>         }
>  }
> 
> @@ -1509,6 +1491,7 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
> 
>         for_each_queue(bp, i) {
>                 struct bnx2x_fastpath *fp = &bp->fp[i];
> +               u32 mtu;
> 
>                 /* Always use a mini-jumbo MTU for the FCoE L2 ring */
>                 if (IS_FCOE_IDX(i))
> @@ -1518,13 +1501,15 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
>                          * IP_HEADER_ALIGNMENT_PADDING to prevent a buffer
>                          * overrun attack.
>                          */
> -                       fp->rx_buf_size =
> -                               BNX2X_FCOE_MINI_JUMBO_MTU + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = BNX2X_FCOE_MINI_JUMBO_MTU;
>                 else
> -                       fp->rx_buf_size =
> -                               bp->dev->mtu + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = bp->dev->mtu;
> +               fp->rx_buf_size = BNX2X_FW_RX_ALIGN_START +
> +                                 IP_HEADER_ALIGNMENT_PADDING +
> +                                 ETH_OVREHEAD +
> +                                 mtu +
> +                                 BNX2X_FW_RX_ALIGN_END;
> +               /* Note : rx_buf_size doesnt take into account NET_SKB_PAD */
>         }
>  }
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> index 260226d..41eb17e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> @@ -910,26 +910,27 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
>         return 0;
>  }
> 
> -static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
> -                                    struct bnx2x_fastpath *fp, u16 index)
> +static inline int bnx2x_alloc_rx_data(struct bnx2x *bp,
> +                                     struct bnx2x_fastpath *fp, u16 index)
>  {
> -       struct sk_buff *skb;
> +       u8 *data;
>         struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[index];
>         struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
>         dma_addr_t mapping;
> 
> -       skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> -       if (unlikely(skb == NULL))
> +       data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> +       if (unlikely(data == NULL))
>                 return -ENOMEM;
> 
> -       mapping = dma_map_single(&bp->pdev->dev, skb->data, fp->rx_buf_size,
> +       mapping = dma_map_single(&bp->pdev->dev, data + NET_SKB_PAD,
> +                                fp->rx_buf_size,
>                                  DMA_FROM_DEVICE);
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
> -               dev_kfree_skb_any(skb);
> +               kfree(data);
>                 return -ENOMEM;
>         }
> 
> -       rx_buf->skb = skb;
> +       rx_buf->data = data;
>         dma_unmap_addr_set(rx_buf, mapping, mapping);
> 
>         rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> @@ -938,12 +939,12 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
>         return 0;
>  }
> 
> -/* note that we are not allocating a new skb,
> +/* note that we are not allocating a new buffer,
>   * we are just moving one from cons to prod
>   * we are not creating a new mapping,
>   * so there is no need to check for dma_mapping_error().
>   */
> -static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> +static inline void bnx2x_reuse_rx_data(struct bnx2x_fastpath *fp,
>                                       u16 cons, u16 prod)
>  {
>         struct sw_rx_bd *cons_rx_buf = &fp->rx_buf_ring[cons];
> @@ -953,7 +954,7 @@ static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> 
>         dma_unmap_addr_set(prod_rx_buf, mapping,
>                            dma_unmap_addr(cons_rx_buf, mapping));
> -       prod_rx_buf->skb = cons_rx_buf->skb;
> +       prod_rx_buf->data = cons_rx_buf->data;
>         *prod_bd = *cons_bd;
>  }
> 
> @@ -1029,9 +1030,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>         for (i = 0; i < last; i++) {
>                 struct bnx2x_agg_info *tpa_info = &fp->tpa_info[i];
>                 struct sw_rx_bd *first_buf = &tpa_info->first_buf;
> -               struct sk_buff *skb = first_buf->skb;
> +               u8 *data = first_buf->data;
> 
> -               if (skb == NULL) {
> +               if (data == NULL) {
>                         DP(NETIF_MSG_IFDOWN, "tpa bin %d empty on free\n", i);
>                         continue;
>                 }
> @@ -1039,8 +1040,8 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>                         dma_unmap_single(&bp->pdev->dev,
>                                          dma_unmap_addr(first_buf, mapping),
>                                          fp->rx_buf_size, DMA_FROM_DEVICE);
> -               dev_kfree_skb(skb);
> -               first_buf->skb = NULL;
> +               kfree(data);
> +               first_buf->data = NULL;
>         }
>  }
> 
> @@ -1148,7 +1149,7 @@ static inline int bnx2x_alloc_rx_bds(struct bnx2x_fastpath *fp,
>          * fp->eth_q_stats.rx_skb_alloc_failed = 0
>          */
>         for (i = 0; i < rx_ring_size; i++) {
> -               if (bnx2x_alloc_rx_skb(bp, fp, ring_prod) < 0) {
> +               if (bnx2x_alloc_rx_data(bp, fp, ring_prod) < 0) {
>                         fp->eth_q_stats.rx_skb_alloc_failed++;
>                         continue;
>                 }
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> index f6402fa..ec31871 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> @@ -1740,6 +1740,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         struct sw_rx_bd *rx_buf;
>         u16 len;
>         int rc = -ENODEV;
> +       u8 *data;
> 
>         /* check the loopback mode */
>         switch (loopback_mode) {
> @@ -1865,10 +1866,9 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         dma_sync_single_for_cpu(&bp->pdev->dev,
>                                    dma_unmap_addr(rx_buf, mapping),
>                                    fp_rx->rx_buf_size, DMA_FROM_DEVICE);
> -       skb = rx_buf->skb;
> -       skb_reserve(skb, cqe->fast_path_cqe.placement_offset);
> +       data = rx_buf->data + NET_SKB_PAD + cqe->fast_path_cqe.placement_offset;
>         for (i = ETH_HLEN; i < pkt_size; i++)
> -               if (*(skb->data + i) != (unsigned char) (i & 0xff))
> +               if (*(data + i) != (unsigned char) (i & 0xff))
>                         goto test_loopback_rx_exit;
> 
>         rc = 0;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> index 33ff60d..80fb9b5 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -2789,8 +2789,8 @@ static void bnx2x_pf_rx_q_prep(struct bnx2x *bp,
>         /* This should be a maximum number of data bytes that may be
>          * placed on the BD (not including paddings).
>          */
> -       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN -
> -               IP_HEADER_ALIGNMENT_PADDING;
> +       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN_START -
> +               BNX2X_FW_RX_ALIGN_END - IP_HEADER_ALIGNMENT_PADDING;
> 
>         rxq_init->cl_qzone_id = fp->cl_qzone_id;
>         rxq_init->tpa_agg_sz = tpa_agg_size;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: net: Add network priority cgroup
From: David Täht @ 2011-11-14 16:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Neil Horman, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller
In-Reply-To: <4EC143FC.5060704@intel.com>

[-- Attachment #1: Type: text/plain, Size: 304 bytes --]

On 11/14/2011 05:38 PM, John Fastabend wrote:
> Acked-by: John Fastabend <john.r.fastabend@intel.com> 

Alright. Do as you will.

I would comfort me to know if 802.1Qau (QCN) was going to be enabled by 
default.

http://blog.ioshints.info/2010/11/data-center-bridging-dcb-congestion.html

-- 
Dave Täht


[-- Attachment #2: dave_taht.vcf --]
[-- Type: text/x-vcard, Size: 204 bytes --]

begin:vcard
fn;quoted-printable:Dave T=C3=A4ht
n;quoted-printable:T=C3=A4ht;Dave
email;internet:dave.taht@gmail.com
tel;home:1-239-829-5608
tel;cell:0638645374
x-mozilla-html:FALSE
version:2.1
end:vcard


^ permalink raw reply

* RE: net: Add network priority cgroup
From: Shyam_Iyer @ 2011-11-14 16:43 UTC (permalink / raw)
  To: nhorman, dave.taht; +Cc: netdev, john.r.fastabend, robert.w.love, davem
In-Reply-To: <20111114144358.GB27284@hmsreliant.think-freely.org>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Neil Horman
> Sent: Monday, November 14, 2011 9:44 AM
> To: Dave Taht
> Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> Miller
> Subject: Re: net: Add network priority cgroup
> 
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> wrote:
> > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > >> Data Center Bridging environments are currently somewhat limited
> in their
> > >> ability to provide a general mechanism for controlling traffic
> priority.
> > >> Specifically they are unable to administratively control the
> priority at which
> > >> various types of network traffic are sent.
> > >>
> > >> Currently, the only ways to set the priority of a network buffer
> are:
> > >>
> > >> 1) Through the use of the SO_PRIORITY socket option
> > >> 2) By using low level hooks, like a tc action
> > >>
> > >> (1) is difficult from an administrative perspective because it
> requires that the
> > >> application to be coded to not just assume the default priority is
> sufficient,
> > >> and must expose an administrative interface to allow priority
> adjustment.  Such
> > >> a solution is not scalable in a DCB environment
> > >>
> > >> (2) is also difficult, as it requires constant administrative
> oversight of
> > >> applications so as to build appropriate rules to match traffic
> belonging to
> > >> various classes, so that priority can be appropriately set. It is
> further
> > >> limiting when DCB enabled hardware is in use, due to the fact that
> tc rules are
> > >> only run after a root qdisc has been selected (DCB enabled
> hardware may reserve
> > >> hw queues for various traffic classes and needs the priority to be
> set prior to
> > >> selecting the root qdisc)
> > >>
> > >>
> > >> I've discussed various solutions with John Fastabend, and we saw a
> cgroup as
> > >> being a good general solution to this problem.  The network
> priority cgroup
> > >> allows for a per-interface priority map to be built per cgroup.
>  Any traffic
> > >> originating from an application in a cgroup, that does not
> explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > >>
> > >> Tested by John and myself, with good results
> > >>
> > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > >> CC: Robert Love <robert.w.love@intel.com>
> > >> CC: "David S. Miller" <davem@davemloft.net>
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> in
> > >> the body of a message to majordomo@vger.kernel.org
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>
> > >
> > > Bump, any other thoughts here?  Dave T. has some reasonable
> thoughts regarding
> > > the use of skb->priority, but IMO they really seem orthogonal to
> the purpose of
> > > this change.  Any other reviews would be welcome.
> >
> > Well, in part I've been playing catchup in the hope that lldp and
> > openlldp and/or this dcb netlink layer that I don't know anything
> > about (pointers please?) could help somehow to resolve the semantic
> > mess skb->priority has become in the first place.
> >
> > I liked what was described here.
> >
> > "What if we did at least carve out the DCB functionality away from
> > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > locally generated traffic here, we're talking
> > about skbs that have a socket attached to them.  We could, instead of
> indexing
> > the prio_tc_map with skb->priority, we could index it with
> > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> cgroup
> > then could be, instead of a strict priority cgroup, a queue_selector
> cgroup (or
> > something more appropriately named), and we don't have to touch skb-
> >priority at
> > all.  I'd really rather not start down that road until I got more
> opinions and
> > consensus on that, but it seems like a pretty good solution, one that
> would
> > allow hardware queue selection in systems that use things like DCB to
> co-exist
> > with software queueing features."
> >
> I was initially ok with this, but the more I think about it, the more I
> think
> its just not needed (see further down in this email for my reasoning).
> John,
> Rob, do you have any thoughts here?
> 
> > The piece that still kind of bothered me about the original proposal
> > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > 'give my packets more mojo'.
> >
> > Taking something that took unprioritized packets and assigned them
> and
> > *them only* to a hardware queue struck me as possibly deprioritizing
> > the 'more mojo wanted' packets in the app(s), as they would end up in
> > some other, possibly overloaded, hardware queue.
> >
> I don't really see what you mean by this at all.  Taking packets with
> no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> create a
> problem by having apps prioritized according to conflicting semantic
> rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
> > So a cgroup that moves all of the packets from an application into a
> > given hardware queue, and then gets scheduled normally according to
> > skb->priority and friends (software queue, default of pfifo_fast,
> > etc), seems to make some sense to me. (I wouldn't mind if we had
> > abstractions for software queues, too, like, I need a software queue
> > with these properties, find me a place for it on the hardware - but
> > I'm dreaming)
> >
> > One open question is where do packets generated from other subsystems
> > end up, if you are using a cgroup for the app? arp, dns, etc?
> >
> The overriding rule is the association of an skb to a socket.  If a
> transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its
> priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 
> > So to rephrase your original description from this:
> >
> > >> Any traffic originating from an application in a cgroup, that does
> not explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > > John, Robert, if you're supportive of these changes, some Acks
> would be
> > > appreciated.
> >
> > To this:
> >
> > "Any traffic originating from an application in a cgroup,  will have
> > its hardware queue  assigned to the value designated for that group
> on
> > that interface.  This allows a user space daemon, when conducting
> LLDP
> > negotiation with a DCB enabled peer to create a cgroup based on the
> > APP_TLV value received and administratively assign applications to
> > that hardware queue using the existing cgroup utility
> infrastructure."
> >
> As above, I'm split brained about this.  I'm ok with the idea of making
> this a
> queue selection cgroup, and separating it from priority, but at the
> same time,
> in the context of DCB, we really are assigning priority here, so it
> seems a bit
> false to do something that is not priority.  I also like the fact that
> it
> provides administrative control in a way that netfilter and tc don't
> really
> enable.
> 
> > Assuming we're on the same page here, what the heck is APP_TLV?
> >
> LLDP does layer 2 discovery with peer networking devices. It does so
> using sets
> of Type/length/value tuples.  The types carry various bits of
> information, such
> as which priority groups are available on the network.  The APP tlv
> conveys
> application or feature specific information.  for instance, There is an
> ISCSI
> app tlv that tells the host that  "on the interface you received this
> tlv, iscsi
> traffic must be sent at priority X".  The idea being that, on receipt
> of this
> tlv, the DCB daemon can create an ISCSI network priority cgroup
> instance, and
> augment the cgroup rules file such that, when the user space iscsi
> daemon is
> started, its traffic automatically transmits at the appropriate
> priority.

Love this !

I guess if this is integrated to libvirt via libcgroups VMs could be assigned a network priority..

http://linuxplumbersconf.org/2010/ocw/proposals/843

^ permalink raw reply

* Re: net: Add network priority cgroup
From: John Fastabend @ 2011-11-14 16:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Taht, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller
In-Reply-To: <20111114144358.GB27284@hmsreliant.think-freely.org>

On 11/14/2011 6:43 AM, Neil Horman wrote:
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
>> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>>>> Data Center Bridging environments are currently somewhat limited in their
>>>> ability to provide a general mechanism for controlling traffic priority.
>>>> Specifically they are unable to administratively control the priority at which
>>>> various types of network traffic are sent.
>>>>
>>>> Currently, the only ways to set the priority of a network buffer are:
>>>>
>>>> 1) Through the use of the SO_PRIORITY socket option
>>>> 2) By using low level hooks, like a tc action
>>>>
>>>> (1) is difficult from an administrative perspective because it requires that the
>>>> application to be coded to not just assume the default priority is sufficient,
>>>> and must expose an administrative interface to allow priority adjustment.  Such
>>>> a solution is not scalable in a DCB environment
>>>>
>>>> (2) is also difficult, as it requires constant administrative oversight of
>>>> applications so as to build appropriate rules to match traffic belonging to
>>>> various classes, so that priority can be appropriately set. It is further
>>>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>>>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>>>> hw queues for various traffic classes and needs the priority to be set prior to
>>>> selecting the root qdisc)
>>>>
>>>>
>>>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>>>> being a good general solution to this problem.  The network priority cgroup
>>>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>>>> originating from an application in a cgroup, that does not explicitly set its
>>>> priority with SO_PRIORITY will have its priority assigned to the value
>>>> designated for that group on that interface.  This allows a user space daemon,
>>>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>>>> based on the APP_TLV value received and administratively assign applications to
>>>> that priority using the existing cgroup utility infrastructure.
>>>>
>>>> Tested by John and myself, with good results
>>>>
>>>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>>> CC: John Fastabend <john.r.fastabend@intel.com>
>>>> CC: Robert Love <robert.w.love@intel.com>
>>>> CC: "David S. Miller" <davem@davemloft.net>
>>>> --

Acked-by: John Fastabend <john.r.fastabend@intel.com>

>>>
>>> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
>>> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
>>> this change.  Any other reviews would be welcome.
>>
>> Well, in part I've been playing catchup in the hope that lldp and
>> openlldp and/or this dcb netlink layer that I don't know anything
>> about (pointers please?) could help somehow to resolve the semantic
>> mess skb->priority has become in the first place.
>>
>> I liked what was described here.
>>
>> "What if we did at least carve out the DCB functionality away from
>> skb->priority?  Since, AIUI, we're only concerning ourselves with
>> locally generated traffic here, we're talking
>> about skbs that have a socket attached to them.  We could, instead of indexing
>> the prio_tc_map with skb->priority, we could index it with
>> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
>> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
>> something more appropriately named), and we don't have to touch skb->priority at
>> all.  I'd really rather not start down that road until I got more opinions and
>> consensus on that, but it seems like a pretty good solution, one that would
>> allow hardware queue selection in systems that use things like DCB to co-exist
>> with software queueing features."
>>
> I was initially ok with this, but the more I think about it, the more I think
> its just not needed (see further down in this email for my reasoning).  John,
> Rob, do you have any thoughts here?
> 

I agree the original mechanism seems sufficient. skb->priority already has
all the qdisc and netfilter infrastructure in place to be used and using
it to prioritize "steer" packets at queues seems reasonable to me. Using
the skb->priority this way is not new pfifo_fast uses it to pick a band
and sch_prio does some similar prioritization, mqprio is a multiqueue
variant.

>> The piece that still kind of bothered me about the original proposal
>> (and perhaps this one) was that setting SO_PRIORITY in an app means
>> 'give my packets more mojo'.
>>
>> Taking something that took unprioritized packets and assigned them and
>> *them only* to a hardware queue struck me as possibly deprioritizing
>> the 'more mojo wanted' packets in the app(s), as they would end up in
>> some other, possibly overloaded, hardware queue.
>>
> I don't really see what you mean by this at all.  Taking packets with no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
> problem by having apps prioritized according to conflicting semantic rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
>> So a cgroup that moves all of the packets from an application into a
>> given hardware queue, and then gets scheduled normally according to
>> skb->priority and friends (software queue, default of pfifo_fast,
>> etc), seems to make some sense to me. (I wouldn't mind if we had
>> abstractions for software queues, too, like, I need a software queue
>> with these properties, find me a place for it on the hardware - but
>> I'm dreaming)
>>
>> One open question is where do packets generated from other subsystems
>> end up, if you are using a cgroup for the app? arp, dns, etc?
>>
> The overriding rule is the association of an skb to a socket.  If a transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 

Having a queue selection (skb->queue_mapping?) cgroup also would defeat
any hashing across multiple queues. With mqprio we can assign many hardware
queues to a skb->priority.

w.r.t. software queue abstractions don't we already have this? mq and
mqprio enumerate a software qdisc per hardware queue. You can attach
your favorite qdisc to these. This is likely off-topic for this thread though.

[...]

> In the end, I think its just plain old more useful to assign priorty here than
> some new thing.

I agree.

Thanks,
John


> 
> Neil
>  
>>> John, Robert, if you're supportive of these changes, some Acks would be
>>> appreciated.
>>
>>
>>>
>>>
>>> Regards
>>> Neil
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>>
>> -- 
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> FR Tel: 0638645374
>> http://www.bufferbloat.net
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply

* RE: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eric Dumazet @ 2011-11-14 16:36 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher, pstaszewski, eilong
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AEC2@saturn3.aculab.com>

Le lundi 14 novembre 2011 à 16:21 +0000, David Laight a écrit :
> > bnx2x uses following formula to compute its rx_buf_sz :
> > 
> > dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> > 
> > Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info))
> > 
> > Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> > MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> > 
> > Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> > false sharing because of mem_reclaim in UDP stack.
> > 
> > One possible way to half truesize is to reduce the need by 64 bytes
> > (2112 -> 2048 bytes)
> 
> Would seem to be worth trying to reduce the size for systems
> where the L1 caches lines are 128 bytes - I think that includes
> some of the large ppc systems.
> 
> If the alignment of the malloced memory can't be assumed/forced
> maybe one of the sub-structures could be dynamically placed
> before the data buffer?
> 

What do you call sub-structure ?

Assuming you speak of skb_shared_info, moving at the start wont change
overall head size.

large ppc systems have 64Kb pages and SK_MEM_QUANTUM=64Kb, so it's less
a problem.

One other way would be to have a shinfo->nr_max_frags field, to reduce
max number of fragments, instead of assumin MAX_SKB_FRAGS, for 'legacy'
drivers.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: reduce percpu needs for icmpv6msg mibs
From: Eric Dumazet @ 2011-11-14 16:26 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20111114081253.226e2b66@nehalam.linuxnetplumber.net>

Le lundi 14 novembre 2011 à 08:12 -0800, Stephen Hemminger a écrit :

> Probably should do this for ICMP for IPv4 and all the other
> non hot counters. The whole per-cpu SNMP MIB stuff for all counters
> is massive overkill.

Well, this (ICMP for IPv4) was already done some days ago, I did the
IPv6 part after IPv4, once David agreed it was a useful patch.


Thanks !

^ permalink raw reply

* RE: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: David Laight @ 2011-11-14 16:21 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: netdev, Ben Hutchings, Tom Herbert, Jamal Hadi Salim,
	Stephen Hemminger, Thomas Graf, Herbert Xu, Jeff Kirsher,
	pstaszewski, eilong
In-Reply-To: <1321286734.2272.48.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

 
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to reduce the need by 64 bytes
> (2112 -> 2048 bytes)

Would seem to be worth trying to reduce the size for systems
where the L1 caches lines are 128 bytes - I think that includes
some of the large ppc systems.

If the alignment of the malloced memory can't be assumed/forced
maybe one of the sub-structures could be dynamically placed
before the data buffer?

	David

^ permalink raw reply

* Re: [PATCH RFC] ndo: ndo_queue_xmit/ndo_flush_xmit (was Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net)
From: Michael S. Tsirkin @ 2011-11-14 16:21 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev, davem, kvm, virtualization
In-Reply-To: <20111113174827.GA23310@redhat.com>

On Sun, Nov 13, 2011 at 07:48:28PM +0200, Michael S. Tsirkin wrote:
> @@ -863,6 +872,9 @@ struct net_device_ops {
>  	int			(*ndo_stop)(struct net_device *dev);
>  	netdev_tx_t		(*ndo_start_xmit) (struct sk_buff *skb,
>  						   struct net_device *dev);
> +	netdev_tx_t		(*ndo_queue_xmit)(struct sk_buff *skb,
> +						  struct net_device *dev);
> +	void			(*ndo_flush_xmit)(struct net_device *dev);
>  	u16			(*ndo_select_queue)(struct net_device *dev,
>  						    struct sk_buff *skb);
>  	void			(*ndo_change_rx_flags)(struct net_device *dev,
> @@ -2099,6 +2111,10 @@ extern int		dev_set_mac_address(struct net_device *,

An alternative I considered was to add a boolean flag to ndo_start_xmit
'bool queue' or something like this, plus ndo_flush_xmit.
This will lead to cleaner code I think but will require all drivers
to be changed, so for a proof of concept I decided to go for one
that is less work.

Let me know what looks more palatable ... 

-- 
MST

^ permalink raw reply

* Re: tc: deleting single entry from filter hashtable
From: Eric Dumazet @ 2011-11-14 16:15 UTC (permalink / raw)
  To: Miroslav Kratochvil; +Cc: netdev
In-Reply-To: <CAO0uZ+9N9zwt07EdQzz6KjGcpRT7rdmpaBvHoE0TLDXe-0kxMg@mail.gmail.com>

Le dimanche 13 novembre 2011 à 22:19 +0100, Miroslav Kratochvil a
écrit :
> Hello,
> 
> I have following problem: For performance reasons I've been using a
> setup known from LARTC as 'hashing filters', just like here:
> 
> http://lartc.org/howto/lartc.adv-filter.hashing.html
> 
> For the same performance reasons, I'd like to be able to delete or
> change a _single_ entry from the hash table, so I don't need to refill
> the whole table on every change I need to do (which can get pretty
> slow for amount of entries over 10k, even when using a C program that
> pushes the commands right into 'tc -batch')
> 
> Question: Is the single deletion/modification possible? If not --
> judging from the lack of the documentation on given subject I kindof
> expect that it's not possible -- is there any other possibility to
> delete at least some reasonably small subset of the hash table that
> could be modified and recreated quickly?
> 
> Also, here's my (failed) approach: I tried to assign different prio's
> to all hashtable entries so they could be deleted by reference to that
> prio, but after passing through TC they got created with the same prio
> anyway..
> 
> Thanks for any help with this problem,

Not sure I understand the exact problem, but it is possible to delete a
single filter, and add a new one.

Be careful, since adding a new one really adds a new filter : duplicates
are not detected.

$ tc filter add dev eth0 parent 8001:0 protocol ip prio 100 u32 match ip src  1.2.0.0 classid 1:3
$ tc -s -d filter show dev eth0
filter parent 8001: protocol ip pref 100 u32 
filter parent 8001: protocol ip pref 100 u32 fh 800: ht divisor 1 
filter parent 8001: protocol ip pref 100 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3  (rule hit 4 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 
$ tc filter add dev eth0 parent 8001:0 protocol ip prio 100 u32 match ip src  1.2.0.0 classid 1:3
$ tc -s -d filter show dev eth0
filter parent 8001: protocol ip pref 100 u32 
filter parent 8001: protocol ip pref 100 u32 fh 800: ht divisor 1 
filter parent 8001: protocol ip pref 100 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3  (rule hit 4 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 
filter parent 8001: protocol ip pref 100 u32 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:3  (rule hit 0 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 

^ permalink raw reply

* Re: [PATCH net-next] ipv6: reduce percpu needs for icmpv6msg mibs
From: Stephen Hemminger @ 2011-11-14 16:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1321183444.17837.21.camel@edumazet-laptop>

On Sun, 13 Nov 2011 12:24:04 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Reading /proc/net/snmp6 on a machine with a lot of cpus is very
> expensive (can be ~88000 us).
> 
> This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
> values for all possible cpus can read 16 Mbytes of memory (32MBytes on
> non x86 arches)
> 
> ICMP messages are not considered as fast path on a typical server, and
> eventually few cpus handle them anyway. We can afford an atomic
> operation instead of using percpu data.
> 
> This saves 4096 bytes per cpu and per network namespace.

Probably should do this for ICMP for IPv4 and all the other
non hot counters. The whole per-cpu SNMP MIB stuff for all counters
is massive overkill.

^ permalink raw reply

* [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eric Dumazet @ 2011-11-14 16:05 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Ben Hutchings, Tom Herbert, Jamal Hadi Salim,
	Stephen Hemminger, Thomas Graf, Herbert Xu, Jeff Kirsher,
	pstaszewski, eilong

bnx2x uses following formula to compute its rx_buf_sz :

dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2

Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
skb_shared_info))

Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.

Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
false sharing because of mem_reclaim in UDP stack.

One possible way to half truesize is to reduce the need by 64 bytes
(2112 -> 2048 bytes)

Instead of allocating a full cache line at the end of packet for
alignment, we can use the fact that skb_shared_info sits at the end of
skb->head, and we can use this room, if we convert bnx2x to new
build_skb() infrastructure.

skb_shared_info will be initialized after hardware finished its
transfert, so we can eventually overwrite the final padding.

Using build_skb() also reduces cache line misses in the driver, since we
use cache hot skb instead of cold ones. Number of in-flight sk_buff
structures is lower, they are recycled while still hot.

Performance results :

(820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h         |   30 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c     |  265 ++++------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h     |   33 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |    6 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |    4 
 5 files changed, 170 insertions(+), 168 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 4b90b51..0f7b7a4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -293,8 +293,13 @@ enum {
 #define FCOE_TXQ_IDX(bp)	(MAX_ETH_TXQ_IDX(bp))
 
 /* fast path */
+/*
+ * This driver uses new build_skb() API :
+ * RX ring buffer contains pointer to kmalloc() data only,
+ * skb are built only after Hardware filled the frame.
+ */
 struct sw_rx_bd {
-	struct sk_buff	*skb;
+	u8		*data;
 	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
@@ -424,8 +429,8 @@ union host_hc_status_block {
 
 struct bnx2x_agg_info {
 	/*
-	 * First aggregation buffer is an skb, the following - are pages.
-	 * We will preallocate the skbs for each aggregation when
+	 * First aggregation buffer is a data buffer, the following - are pages.
+	 * We will preallocate the data buffer for each aggregation when
 	 * we open the interface and will replace the BD at the consumer
 	 * with this one when we receive the TPA_START CQE in order to
 	 * keep the Rx BD ring consistent.
@@ -439,6 +444,7 @@ struct bnx2x_agg_info {
 	u16			parsing_flags;
 	u16			vlan_tag;
 	u16			len_on_bd;
+	u32			rxhash;
 };
 
 #define Q_STATS_OFFSET32(stat_name) \
@@ -1187,10 +1193,20 @@ struct bnx2x {
 #define ETH_MAX_JUMBO_PACKET_SIZE	9600
 
 	/* Max supported alignment is 256 (8 shift) */
-#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
-					 L1_CACHE_SHIFT : 8)
-	/* FW use 2 Cache lines Alignment for start packet and size  */
-#define BNX2X_FW_RX_ALIGN		(2 << BNX2X_RX_ALIGN_SHIFT)
+#define BNX2X_RX_ALIGN_SHIFT		min(8, L1_CACHE_SHIFT)
+
+	/* FW uses 2 Cache lines Alignment for start packet and size
+	 *
+	 * We assume skb_build() uses sizeof(struct skb_shared_info) bytes
+	 * at the end of skb->data, to avoid wasting a full cache line.
+	 * This reduces memory use (skb->truesize).
+	 */
+#define BNX2X_FW_RX_ALIGN_START	(1UL << BNX2X_RX_ALIGN_SHIFT)
+
+#define BNX2X_FW_RX_ALIGN_END					\
+	max(1UL << BNX2X_RX_ALIGN_SHIFT, 			\
+	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
 	struct host_sp_status_block *def_status_blk;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 13dad92..0d60b9e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -294,8 +294,21 @@ static void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
 	   fp->last_max_sge, fp->rx_sge_prod);
 }
 
+/* Set Toeplitz hash value in the skb using the value from the
+ * CQE (calculated by HW).
+ */
+static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
+			    const struct eth_fast_path_rx_cqe *cqe)
+{
+	/* Set Toeplitz hash from CQE */
+	if ((bp->dev->features & NETIF_F_RXHASH) &&
+	    (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
+		return le32_to_cpu(cqe->rss_hash_result);
+	return 0;
+}
+
 static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
-			    struct sk_buff *skb, u16 cons, u16 prod,
+			    u16 cons, u16 prod,
 			    struct eth_fast_path_rx_cqe *cqe)
 {
 	struct bnx2x *bp = fp->bp;
@@ -310,9 +323,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 	if (tpa_info->tpa_state != BNX2X_TPA_STOP)
 		BNX2X_ERR("start of bin not in stop [%d]\n", queue);
 
-	/* Try to map an empty skb from the aggregation info  */
+	/* Try to map an empty data buffer from the aggregation info  */
 	mapping = dma_map_single(&bp->pdev->dev,
-				 first_buf->skb->data,
+				 first_buf->data + NET_SKB_PAD,
 				 fp->rx_buf_size, DMA_FROM_DEVICE);
 	/*
 	 *  ...if it fails - move the skb from the consumer to the producer
@@ -322,15 +335,15 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 
 	if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
 		/* Move the BD from the consumer to the producer */
-		bnx2x_reuse_rx_skb(fp, cons, prod);
+		bnx2x_reuse_rx_data(fp, cons, prod);
 		tpa_info->tpa_state = BNX2X_TPA_ERROR;
 		return;
 	}
 
-	/* move empty skb from pool to prod */
-	prod_rx_buf->skb = first_buf->skb;
+	/* move empty data from pool to prod */
+	prod_rx_buf->data = first_buf->data;
 	dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
-	/* point prod_bd to new skb */
+	/* point prod_bd to new data */
 	prod_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
 	prod_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
 
@@ -344,6 +357,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 	tpa_info->tpa_state = BNX2X_TPA_START;
 	tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
 	tpa_info->placement_offset = cqe->placement_offset;
+	tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
 
 #ifdef BNX2X_STOP_ON_ERROR
 	fp->tpa_queue_used |= (1 << queue);
@@ -471,11 +485,12 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 {
 	struct bnx2x_agg_info *tpa_info = &fp->tpa_info[queue];
 	struct sw_rx_bd *rx_buf = &tpa_info->first_buf;
-	u8 pad = tpa_info->placement_offset;
+	u32 pad = tpa_info->placement_offset;
 	u16 len = tpa_info->len_on_bd;
-	struct sk_buff *skb = rx_buf->skb;
+	struct sk_buff *skb = NULL;
+	u8 *data = rx_buf->data;
 	/* alloc new skb */
-	struct sk_buff *new_skb;
+	u8 *new_data;
 	u8 old_tpa_state = tpa_info->tpa_state;
 
 	tpa_info->tpa_state = BNX2X_TPA_STOP;
@@ -486,18 +501,18 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 	if (old_tpa_state == BNX2X_TPA_ERROR)
 		goto drop;
 
-	/* Try to allocate the new skb */
-	new_skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
+	/* Try to allocate the new data */
+	new_data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
 
 	/* Unmap skb in the pool anyway, as we are going to change
 	   pool entry status to BNX2X_TPA_STOP even if new skb allocation
 	   fails. */
 	dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
 			 fp->rx_buf_size, DMA_FROM_DEVICE);
+	if (likely(new_data))
+		skb = build_skb(data);
 
-	if (likely(new_skb)) {
-		prefetch(skb);
-		prefetch(((char *)(skb)) + L1_CACHE_BYTES);
+	if (likely(skb)) {
 
 #ifdef BNX2X_STOP_ON_ERROR
 		if (pad + len > fp->rx_buf_size) {
@@ -509,8 +524,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		}
 #endif
 
-		skb_reserve(skb, pad);
+		skb_reserve(skb, pad + NET_SKB_PAD);
 		skb_put(skb, len);
+		skb->rxhash = tpa_info->rxhash;
 
 		skb->protocol = eth_type_trans(skb, bp->dev);
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -526,8 +542,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		}
 
 
-		/* put new skb in bin */
-		rx_buf->skb = new_skb;
+		/* put new data in bin */
+		rx_buf->data = new_data;
 
 		return;
 	}
@@ -539,19 +555,6 @@ drop:
 	fp->eth_q_stats.rx_skb_alloc_failed++;
 }
 
-/* Set Toeplitz hash value in the skb using the value from the
- * CQE (calculated by HW).
- */
-static inline void bnx2x_set_skb_rxhash(struct bnx2x *bp, union eth_rx_cqe *cqe,
-					struct sk_buff *skb)
-{
-	/* Set Toeplitz hash from CQE */
-	if ((bp->dev->features & NETIF_F_RXHASH) &&
-	    (cqe->fast_path_cqe.status_flags &
-	     ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
-		skb->rxhash =
-		le32_to_cpu(cqe->fast_path_cqe.rss_hash_result);
-}
 
 int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 {
@@ -594,6 +597,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		u8 cqe_fp_flags;
 		enum eth_rx_cqe_type cqe_fp_type;
 		u16 len, pad;
+		u8 *data;
 
 #ifdef BNX2X_STOP_ON_ERROR
 		if (unlikely(bp->panic))
@@ -604,13 +608,6 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		bd_prod = RX_BD(bd_prod);
 		bd_cons = RX_BD(bd_cons);
 
-		/* Prefetch the page containing the BD descriptor
-		   at producer's index. It will be needed when new skb is
-		   allocated */
-		prefetch((void *)(PAGE_ALIGN((unsigned long)
-					     (&fp->rx_desc_ring[bd_prod])) -
-				  PAGE_SIZE + 1));
-
 		cqe = &fp->rx_comp_ring[comp_ring_cons];
 		cqe_fp = &cqe->fast_path_cqe;
 		cqe_fp_flags = cqe_fp->type_error_flags;
@@ -626,125 +623,110 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		if (unlikely(CQE_TYPE_SLOW(cqe_fp_type))) {
 			bnx2x_sp_event(fp, cqe);
 			goto next_cqe;
+		}
+		rx_buf = &fp->rx_buf_ring[bd_cons];
+		data = rx_buf->data;
 
-		/* this is an rx packet */
-		} else {
-			rx_buf = &fp->rx_buf_ring[bd_cons];
-			skb = rx_buf->skb;
-			prefetch(skb);
-
-			if (!CQE_TYPE_FAST(cqe_fp_type)) {
+		if (!CQE_TYPE_FAST(cqe_fp_type)) {
 #ifdef BNX2X_STOP_ON_ERROR
-				/* sanity check */
-				if (fp->disable_tpa &&
-				    (CQE_TYPE_START(cqe_fp_type) ||
-				     CQE_TYPE_STOP(cqe_fp_type)))
-					BNX2X_ERR("START/STOP packet while "
-						  "disable_tpa type %x\n",
-						  CQE_TYPE(cqe_fp_type));
+			/* sanity check */
+			if (fp->disable_tpa &&
+			    (CQE_TYPE_START(cqe_fp_type) ||
+			     CQE_TYPE_STOP(cqe_fp_type)))
+				BNX2X_ERR("START/STOP packet while "
+					  "disable_tpa type %x\n",
+					  CQE_TYPE(cqe_fp_type));
 #endif
 
-				if (CQE_TYPE_START(cqe_fp_type)) {
-					u16 queue = cqe_fp->queue_index;
-					DP(NETIF_MSG_RX_STATUS,
-					   "calling tpa_start on queue %d\n",
-					   queue);
-
-					bnx2x_tpa_start(fp, queue, skb,
-							bd_cons, bd_prod,
-							cqe_fp);
-
-					/* Set Toeplitz hash for LRO skb */
-					bnx2x_set_skb_rxhash(bp, cqe, skb);
-
-					goto next_rx;
+			if (CQE_TYPE_START(cqe_fp_type)) {
+				u16 queue = cqe_fp->queue_index;
+				DP(NETIF_MSG_RX_STATUS,
+				   "calling tpa_start on queue %d\n",
+				   queue);
 
-				} else {
-					u16 queue =
-						cqe->end_agg_cqe.queue_index;
-					DP(NETIF_MSG_RX_STATUS,
-					   "calling tpa_stop on queue %d\n",
-					   queue);
+				bnx2x_tpa_start(fp, queue,
+						bd_cons, bd_prod,
+						cqe_fp);
+				goto next_rx;
+			} else {
+				u16 queue =
+					cqe->end_agg_cqe.queue_index;
+				DP(NETIF_MSG_RX_STATUS,
+				   "calling tpa_stop on queue %d\n",
+				   queue);
 
-					bnx2x_tpa_stop(bp, fp, queue,
-						       &cqe->end_agg_cqe,
-						       comp_ring_cons);
+				bnx2x_tpa_stop(bp, fp, queue,
+					       &cqe->end_agg_cqe,
+					       comp_ring_cons);
 #ifdef BNX2X_STOP_ON_ERROR
-					if (bp->panic)
-						return 0;
+				if (bp->panic)
+					return 0;
 #endif
 
-					bnx2x_update_sge_prod(fp, cqe_fp);
-					goto next_cqe;
-				}
+				bnx2x_update_sge_prod(fp, cqe_fp);
+				goto next_cqe;
 			}
-			/* non TPA */
-			len = le16_to_cpu(cqe_fp->pkt_len);
-			pad = cqe_fp->placement_offset;
-			dma_sync_single_for_cpu(&bp->pdev->dev,
+		}
+		/* non TPA */
+		len = le16_to_cpu(cqe_fp->pkt_len);
+		pad = cqe_fp->placement_offset;
+		dma_sync_single_for_cpu(&bp->pdev->dev,
 					dma_unmap_addr(rx_buf, mapping),
-						       pad + RX_COPY_THRESH,
-						       DMA_FROM_DEVICE);
-			prefetch(((char *)(skb)) + L1_CACHE_BYTES);
+					pad + RX_COPY_THRESH,
+					DMA_FROM_DEVICE);
+		pad += NET_SKB_PAD;
+		prefetch(data + pad); /* speedup eth_type_trans() */
+		/* is this an error packet? */
+		if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
+			DP(NETIF_MSG_RX_ERR,
+			   "ERROR  flags %x  rx packet %u\n",
+			   cqe_fp_flags, sw_comp_cons);
+			fp->eth_q_stats.rx_err_discard_pkt++;
+			goto reuse_rx;
+		}
 
-			/* is this an error packet? */
-			if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
+		/* Since we don't have a jumbo ring
+		 * copy small packets if mtu > 1500
+		 */
+		if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
+		    (len <= RX_COPY_THRESH)) {
+			skb = netdev_alloc_skb_ip_align(bp->dev, len);
+			if (skb == NULL) {
 				DP(NETIF_MSG_RX_ERR,
-				   "ERROR  flags %x  rx packet %u\n",
-				   cqe_fp_flags, sw_comp_cons);
-				fp->eth_q_stats.rx_err_discard_pkt++;
+				   "ERROR  packet dropped because of alloc failure\n");
+				fp->eth_q_stats.rx_skb_alloc_failed++;
 				goto reuse_rx;
 			}
-
-			/* Since we don't have a jumbo ring
-			 * copy small packets if mtu > 1500
-			 */
-			if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
-			    (len <= RX_COPY_THRESH)) {
-				struct sk_buff *new_skb;
-
-				new_skb = netdev_alloc_skb(bp->dev, len + pad);
-				if (new_skb == NULL) {
-					DP(NETIF_MSG_RX_ERR,
-					   "ERROR  packet dropped "
-					   "because of alloc failure\n");
-					fp->eth_q_stats.rx_skb_alloc_failed++;
-					goto reuse_rx;
-				}
-
-				/* aligned copy */
-				skb_copy_from_linear_data_offset(skb, pad,
-						    new_skb->data + pad, len);
-				skb_reserve(new_skb, pad);
-				skb_put(new_skb, len);
-
-				bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
-
-				skb = new_skb;
-
-			} else
-			if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
+			memcpy(skb->data, data + pad, len);
+			bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
+		} else {
+			if (likely(bnx2x_alloc_rx_data(bp, fp, bd_prod) == 0)) {
 				dma_unmap_single(&bp->pdev->dev,
-					dma_unmap_addr(rx_buf, mapping),
+						 dma_unmap_addr(rx_buf, mapping),
 						 fp->rx_buf_size,
 						 DMA_FROM_DEVICE);
+				skb = build_skb(data);
+				if (unlikely(!skb)) {
+					kfree(data);
+					fp->eth_q_stats.rx_skb_alloc_failed++;
+					goto next_rx;
+				}
 				skb_reserve(skb, pad);
-				skb_put(skb, len);
-
 			} else {
 				DP(NETIF_MSG_RX_ERR,
 				   "ERROR  packet dropped because "
 				   "of alloc failure\n");
 				fp->eth_q_stats.rx_skb_alloc_failed++;
 reuse_rx:
-				bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
+				bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
 				goto next_rx;
 			}
 
+			skb_put(skb, len);
 			skb->protocol = eth_type_trans(skb, bp->dev);
 
 			/* Set Toeplitz hash for a none-LRO skb */
-			bnx2x_set_skb_rxhash(bp, cqe, skb);
+			skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
 
 			skb_checksum_none_assert(skb);
 
@@ -767,7 +749,7 @@ reuse_rx:
 
 
 next_rx:
-		rx_buf->skb = NULL;
+		rx_buf->data = NULL;
 
 		bd_cons = NEXT_RX_IDX(bd_cons);
 		bd_prod = NEXT_RX_IDX(bd_prod);
@@ -1013,9 +995,9 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
 				struct sw_rx_bd *first_buf =
 					&tpa_info->first_buf;
 
-				first_buf->skb = netdev_alloc_skb(bp->dev,
-						       fp->rx_buf_size);
-				if (!first_buf->skb) {
+				first_buf->data = kmalloc(fp->rx_buf_size + NET_SKB_PAD,
+							  GFP_ATOMIC);
+				if (!first_buf->data) {
 					BNX2X_ERR("Failed to allocate TPA "
 						  "skb pool for queue[%d] - "
 						  "disabling TPA on this "
@@ -1118,16 +1100,16 @@ static void bnx2x_free_rx_bds(struct bnx2x_fastpath *fp)
 
 	for (i = 0; i < NUM_RX_BD; i++) {
 		struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[i];
-		struct sk_buff *skb = rx_buf->skb;
+		u8 *data = rx_buf->data;
 
-		if (skb == NULL)
+		if (data == NULL)
 			continue;
 		dma_unmap_single(&bp->pdev->dev,
 				 dma_unmap_addr(rx_buf, mapping),
 				 fp->rx_buf_size, DMA_FROM_DEVICE);
 
-		rx_buf->skb = NULL;
-		dev_kfree_skb(skb);
+		rx_buf->data = NULL;
+		kfree(data);
 	}
 }
 
@@ -1509,6 +1491,7 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
 
 	for_each_queue(bp, i) {
 		struct bnx2x_fastpath *fp = &bp->fp[i];
+		u32 mtu;
 
 		/* Always use a mini-jumbo MTU for the FCoE L2 ring */
 		if (IS_FCOE_IDX(i))
@@ -1518,13 +1501,15 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
 			 * IP_HEADER_ALIGNMENT_PADDING to prevent a buffer
 			 * overrun attack.
 			 */
-			fp->rx_buf_size =
-				BNX2X_FCOE_MINI_JUMBO_MTU + ETH_OVREHEAD +
-				BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
+			mtu = BNX2X_FCOE_MINI_JUMBO_MTU;
 		else
-			fp->rx_buf_size =
-				bp->dev->mtu + ETH_OVREHEAD +
-				BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
+			mtu = bp->dev->mtu;
+		fp->rx_buf_size = BNX2X_FW_RX_ALIGN_START +
+				  IP_HEADER_ALIGNMENT_PADDING +
+				  ETH_OVREHEAD +
+				  mtu +
+				  BNX2X_FW_RX_ALIGN_END;
+		/* Note : rx_buf_size doesnt take into account NET_SKB_PAD */
 	}
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index 260226d..41eb17e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -910,26 +910,27 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
 	return 0;
 }
 
-static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
-				     struct bnx2x_fastpath *fp, u16 index)
+static inline int bnx2x_alloc_rx_data(struct bnx2x *bp,
+				      struct bnx2x_fastpath *fp, u16 index)
 {
-	struct sk_buff *skb;
+	u8 *data;
 	struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[index];
 	struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
 	dma_addr_t mapping;
 
-	skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
-	if (unlikely(skb == NULL))
+	data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
+	if (unlikely(data == NULL))
 		return -ENOMEM;
 
-	mapping = dma_map_single(&bp->pdev->dev, skb->data, fp->rx_buf_size,
+	mapping = dma_map_single(&bp->pdev->dev, data + NET_SKB_PAD,
+				 fp->rx_buf_size,
 				 DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
-		dev_kfree_skb_any(skb);
+		kfree(data);
 		return -ENOMEM;
 	}
 
-	rx_buf->skb = skb;
+	rx_buf->data = data;
 	dma_unmap_addr_set(rx_buf, mapping, mapping);
 
 	rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
@@ -938,12 +939,12 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
 	return 0;
 }
 
-/* note that we are not allocating a new skb,
+/* note that we are not allocating a new buffer,
  * we are just moving one from cons to prod
  * we are not creating a new mapping,
  * so there is no need to check for dma_mapping_error().
  */
-static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
+static inline void bnx2x_reuse_rx_data(struct bnx2x_fastpath *fp,
 				      u16 cons, u16 prod)
 {
 	struct sw_rx_bd *cons_rx_buf = &fp->rx_buf_ring[cons];
@@ -953,7 +954,7 @@ static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
 
 	dma_unmap_addr_set(prod_rx_buf, mapping,
 			   dma_unmap_addr(cons_rx_buf, mapping));
-	prod_rx_buf->skb = cons_rx_buf->skb;
+	prod_rx_buf->data = cons_rx_buf->data;
 	*prod_bd = *cons_bd;
 }
 
@@ -1029,9 +1030,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
 	for (i = 0; i < last; i++) {
 		struct bnx2x_agg_info *tpa_info = &fp->tpa_info[i];
 		struct sw_rx_bd *first_buf = &tpa_info->first_buf;
-		struct sk_buff *skb = first_buf->skb;
+		u8 *data = first_buf->data;
 
-		if (skb == NULL) {
+		if (data == NULL) {
 			DP(NETIF_MSG_IFDOWN, "tpa bin %d empty on free\n", i);
 			continue;
 		}
@@ -1039,8 +1040,8 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
 			dma_unmap_single(&bp->pdev->dev,
 					 dma_unmap_addr(first_buf, mapping),
 					 fp->rx_buf_size, DMA_FROM_DEVICE);
-		dev_kfree_skb(skb);
-		first_buf->skb = NULL;
+		kfree(data);
+		first_buf->data = NULL;
 	}
 }
 
@@ -1148,7 +1149,7 @@ static inline int bnx2x_alloc_rx_bds(struct bnx2x_fastpath *fp,
 	 * fp->eth_q_stats.rx_skb_alloc_failed = 0
 	 */
 	for (i = 0; i < rx_ring_size; i++) {
-		if (bnx2x_alloc_rx_skb(bp, fp, ring_prod) < 0) {
+		if (bnx2x_alloc_rx_data(bp, fp, ring_prod) < 0) {
 			fp->eth_q_stats.rx_skb_alloc_failed++;
 			continue;
 		}
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index f6402fa..ec31871 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -1740,6 +1740,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
 	struct sw_rx_bd *rx_buf;
 	u16 len;
 	int rc = -ENODEV;
+	u8 *data;
 
 	/* check the loopback mode */
 	switch (loopback_mode) {
@@ -1865,10 +1866,9 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
 	dma_sync_single_for_cpu(&bp->pdev->dev,
 				   dma_unmap_addr(rx_buf, mapping),
 				   fp_rx->rx_buf_size, DMA_FROM_DEVICE);
-	skb = rx_buf->skb;
-	skb_reserve(skb, cqe->fast_path_cqe.placement_offset);
+	data = rx_buf->data + NET_SKB_PAD + cqe->fast_path_cqe.placement_offset;
 	for (i = ETH_HLEN; i < pkt_size; i++)
-		if (*(skb->data + i) != (unsigned char) (i & 0xff))
+		if (*(data + i) != (unsigned char) (i & 0xff))
 			goto test_loopback_rx_exit;
 
 	rc = 0;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 33ff60d..80fb9b5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -2789,8 +2789,8 @@ static void bnx2x_pf_rx_q_prep(struct bnx2x *bp,
 	/* This should be a maximum number of data bytes that may be
 	 * placed on the BD (not including paddings).
 	 */
-	rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN -
-		IP_HEADER_ALIGNMENT_PADDING;
+	rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN_START -
+		BNX2X_FW_RX_ALIGN_END -	IP_HEADER_ALIGNMENT_PADDING;
 
 	rxq_init->cl_qzone_id = fp->cl_qzone_id;
 	rxq_init->tpa_agg_sz = tpa_agg_size;

^ permalink raw reply related

* [PATCH net-next 1/2] net: introduce build_skb()
From: Eric Dumazet @ 2011-11-14 16:03 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, pstaszewski, netdev, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher

One of the thing we discussed during netdev 2011 conference was the idea
to change some network drivers to allocate/populate their skb at RX
completion time, right before feeding the skb to network stack.

In old days, we allocated skbs when populating the RX ring.

This means bringing into cpu cache sk_buff and skb_shared_info cache
lines (since we clear/initialize them), then 'queue' skb->data to NIC.

By the time NIC fills a frame in skb->data buffer and host can process
it, cpu probably threw away the cache lines from its caches, because lot
of things happened between the allocation and final use.

So the deal would be to allocate only the data buffer for the NIC to
populate its RX ring buffer. And use build_skb() at RX completion to
attach a data buffer (now filled with an ethernet frame) to a new skb,
initialize the skb_shared_info portion, and give the hot skb to network
stack.

build_skb() is the function to allocate an skb, caller providing the
data buffer that should be attached to it. Drivers are expected to call 
skb_reserve() right after build_skb() to adjust skb->data to the
Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
drivers might add a hardware provided alignment)

Data provided to build_skb() MUST have been allocated by a prior
kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)) bytes at the end of the data without corrupting
incoming frame.

data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
	       GFP_ATOMIC);
...
skb = build_skb(data);
if (!skb) {
	recycle_data(data);
} else {
	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
	...
}

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 include/linux/skbuff.h |    1 
 net/core/skbuff.c      |   49 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..abad8a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -540,6 +540,7 @@ extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
+extern struct sk_buff *build_skb(void *data);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 18a3ceb..8d2c5b3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -245,6 +245,55 @@ nodata:
 EXPORT_SYMBOL(__alloc_skb);
 
 /**
+ * build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. @data must have been allocated by kmalloc()
+ * The return is the new skb buffer.
+ * On a failure the return is %NULL, and @data is not freed.
+ * Notes :
+ *  Before IO, driver allocates only data buffer where NIC put incoming frame
+ *  Driver should add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
+ *  before giving packet to stack.
+ *  RX rings only contains data buffers, not full skbs.
+ */
+struct sk_buff *build_skb(void *data)
+{
+	struct skb_shared_info *shinfo;
+	struct sk_buff *skb;
+	unsigned int size;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = SKB_TRUESIZE(size);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
+	return skb;
+}
+EXPORT_SYMBOL(build_skb);
+
+/**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
  *	@length: length to allocate

^ permalink raw reply related

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
From: Eric Dumazet @ 2011-11-14 15:57 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, bhutchings, pstaszewski, netdev, Thomas Graf, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger
In-Reply-To: <1321251945.17837.55.camel@edumazet-laptop>

Le lundi 14 novembre 2011 à 07:25 +0100, Eric Dumazet a écrit :
> Le lundi 14 novembre 2011 à 00:08 -0500, David Miller a écrit :
> 
> > I fully support bringing this thing back to life :-)
> 
> I'll make extensive tests today and provide two patches when ready, with
> all performance results.
> 
> Some prefetch() calls will be removed, since build_skb() provides
> already cache hot skb.

Impressive results :

before : 720.000 pps
after :  820.000 pps

[ One mono threaded application receiving UDP messages on a single
socket, asking IP_PKTINFO ancillary info ]

Latencies are also a bit improved : softirq handler dirties about 320
bytes less per skb.

Definitely worth the pain.

I am sending two patches. Other drivers probably can benefit from
build_skb() as well.

[PATCH net-next 1/2] net: introduce build_skb()
[PATCH net-next 2/2] bnx2x: uses build_skb() in receive path

^ permalink raw reply

* Re: [PATCH] net/packet: remove dead code and unneeded variable from prb_setup_retire_blk_timer()
From: chetan loke @ 2011-11-14 15:54 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: netdev, linux-kernel, David S. Miller, Eric Dumazet, Changli Gao,
	Ben Greear, waltje, gw4pts, waltje, ross.biro, alan
In-Reply-To: <alpine.LNX.2.00.1111132247440.3368@swampdragon.chaosbits.net>

On Sun, Nov 13, 2011 at 4:55 PM, Jesper Juhl <jj@chaosbits.net> wrote:
> We test for 'tx_ring' being != zero and BUG() if that's the case. So after
> that check there is no way that 'tx_ring' could be anything _but_ zero, so
> testing it again is just dead code. Once that dead code is removed, the
> 'pkc' local variable becomes entirely redundant, so remove that as well.

It was there so that it would be a no-brainer thing for the next
person who would want to enable tx_ring support. And given that this
check was called just once during the initial setup, it didn't seem
like a big deal to me. And the BUG() would let the next-person know
what different paths need to be plugged.


Chetan Loke

^ permalink raw reply

* Re: [PATCH V2 2/6] net/ethernet/atl1e: Disable ASPM
From: Matthew Garrett @ 2011-11-14 15:44 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev, jcliburn, chris.snook
In-Reply-To: <20111114.003322.1579175313793306922.davem@davemloft.net>

Sorry, I screwed up there - it looks like I tested atl1c twice. Still 
inexcusable, and apologies for the wasted time.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-14 14:43 UTC (permalink / raw)
  To: Dave Taht; +Cc: netdev, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <CAA93jw7tQtBGFMuDiK_pRUqAdfBVqrwbMegR8gh-8KgVb19PWg@mail.gmail.com>

On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> >> Data Center Bridging environments are currently somewhat limited in their
> >> ability to provide a general mechanism for controlling traffic priority.
> >> Specifically they are unable to administratively control the priority at which
> >> various types of network traffic are sent.
> >>
> >> Currently, the only ways to set the priority of a network buffer are:
> >>
> >> 1) Through the use of the SO_PRIORITY socket option
> >> 2) By using low level hooks, like a tc action
> >>
> >> (1) is difficult from an administrative perspective because it requires that the
> >> application to be coded to not just assume the default priority is sufficient,
> >> and must expose an administrative interface to allow priority adjustment.  Such
> >> a solution is not scalable in a DCB environment
> >>
> >> (2) is also difficult, as it requires constant administrative oversight of
> >> applications so as to build appropriate rules to match traffic belonging to
> >> various classes, so that priority can be appropriately set. It is further
> >> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> >> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> >> hw queues for various traffic classes and needs the priority to be set prior to
> >> selecting the root qdisc)
> >>
> >>
> >> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> >> being a good general solution to this problem.  The network priority cgroup
> >> allows for a per-interface priority map to be built per cgroup.  Any traffic
> >> originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> >>
> >> Tested by John and myself, with good results
> >>
> >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >> CC: John Fastabend <john.r.fastabend@intel.com>
> >> CC: Robert Love <robert.w.love@intel.com>
> >> CC: "David S. Miller" <davem@davemloft.net>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> > Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> > the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> > this change.  Any other reviews would be welcome.
> 
> Well, in part I've been playing catchup in the hope that lldp and
> openlldp and/or this dcb netlink layer that I don't know anything
> about (pointers please?) could help somehow to resolve the semantic
> mess skb->priority has become in the first place.
> 
> I liked what was described here.
> 
> "What if we did at least carve out the DCB functionality away from
> skb->priority?  Since, AIUI, we're only concerning ourselves with
> locally generated traffic here, we're talking
> about skbs that have a socket attached to them.  We could, instead of indexing
> the prio_tc_map with skb->priority, we could index it with
> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
> something more appropriately named), and we don't have to touch skb->priority at
> all.  I'd really rather not start down that road until I got more opinions and
> consensus on that, but it seems like a pretty good solution, one that would
> allow hardware queue selection in systems that use things like DCB to co-exist
> with software queueing features."
> 
I was initially ok with this, but the more I think about it, the more I think
its just not needed (see further down in this email for my reasoning).  John,
Rob, do you have any thoughts here?

> The piece that still kind of bothered me about the original proposal
> (and perhaps this one) was that setting SO_PRIORITY in an app means
> 'give my packets more mojo'.
> 
> Taking something that took unprioritized packets and assigned them and
> *them only* to a hardware queue struck me as possibly deprioritizing
> the 'more mojo wanted' packets in the app(s), as they would end up in
> some other, possibly overloaded, hardware queue.
> 
I don't really see what you mean by this at all.  Taking packets with no
priority and assigning them a priority doesn't really have an effect on
pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
problem by having apps prioritized according to conflicting semantic rules, but
that strikes me as administrative error.  Garbage in...Garbage out.

> So a cgroup that moves all of the packets from an application into a
> given hardware queue, and then gets scheduled normally according to
> skb->priority and friends (software queue, default of pfifo_fast,
> etc), seems to make some sense to me. (I wouldn't mind if we had
> abstractions for software queues, too, like, I need a software queue
> with these properties, find me a place for it on the hardware - but
> I'm dreaming)
> 
> One open question is where do packets generated from other subsystems
> end up, if you are using a cgroup for the app? arp, dns, etc?
> 
The overriding rule is the association of an skb to a socket.  If a transmitted
frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
as set when we passed through the sendmsg code at the top of the stack.
Otherwise its behavior is unchanged from its current standpoint.

> So to rephrase your original description from this:
> 
> >> Any traffic originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> To this:
> 
> "Any traffic originating from an application in a cgroup,  will have
> its hardware queue  assigned to the value designated for that group on
> that interface.  This allows a user space daemon, when conducting LLDP
> negotiation with a DCB enabled peer to create a cgroup based on the
> APP_TLV value received and administratively assign applications to
> that hardware queue using the existing cgroup utility infrastructure."
> 
As above, I'm split brained about this.  I'm ok with the idea of making this a
queue selection cgroup, and separating it from priority, but at the same time,
in the context of DCB, we really are assigning priority here, so it seems a bit
false to do something that is not priority.  I also like the fact that it
provides administrative control in a way that netfilter and tc don't really
enable. 

> Assuming we're on the same page here, what the heck is APP_TLV?
> 
LLDP does layer 2 discovery with peer networking devices. It does so using sets
of Type/length/value tuples.  The types carry various bits of information, such
as which priority groups are available on the network.  The APP tlv conveys
application or feature specific information.  for instance, There is an ISCSI
app tlv that tells the host that  "on the interface you received this tlv, iscsi
traffic must be sent at priority X".  The idea being that, on receipt of this
tlv, the DCB daemon can create an ISCSI network priority cgroup instance, and
augment the cgroup rules file such that, when the user space iscsi daemon is
started, its traffic automatically transmits at the appropriate priority.

Actually, as I sit here thinking about it, I'm less supportive of separating the
assignment of hw queue / root qdisc from priority.  I say that because its
really just not necessecary, at least not in a DCB environment.  In a dcb
environment, we use priority exclusively to select the root qdisc and the
associated hardware queue.  Any software queue policing/grooming that takes
place after that by definition can't make any decisions based on skb->priority,
as the priority will be the same for each frame in the given parent queue (since
we used skb->priority to assign the parent queue).  You can still use RED or
some other queue discipline to manage queue length, but you have to understand
that priority won't be a packet-differentiating factor there.  If you made this
a separate queue selection variable, you could manipulate priority independenty,
but it would be non-sensical, because the intent would still be to send all
packets on that queue at the same priority, since thats what that hardware queue
is configured for.

In the alternate, if you don't use DCB, then this cgroup becomes somewhat
worthless anyway, because no-one has a need to select a specific hardware queue.
If you want to categorize traffic into classes you can already use the net_cls
cgroup, which works quite well for bandwidth allocation, and that still frees up
the skb->prioirty assignment to use as you see fit.

In the end, I think its just plain old more useful to assign priorty here than
some new thing.

Neil
 
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> 
> >
> >
> > Regards
> > Neil
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> 
> 
> -- 
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> FR Tel: 0638645374
> http://www.bufferbloat.net
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH] net, bridge: print log message after state changed
From: Joe Perches @ 2011-11-14 14:27 UTC (permalink / raw)
  To: David Miller; +Cc: Wolfgang.Fritz, holger.brunck, netdev, shemminger, bridge
In-Reply-To: <20111114.020749.1176229013272872343.davem@davemloft.net>

On Mon, 2011-11-14 at 02:07 -0500, David Miller wrote:
> I would never expect an "entering" message to print out after the event
> happens, and in fact I'd _always_ want to see it beforehand so that if
> the state change caused a crash or similar it'd be that much easier
> to pinpoint the proper location.
> 
> I'm still not applying this.  If the other log messages behave
> differently, they are broken, so go fix those instead.

Perhaps use "entered" after the transition or "entering" before.

^ permalink raw reply

* Re: [PATCH] ip6_tunnel: copy parms.name after register_netdevice
From: Josh Boyer @ 2011-11-14 14:02 UTC (permalink / raw)
  To: David Miller; +Cc: jpirko, netdev, linux-kernel, stable
In-Reply-To: <20111114.002456.1590067179487443754.davem@davemloft.net>

On Mon, Nov 14, 2011 at 12:24:56AM -0500, David Miller wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> Date: Thu, 10 Nov 2011 20:10:23 -0500
> 
> > @@ -288,6 +288,8 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct ip6_tnl_parm *p)
> >  
> >  	if ((err = register_netdevice(dev)) < 0)
> >  		goto failed_free;
> > +	
> > +	strcpy(t->parms.name, dev->name);
> 
> Trailing whitespace on that blank line you're adding.

Ugh.  Forgot to run checkpatch.  Apologies.

> I fixed this up, applied, and queued up for -stable, thanks.

Thank you.

josh

^ permalink raw reply

* [PATCH] NET: MIPS: lantiq: fix etop compile error
From: John Crispin @ 2011-11-14 15:01 UTC (permalink / raw)
  To: netdev; +Cc: John Crispin

The Lantiq ETOP ethernet driver fails to build in 3.2-rc1 due to 2 missing
header files.

Signed-off-by: John Crispin <blogic@openwrt.org>
---
 drivers/net/ethernet/lantiq_etop.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 6bb2b95..0b3567a 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -34,6 +34,8 @@
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/dma-mapping.h>
+#include <linux/module.h>
 
 #include <asm/checksum.h>
 
-- 
1.7.7.1

^ permalink raw reply related

* net/can/mscan: Fix buggy listen only mode setting
From: Wolfgang Grandegger @ 2011-11-14 13:34 UTC (permalink / raw)
  To: Netdev; +Cc: linux-can, Socketcan-core, Marc Kleine-Budde

This patch fixes an issue introduced recently with commit
452448f9283e1939408b397e87974a418825b0a8.

CC: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
---
 drivers/net/can/mscan/mscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/mscan/mscan.c b/drivers/net/can/mscan/mscan.c
index 74f3b18..1c82dd8 100644
--- a/drivers/net/can/mscan/mscan.c
+++ b/drivers/net/can/mscan/mscan.c
@@ -581,7 +581,7 @@ static int mscan_open(struct net_device *dev)
 
 	priv->open_time = jiffies;
 
-	if (ctrlmode.flags & CAN_CTRLMODE_LISTENONLY)
+	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
 		setbits8(&regs->canctl1, MSCAN_LISTEN);
 	else
 		clrbits8(&regs->canctl1, MSCAN_LISTEN);
-- 
1.7.6.4

^ permalink raw reply related

* RE: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: David Laight @ 2011-11-14 13:25 UTC (permalink / raw)
  To: Ian Campbell, Michael S. Tsirkin
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs
In-Reply-To: <1321276055.3664.67.camel@zakaz.uk.xensource.com>

> This prevents an issue where an ACK is delayed, a retransmit is queued
(either
> at the RPC or TCP level) and the ACK arrives before the retransmission
hits the
> wire. If this happens to an NFS WRITE RPC then the write() system call
> completes and the userspace process can continue, potentially
modifying data
> referenced by the retransmission before the retransmission occurs.

The only problem I see is that the source address might get
invalidated (assuming it is a user address mapped into kernel).
Not sure what effect the fault would have...

If it is an RPC retransmittion the NFS write won't be
terminated by the RPC ACK (because it is a stale ACK)
and the destination will see two copies of the NFS write.
(This is quite common with NFS/UDP.)

If it is a TCP retransmittion, then the receiving TCP stack
will see it as duplicate data and discard the contents
without passing them to RPC.

Interrupting an NFS write with a signal is problematic
anyway - I'm sure I've seen systems (SYSV?) when NFS
writes were uninterruptable. NFI how Linux handles this.

	David

^ permalink raw reply

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: Ian Campbell @ 2011-11-14 13:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David S. Miller,
	Neil Brown, J. Bruce Fields,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20111113101713.GB15322-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sun, 2011-11-13 at 10:17 +0000, Michael S. Tsirkin wrote:
> On Fri, Nov 11, 2011 at 01:20:27PM +0000, Ian Campbell wrote:
> > On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> > > On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > > completes and the userspace process can continue, potentially modifying data
> > > > referenced by the retransmission before the retransmission occurs.
> > > > 
> > > > Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> > > > Acked-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> > > > Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > > > Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
> > > > Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
> > > > Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > 
> > > So this blocks the system call until all page references
> > > are gone, right?
> > 
> > Right. The alternative is to return to userspace while the network stack
> > still has a reference to the buffer which was passed in -- that's the
> > exact class of problem this patch is supposed to fix.
> 
> BTW, the increased latency and the overhead extra wakeups might for some
> workloads be greater than the cost of the data copy.

Under normal circumstances these paths should not be activated at all.
These only come into play if there are delays in the network coming from
somewhere and I would expect any negative effect from that to outweigh
either a copy or an additional wakeup.

> > 
> > >  consider a bridged setup
> > > with an skb queued at a tap device - this cause one process
> > > to block another one by virtue of not consuming a cloned skb?
> > 
> > Hmm, yes.
> > 
> > One approach might be to introduce the concept of an skb timeout to the
> > stack as a whole and cancel (or deep copy) after that timeout occurs.
> > That's going to be tricky though I suspect...
> 
> Further, an application might use signals such as SIGALARM,
> delaying them significantly will cause trouble.

AIUI there is nothing to stop the SIGALARM being delivered in a timely
manner, all which may need to be delayed is the write() returning
-EINTR. 

When -EINTR is returned the buffer must no longer be referenced either
implicitly or explicitly by the kernel and the write must not have
completed and nor should it complete in the future (that way lies
corruption of varying sorts) so copying the data pages is not helpful in
this case.

This patch ensures that the buffer is no longer referenced when the
write returns. It's possible that NFS might need to cancel a write
operation in order to not complete it after returning -EINTR (I don't
know if it does this or not) but I don't think this series impacts that
one way or the other.

> 
> > A simpler option would be to have an end points such as a tap device
> 
> Which end points would that be? Doesn't this affect a packet socket
> with matching filters? A userspace TCP socket that happens to
> reside on the same box?  It also seems that at least with a tap device
> an skb can get queued in a qdisk, also indefinitely, right?

Those are all possibilities.

In order for this patch to have any impact on any of these scenarios
those end points would have to currently be referencing and using pages
of data which have been "returned" to the originating user process and
may be changing under their feet. This is without a doubt a Bad Thing.

In the normal case it is likely that the same end point which is
injecting delay is also the one the originating process is actually
trying to talk to and so the delay would already have been present and
this patch doesn't delay things any further.

If there are parts of the stack which can end up holding onto an skb for
an arbitrary amount of time then I think that is something which needs
to be fixed up in those end points rather than in everyone who injects
an skb into the stack. Whether an immediate deep copy or a more lazy
approach is appropriate I suspect depends upon the exact use case of
each end point.

Having said that one idea which springs to mind would be to allow
someone who has injected a page into the stack to "cancel" it. Since I
am working on pushing the struct page * down into the struct
skb_destructor this could be as simple as setting the page to NULL.
However every end point would need to be taught to expect this. I'm sure
there are also all sorts of locking nightmares underlying this idea.
Perhaps you'd need separate reference counts for queued vs in active
use.

Ian.

> 
> > which can swallow skbs for arbitrary times implement a policy in this
> > regard, either to deep copy or drop after a timeout?
> > 
> > Ian.
> 
> Or deep copy immediately?
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
From: Michael S. Tsirkin @ 2011-11-14 13:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Krishna Kumar, kvm, Asias He, virtualization, gorcunov,
	Sasha Levin, netdev, mingo
In-Reply-To: <CAOJsxLE4yRO+5XncrDEzVh5-RGXG+=HGTAmQ=bX=KMsDqf=feA@mail.gmail.com>

On Mon, Nov 14, 2011 at 02:25:17PM +0200, Pekka Enberg wrote:
> On Mon, Nov 14, 2011 at 4:04 AM, Asias He <asias.hejun@gmail.com> wrote:
> > Why both the bandwidth and latency performance are dropping so dramatically
> > with multiple VQ?
> 
> What's the expected benefit from multiple VQs

Heh, the original patchset didn't mention this :) It really should.
They are supposed to speed up networking for high smp guests.

> i.e. why are doing the patches Sasha?

-- 
MST

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Dave Taht @ 2011-11-14 12:32 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <20111114114700.GA27284@hmsreliant.think-freely.org>

On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>> Data Center Bridging environments are currently somewhat limited in their
>> ability to provide a general mechanism for controlling traffic priority.
>> Specifically they are unable to administratively control the priority at which
>> various types of network traffic are sent.
>>
>> Currently, the only ways to set the priority of a network buffer are:
>>
>> 1) Through the use of the SO_PRIORITY socket option
>> 2) By using low level hooks, like a tc action
>>
>> (1) is difficult from an administrative perspective because it requires that the
>> application to be coded to not just assume the default priority is sufficient,
>> and must expose an administrative interface to allow priority adjustment.  Such
>> a solution is not scalable in a DCB environment
>>
>> (2) is also difficult, as it requires constant administrative oversight of
>> applications so as to build appropriate rules to match traffic belonging to
>> various classes, so that priority can be appropriately set. It is further
>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>> hw queues for various traffic classes and needs the priority to be set prior to
>> selecting the root qdisc)
>>
>>
>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>> being a good general solution to this problem.  The network priority cgroup
>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>> originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
>>
>> Tested by John and myself, with good results
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> CC: John Fastabend <john.r.fastabend@intel.com>
>> CC: Robert Love <robert.w.love@intel.com>
>> CC: "David S. Miller" <davem@davemloft.net>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> this change.  Any other reviews would be welcome.

Well, in part I've been playing catchup in the hope that lldp and
openlldp and/or this dcb netlink layer that I don't know anything
about (pointers please?) could help somehow to resolve the semantic
mess skb->priority has become in the first place.

I liked what was described here.

"What if we did at least carve out the DCB functionality away from
skb->priority?  Since, AIUI, we're only concerning ourselves with
locally generated traffic here, we're talking
about skbs that have a socket attached to them.  We could, instead of indexing
the prio_tc_map with skb->priority, we could index it with
skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
something more appropriately named), and we don't have to touch skb->priority at
all.  I'd really rather not start down that road until I got more opinions and
consensus on that, but it seems like a pretty good solution, one that would
allow hardware queue selection in systems that use things like DCB to co-exist
with software queueing features."

The piece that still kind of bothered me about the original proposal
(and perhaps this one) was that setting SO_PRIORITY in an app means
'give my packets more mojo'.

Taking something that took unprioritized packets and assigned them and
*them only* to a hardware queue struck me as possibly deprioritizing
the 'more mojo wanted' packets in the app(s), as they would end up in
some other, possibly overloaded, hardware queue.

So a cgroup that moves all of the packets from an application into a
given hardware queue, and then gets scheduled normally according to
skb->priority and friends (software queue, default of pfifo_fast,
etc), seems to make some sense to me. (I wouldn't mind if we had
abstractions for software queues, too, like, I need a software queue
with these properties, find me a place for it on the hardware - but
I'm dreaming)

One open question is where do packets generated from other subsystems
end up, if you are using a cgroup for the app? arp, dns, etc?

So to rephrase your original description from this:

>> Any traffic originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
> John, Robert, if you're supportive of these changes, some Acks would be
> appreciated.

To this:

"Any traffic originating from an application in a cgroup,  will have
its hardware queue  assigned to the value designated for that group on
that interface.  This allows a user space daemon, when conducting LLDP
negotiation with a DCB enabled peer to create a cgroup based on the
APP_TLV value received and administratively assign applications to
that hardware queue using the existing cgroup utility infrastructure."

Assuming we're on the same page here, what the heck is APP_TLV?

> John, Robert, if you're supportive of these changes, some Acks would be
> appreciated.

>
>
> Regards
> Neil
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox