Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
From: Stefan Schmidt @ 2016-12-01 21:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org, linux-wpan@vger.kernel.org
In-Reply-To: <1477647723-14641-1-git-send-email-stefan@datenfreihafen.org>

Hello.

On 28.10.2016 11:42, Stefan Schmidt wrote:
> Both types have been missing here and thus ip always showed
> only the numbers.
> 
> Based on a suggestion from Alexander Aring.
> 
> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Did you somehow mangle this patch manually?

Looking at the patch in your git repo it shows no author name but
instead just my patch was send with git format-patch and git send-email
as usual and shows the right author. Was there something worn on my side
or yours? Just checking to avoid it in the future.

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=8ae2c5382bd9d98a8f7ddcb1faad1a978d773909

regards
Stefan Schmidt

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Alexander Duyck @ 2016-12-01 21:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <1480402716.18162.124.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26         eth0      0.00 5822800.00      0.00 597064.41      0.00      0.00      1.00
> 22:43:27         eth0     24.00 5788237.00      2.09 593520.26      0.00      0.00      0.00
> 22:43:28         eth0     12.00 5817777.00      1.43 596551.47      0.00      0.00      1.00
> 22:43:29         eth0     22.00 5841516.00      1.61 598982.87      0.00      0.00      0.00
> 22:43:30         eth0      4.00 4389137.00      0.71 450058.08      0.00      0.00      1.00
> 22:43:31         eth0      4.00 5871008.00      0.72 602007.79      0.00      0.00      0.00
> 22:43:32         eth0     12.00 5891809.00      1.43 604142.60      0.00      0.00      1.00
> 22:43:33         eth0     10.00 5901904.00      1.12 605175.70      0.00      0.00      0.00
> 22:43:34         eth0      5.00 5907982.00      0.69 605798.99      0.00      0.00      1.00
> 22:43:35         eth0      2.00 5847086.00      0.12 599554.71      0.00      0.00      0.00
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47         eth0      9.00 10226424.00      1.02 1048608.05      0.00      0.00      1.00
> 22:43:48         eth0      1.00 10316955.00      0.06 1057890.89      0.00      0.00      0.00
> 22:43:49         eth0      1.00 10310104.00      0.10 1057188.32      0.00      0.00      1.00
> 22:43:50         eth0      0.00 10249423.00      0.00 1050966.23      0.00      0.00      0.00
> 22:43:51         eth0      0.00 10210441.00      0.00 1046969.05      0.00      0.00      1.00
> 22:43:52         eth0      2.00 10198389.00      0.16 1045733.17      0.00      0.00      1.00
> 22:43:53         eth0      8.00 10079257.00      1.43 1033517.83      0.00      0.00      0.00
> 22:43:54         eth0      0.00 7693509.00      0.00 788885.16      0.00      0.00      0.00
> 22:43:55         eth0      2.00 10343076.00      0.20 1060569.32      0.00      0.00      1.00
> 22:43:56         eth0      1.00 10224571.00      0.14 1048417.93      0.00      0.00      0.00
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
>
> And about 11 % improvement on an mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too
much latency.  I was just wondering what the latency impact is on a
change like this and if this now isn't that much of an issue?

>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47         eth0      3.00 1355422.00      0.45 319706.04      0.00      0.00      0.00
> 22:50:48         eth0      2.00 1344270.00      0.42 317035.21      0.00      0.00      1.00
> 22:50:49         eth0      3.00 1350503.00      0.51 318478.34      0.00      0.00      0.00
> 22:50:50         eth0     29.00 1348593.00      2.86 318113.02      0.00      0.00      1.00
> 22:50:51         eth0     14.00 1354855.00      1.83 319508.56      0.00      0.00      0.00
> 22:50:52         eth0      7.00 1357794.00      0.73 320226.89      0.00      0.00      1.00
> 22:50:53         eth0      5.00 1326130.00      0.63 312784.72      0.00      0.00      0.00
> 22:50:54         eth0      2.00 994584.00      0.12 234598.40      0.00      0.00      1.00
> 22:50:55         eth0      5.00 1318209.00      0.75 310932.46      0.00      0.00      0.00
> 22:50:56         eth0     20.00 1323445.00      1.73 312178.11      0.00      0.00      1.00
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:51:03         eth0      4.00 1512055.00      0.54 356599.40      0.00      0.00      0.00
> 22:51:04         eth0      4.00 1507631.00      0.55 355609.46      0.00      0.00      1.00
> 22:51:05         eth0      4.00 1487789.00      0.42 350917.47      0.00      0.00      0.00
> 22:51:06         eth0      7.00 1474460.00      1.22 347811.16      0.00      0.00      1.00
> 22:51:07         eth0      2.00 1496529.00      0.24 352995.18      0.00      0.00      0.00
> 22:51:08         eth0      3.00 1485856.00      0.49 350425.65      0.00      0.00      1.00
> 22:51:09         eth0      1.00 1114808.00      0.06 262905.38      0.00      0.00      0.00
> 22:51:10         eth0      2.00 1510924.00      0.30 356397.53      0.00      0.00      1.00
> 22:51:11         eth0      2.00 1506408.00      0.30 355345.76      0.00      0.00      0.00
> 22:51:12         eth0      2.00 1499122.00      0.32 353668.75      0.00      0.00      1.00
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
>
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>

Just wondering if this should be a separate change?  Seems like this
is something that might apply outside of your changes since it seems
like it is reserving enough room for 2 buffers.

>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>

With the use of READ_ONCE are there some barriers that are going to
need to be added to make sure these are populated in the correct
spots?  Any ordering constraints on the updates these fields?

>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +

I realize you are just copying from the function further down, but
couldn't this just be __raw_writel instead of iowrite32?

Also you may be able to squeeze a bit more out of this by doing some
barrier clean-ups.  In most drivers the wmb() is needed between writes
of the DMA memory and the MMIO memory.  So odds are you could hold off
on using a wmb() until you call this function and you wouldn't really
need it until just before the call to iowrite32 or __raw_writel.  The
rest of it could use either an smp_wmb() or dma_wmb().

>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>

It might be more readable to put the addition on ring_cons out in
front of all the WRITE_ONCE macro calls.

>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();
> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Murali Karicheri @ 2016-12-01 21:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Joakim Tjernlund, netdev@vger.kernel.org, Roger Quadros,
	Grygorii Strashko
In-Reply-To: <20161201173100.GG21887@lunn.ch>

Hi Andrew,
On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> Hi Murali 
> 
>> 2. Switch mode where it implements a simple Ethernet switch. Currently
>>    it doesn't have address learning capability, but in future it
>>    can.
> 
> If it does not have address learning capabilities, does it act like a
> plain old hub? What comes in one port goes out all others?

Thanks for the response!

Yes. It is a plain hub. it replicates frame to both ports. So need to
run a bridge layer for address learning in software.

> 
> Or can you do the learning in software on the host and program tables,
> which the hardware then uses?
> 

I think not. I see we have a non Linux implementation that does address
learning in software using a hash table and look up MAC for each packet
to see which port it needs to be sent to.

Murali

>> 3. Switch with HSR/PRP offload where it provides HSR/PRP protocol
>>    support and cut through switch.
>>
>> So a device need to function in one of the modes. A a regular Ethernet
>> driver that provides two network devices, one per port, and switchdev
>> for each physical port (in switch mode) will look ideal in this case.
> 
> Yes, this seems the right model to use.
> 
>      Andrew
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 21:47 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <CALx6S37aknqfrj66AP8hEfVT9X2OmBFK9GVa9A+0FpydPbm9kg@mail.gmail.com>

On 12/01/2016 12:18 PM, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>> Just how much per-packet path-length are you thinking will go away under the
>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
>> non-trivial things to effective overhead (service demand) and so throughput:
>>
> For plain in order TCP packets I believe we should be able process
> each packet at nearly same speed as GRO. Most of the protocol
> processing we do between GRO and the stack are the same, the
> differences are that we need to do a connection lookup in the stack
> path (note we now do this is UDP GRO and that hasn't show up as a
> major hit). We also need to consider enqueue/dequeue on the socket
> which is a major reason to try for lockless sockets in this instance.

So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be ~ 
(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation which unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc 
etc.

> Sure, but trying running something emulates a more realistic workload
> than a TCP stream, like RR test with relative small payload and many
> connections.

That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is a sufficiently common so I'll 
grasp at that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?

I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly 
goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
16384  87380

^ permalink raw reply

* [PATCH] net: ethernet: ti: davinci_cpdma: sanitize inter-module API
From: Arnd Bergmann @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Mugunthan V N
  Cc: Arnd Bergmann, Grygorii Strashko, David S. Miller,
	Ivan Khoronzhuk, Uwe Kleine-König, linux-omap, netdev,
	linux-kernel

The davinci_cpdma module is a helper library that is used by the
actual device drivers and does nothing by itself, so all its API
functions need to be exported.

Four new functions were added recently without an export, so now
we get a build error:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

This exports those symbols. After taking a closer look, I found two
more global functions in this file that are not exported and not used
anywhere, so they can obviously be removed. There is also one function
that is only used internally in this module, and should be 'static'.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Fixes: 0fc6432cc78d ("net: ethernet: ti: davinci_cpdma: add weight function for channels")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 63 +++++++++++----------------------
 drivers/net/ethernet/ti/davinci_cpdma.h |  4 ---
 2 files changed, 21 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..b9d40f0cdf6c 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -628,6 +628,23 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);
 
+static int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+	if (chan->state != CPDMA_STATE_ACTIVE) {
+		spin_unlock_irqrestore(&chan->lock, flags);
+		return -EINVAL;
+	}
+
+	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
+		      chan->mask);
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return 0;
+}
+
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable)
 {
 	unsigned long flags;
@@ -796,6 +813,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);
 
 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +828,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)
 
 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);
 
 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +872,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);
 
 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +885,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 
 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);
 
 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
@@ -1270,46 +1291,4 @@ int cpdma_chan_stop(struct cpdma_chan *chan)
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_stop);
 
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->lock, flags);
-	if (chan->state != CPDMA_STATE_ACTIVE) {
-		spin_unlock_irqrestore(&chan->lock, flags);
-		return -EINVAL;
-	}
-
-	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
-		      chan->mask);
-	spin_unlock_irqrestore(&chan->lock, flags);
-
-	return 0;
-}
-
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_get(ctlr, control);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_set(ctlr, control, value);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(cpdma_control_set);
-
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db2abab..36d0a09a3d44 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -87,7 +87,6 @@ int cpdma_chan_process(struct cpdma_chan *chan, int quota);
 
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable);
 void cpdma_ctlr_eoi(struct cpdma_ctlr *ctlr, u32 value);
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable);
 u32 cpdma_ctrl_rxchs_state(struct cpdma_ctlr *ctlr);
 u32 cpdma_ctrl_txchs_state(struct cpdma_ctlr *ctlr);
 bool cpdma_check_free_tx_desc(struct cpdma_chan *chan);
@@ -111,7 +110,4 @@ enum cpdma_control {
 	CPDMA_RX_BUFFER_OFFSET,		/* read-write */
 };
 
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control);
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value);
-
 #endif
-- 
2.9.0

^ permalink raw reply related

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <9b4264f8-26b9-a611-56f0-0840cecf9c44@stressinduktion.org>

On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to solve
>>> but I fear we could get caught up in maintenance problems in the long
>>> term given the ideas floating around on how to evolve XDP currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
>>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>>>> XDP manipulates packets at free will and thus all security guarantees
>>>>> are off as well as in any user space solution.
>>>>>
>>>>> Secondly user space provides policy, acl, more controlled memory
>>>>> protection, restartability and better debugability. If I had multi
>>>>> tenant workloads I would definitely put more complex "business/acl"
>>>>> logic into user space, so I can make use of LSM and other features to
>>>>> especially prevent a network facing service to attack the tenants. If
>>>>> stuff gets put into the kernel you run user controlled code in the
>>>>> kernel exposing a much bigger attack vector.
>>>>>
>>>>> What use case do you see in XDP specifically e.g. for container networking?
>>>>
>>>> DDOS mitigation to protect distributed applications in large clusters.
>>>> Relying on CDN works to protect API gateways and frontends (as long as
>>>> they don't throw you out of their network) but offers no protection
>>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>>> level and allowing the mitigation capability to scale up with the number
>>>> of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane? Are there custom proprietary protocols in
>>> place which need custom protocol parsers that need involvement of the
>>> kernel before it could verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as good as we can in
>>> Linux to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now or are we talking
>>> about better performance?
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic which is actually
> being referred in your linked slide deck on page 9 ("What do floods look
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>
I agree that eBPF is not going to be sufficient from everything we'll
want to do. Undoubtably, we'll continue see new addition of more
helpers to assist in processing, but at some point we will want a to
load a kernel module that handles more complex processing and insert
it at the XDP callout. Nothing in the design of XDP precludes doing
that and I have already posted the patches to generalize the XDP
callout for that. Taking either of these routes has tradeoffs, but
regardless of whether this is BPF or module code, the principles of
XDP and its value to help solve some class of problems remains.

 Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also under the aspect of what Thomas said, that the outside
> world of the cluster is already protected by a CDN.
>
> Bye,
> Hannes
>

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:57 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130182243.cpoyoqvxuv2bnn4y@splinter.mtl.com>

On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>>>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>
>>>>> Make sure the device has a complete view of the FIB tables by invoking
>>>>> their dump during module init.
>>>>>
>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>> ---
>>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>>>  1 file changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> index 14bed1d..d176047 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>>>  	return NOTIFY_DONE;
>>>>>  }
>>>>>  
>>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>>>> +{
>>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>>>> +
>>>>> +	/* Flush pending FIB notifications and then flush the device's
>>>>> +	 * table before requesting another dump. Do that with RTNL held,
>>>>> +	 * as FIB notification block is already registered.
>>>>> +	 */
>>>>> +	mlxsw_core_flush_owq();
>>>>> +	rtnl_lock();
>>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>>>> +	rtnl_unlock();
>>>>> +}
>>>>> +
>>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  {
>>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>>>  	int err;
>>>>>  
>>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  
>>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>>>
>>>> Sorry to pick in here again:
>>>>
>>>> There is a race here. You need to protect the registration of the fib
>>>> notifier as well by the sequence counter. Updates here are not ordered
>>>> in relation to this code below.
>>>
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering deletes and
>> getting an add from the RCU code you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw and later add the node which was
>> actually deleted but just reordered which happend on another CPU, no?
> 
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Ups, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
or synchronization with the reader side. Thus you might get a callback
from the notifier for a delete event on the one CPU and you end up
queueing this fib entry after the delete queue, because the RCU walk
isn't protected by any means.

Looking closer at this series again, I overlooked the fact that you
fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
orders fetching of fib_seq and thus the RCU dumping after any concurrent
executing fib table update, also the mutex_lock and unlock provide
proper acquire and release fences, so the CPU indeed sees the effect of
a RCU_INIT_POINTER update done on another CPU, because they pair with
the rtnl_unlock which might happen on the other CPU.

My question is if this is a bit of luck and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g. if we some day
actually we optimize the fib_seq code? Otherwise we might probably
document this fact. :)

>>> I've a follow up patchset that introduces a new event in switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved / released  from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidentally abort
>> calls actually. E.g. if I start to inject routes into my routing domain
>> how can I make sure the box doesn't die after I try to insert enough
>> routes. Do we need to touch quagga etc?
> 
> The whole point of moving abort mechanism to the driver is that the
> system won't die, but instead routing will be done in the kernel. If you
> respect hardware limitations, then there's no reason for abort mechanism
> to kick in.

Quick follow-up question: How can I quickly find out the hw limitations
via the kernel api?

Thanks,
Hannes

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <CAKgT0Uf_EeWYqGinyV6U1r-Jqd07R9gN-7GeA82emk4EoN=1mQ@mail.gmail.com>

On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:

> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency.  I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we can not use this trick without admin
choice.

In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq", or simply a choice for hosts dedicated to content
delivery (like video servers) with 40GBit+ cards.

Under stress, softirq can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders.

1) <maybe> timestamps
  use a ktime_get_ns() to remember a timestamp for the last doorbell,
  and force a doorbell if it gets too old, checked in ndo_start_xmit()
  every time we would like to not send the doorbell.

2) <maybe> max numbers of packets not yet doorbelled.

But we can not rely on another high resolution timer, since they also
require softirq being processed timely.

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:10 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <20161201.152029.2045474106824619667.davem@davemloft.net>

On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 01 Dec 2016 09:04:17 -0800
> 
> > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> > 
> >> When qdisc layer or trafgen/af_packet see this indication it knows it
> >> should/must flush the queue when it don't have more work left.  Perhaps
> >> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> >> is called and queue is empty then check if queue needs a flush. I would
> >> also allow driver to flush and clear this bit.
> > 
> > net_tx_action() is not normally called, unless BQL limit is hit and/or
> > some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> The one thing I wonder about is whether we should "ramp up" into a mode
> where the NAPI poll does the doorbells instead of going directly there.
> 
> Maybe I misunderstand your algorithm, but it looks to me like if there
> are any active packets in the TX queue at enqueue time you will defer
> the doorbell to the interrupt handler.
> 
> Let's say we put 1 packet in, and hit the doorbell.
> 
> Then another packet comes in and we defer the doorbell to the IRQ.
> 
> At this point there are a couple things I'm unclear about.


> 
> For example, if we didn't hit the doorbell, will the chip still take a
> peek at the second descriptor?  Depending upon how the doorbell works
> it might, or it might not.

It might depend on the hardware. I can easily check on mlx4, by
increasing tx-usecs and tx-frames, and sending 2 packets back to back.

> 
> Either way, wouldn't there be a possible condition where the chip
> wouldn't see the second enqueued packet and we'd thus have the wire
> idle until the interrupt + NAPI runs and hits the doorbell?
> 
> This is why I think we should "ramp up" the doorbell deferral, in
> order to avoid this potential wire idle time situation.


> 
> Maybe the situation I'm worried about is not possible, so please
> explain it to me :-)

This is absolutely the problem. We might need to enable this mode only
above a given load. We could have an EWMA of the number of packets
that TX completion runs can dequeue. And enable auto doorbell only if we
have that many packets in the TX ring (instead of the  "1 packet
threshold" of the WIP)

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 22:12 UTC (permalink / raw)
  To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <1e88fd64-0045-beb5-101a-a55b8f54bd08@hpe.com>

On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 12:18 PM, Tom Herbert wrote:
>>
>> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>>>
>>> Just how much per-packet path-length are you thinking will go away under
>>> the
>>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does
>>> some
>>> non-trivial things to effective overhead (service demand) and so
>>> throughput:
>>>
>> For plain in order TCP packets I believe we should be able process
>> each packet at nearly same speed as GRO. Most of the protocol
>> processing we do between GRO and the stack are the same, the
>> differences are that we need to do a connection lookup in the stack
>> path (note we now do this is UDP GRO and that hasn't show up as a
>> major hit). We also need to consider enqueue/dequeue on the socket
>> which is a major reason to try for lockless sockets in this instance.
>
>
> So waving hands a bit, and taking the service demand for the GRO-on receive
> test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or
> ~1.216 usec of CPU time per TCP segment, including ACK generation which
> unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in
> place would be for every-other segment. Etc etc.
>
>> Sure, but trying running something emulates a more realistic workload
>> than a TCP stream, like RR test with relative small payload and many
>> connections.
>
>
> That is a good point, which of course is why the RR tests are there in
> netperf :) Don't get me wrong, I *like* seeing path-length reductions. What
> would you posit is a relatively small payload?  The promotion of IR10
> suggests that perhaps 14KB or so is a sufficiently common so I'll grasp at
> that as the length of a piece of string:
>
We have consider both request size and response side in RPC.
Presumably, something like a memcache server is most serving data as
opposed to reading it, we are looking to receiving much smaller
packets than being sent. Requests are going to be quite small say 100
bytes and unless we are doing significant amount of pipelining on
connections GRO would rarely kick-in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size, 8K, 16K, 64K response sizes might be good for test.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
> 16384  87380
>
> So, losing GRO doubled the service demand.  I suppose I could see cutting
> path-length in half based on the things you listed which would be bypassed?
>
> I'm sure mileage will vary with different NICs and CPUs.  The ones used here
> happened to be to hand.
>
This is also biased because you're using a single connection, but is
consistent with data we've seen in the past. To be clear I'm not
saying GRO is bad, the fact that GRO has such a visible impact in your
test means that the GRO path is significantly more efficient. Closing
that gap seen in your numbers would be a benefit, that means we have
improved per packet processing.

Tom

> happy benchmarking,
>
> rick
>
> Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
> goes to more than doubling, and halving to 7K narrows the delta:
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
> 16384  87380
>
>
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
> 16384  87380
>
>

^ permalink raw reply

* Re: [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
From: Bjørn Mork @ 2016-12-01 22:13 UTC (permalink / raw)
  To: Daniele Palmas; +Cc: netdev
In-Reply-To: <1480607525-23044-2-git-send-email-dnlplm@gmail.com>

Daniele Palmas <dnlplm@gmail.com> writes:

> This patch adds support for PID 0x1040 of Telit LE922A.

LGTM

Acked-by: Bjørn Mork <bjorn@mork.no>

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
From: Paolo Abeni @ 2016-12-01 22:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev@vger.kernel.org, Hannes Frederic Sowa
In-Reply-To: <20161201183402.2fbb8c5b@redhat.com>

On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
> 
> On Wed, 30 Nov 2016 16:35:20 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >   
> > 
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
> 
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops.  We have two processes netperf and netserver, that are sending
> packets between each-other (UDP_STREAM mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> does not get executed equally fast, thus a queue is need absorb the
> fluctuations.
> 
> The network stack is even partly catching your config "mistake" and
> increase the socket queue size, so we minimum can handle one max frame
> (due skb "truesize" concept approx PAGE_SIZE + overhead).
> 
> Hopefully for localhost testing a small queue should hopefully not
> result in packet drops.  Testing... ups, this does result in packet
> drops.
> 
> Test command extracted from mmtests, UDP_STREAM size 1024:
> 
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
>    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> 
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
>   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages                
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>    4608    1024   60.00     50024301      0    6829.98
>    2560           60.00     46133211           6298.72
> 
>  Dropped packets: 50024301-46133211=3891090
> 
> To get a better drop indication, during this I run a command, to get
> system-wide network counters from the last second, so below numbers are
> per second.
> 
>  $ nstat > /dev/null && sleep 1  && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
> 
> So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because receiving
> process didn't empty it's queue fast enough see [2].
> 
> Although upstream changes are coming in this area, [2] is replaced with
> __udp_enqueue_schedule_skb, which I actually tested with... hmm
> 
> Retesting with kernel 4.7.0-baseline+ ... show something else. 
> To Paolo, you might want to look into this.  And it could also explain why
> I've not see the mentioned speedup by mm-change, as I've been testing
> this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that the commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict while checking the rcvbuf. 

For very small value of rcvbuf, it allows a single skb to be enqueued,
while previously we allowed 2 of them to enter the queue, even if the
first one truesize exceeded rcvbuf, as in your test-case.

Can you please try the following patch ?

Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
+	int limit;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem && (rmem + size > sk->sk_rcvbuf))
+	limit = size + sk->sk_rcvbuf;
+	if (rmem > limit)
 		goto drop;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
 	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+	if (rmem > limit)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Hannes Frederic Sowa @ 2016-12-01 22:28 UTC (permalink / raw)
  To: Erik Nordmark, davem; +Cc: netdev, Bob Gilligan
In-Reply-To: <1480549159-8142-1-git-send-email-nordmark@arista.com>

On 01.12.2016 00:39, Erik Nordmark wrote:
> Implemented RFC7527 Enhanced DAD.
> IPv6 duplicate address detection can fail if there is some temporary
> loopback of Ethernet frames. RFC7527 solves this by including a random
> nonce in the NS messages used for DAD, and if an NS is received with the
> same nonce it is assumed to be a looped back DAD probe and is ignored.
> RFC7527 is enabled by default. Can be disabled by setting both of
> conf/{all,interface}/enhanced_dad to zero.
> 
> Signed-off-by: Erik Nordmark <nordmark@arista.com>
> Signed-off-by: Bob Gilligan <gilligan@arista.com>
> ---

Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks!
> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  			if (dad) {
> +				if (nonce != 0 && ifp->dad_nonce == nonce) {
> +					u8 *np = (u8 *)&nonce;
> +					/* Matching nonce if looped back */
> +					ND_PRINTK(2, notice,
> +						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +						  ifp->idev->dev->name,
> +						  &ifp->addr,
> +						  np[0], np[1], np[2], np[3],
> +						  np[4], np[5]);
> +					goto out;
> +				}
>  				/*
>  				 * We are colliding with another node
>  				 * who is doing DAD
> 

I think it could be a "%pM" because it looks like a MAC address, but
better leave it like that. :)

Bye,
Hannes

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Hannes Frederic Sowa @ 2016-12-01 22:47 UTC (permalink / raw)
  To: Tom Herbert, Florian Westphal
  Cc: Linux Kernel Network Developers, Jesper Dangaard Brouer
In-Reply-To: <CALx6S36ywu3ruY7AFKYk=N4Ekr5zjY33ivx92EgNNT36XoXhFA@mail.gmail.com>

Side note:

On 01.12.2016 20:51, Tom Herbert wrote:
>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>> > (where does that come from? should make no difference if a 32 or
>> >  320 byte buffer gets allocated).
>> >
> It's the zero'ing of three cache lines. I believe we talked about that
> as netdev.

Jesper and me played with that again very recently:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590

In micro-benchmarks we saw a pretty good speed up not using the rep
stosb generated by gcc builtin but plain movq's. Probably the cost model
for __builtin_memset in gcc is wrong?

When Jesper is free we wanted to benchmark this and maybe come up with a
arch specific way of cleaning if it turns out to really improve throughput.

SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
all the benefits.

Bye,
Hannes

^ permalink raw reply

* stmmac: turn coalescing / NAPI off in stmmac
From: Pavel Machek @ 2016-12-01 22:48 UTC (permalink / raw)
  To: David Miller, alexandre.torgue; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161201.152303.425589678238707778.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 7554 bytes --]


> > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct net_device *dev,
> >  		features &= ~NETIF_F_CSUM_MASK;
> >  
> >  	/* Disable tso if asked by ethtool */
> > -	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) {
> > -		if (features & NETIF_F_TSO)
> > -			priv->tso = true;
> > -		else
> > -			priv->tso = false;
> > -	}
> > +	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen))
> > +		priv->tso = !!(features & NETIF_F_TSO);
> >  
> 
> Pavel, this really seems arbitrary.
> 
> Whilst I really appreciate you're looking into this driver a bit because
> of some issues you are trying to resolve, I'd like to ask that you not
> start bombarding me with nit-pick cleanups here and there and instead
> concentrate on the real bug or issue.

Well, fixing clean code is easier than fixing strange code... Plus I
was hoping to make the mainainers to talk. The driver is listed as
supported after all.

Anyway... since you asked. I belive I have way to disable NAPI / tx
coalescing in the driver. Unfortunately, locking is missing on the rx
path, and needs to be extended to _irqsave variant on tx path.

So patch currently looks like this (hand edited, can't be
applied, got it working few hours ago). Does it look acceptable?

I'd prefer this to go after the patch that pulls common code to single
place, so that single place needs to be patched. Plus I guess I should
add ifdefs, so that more advanced NAPI / tx coalescing code can be
reactivated when it is fixed. Trivial fixes can go on top. Does that
sound like a plan?

Which tree do you want patches against?

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ?

Best regards,
								Pavel


diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 0b706a7..c0016c8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1395,9 +1397,10 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv)
 
 static void stmmac_tx_clean(struct stmmac_priv *priv)
 {
-	spin_lock(&priv->tx_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&priv->tx_lock, flags);
 	__stmmac_tx_clean(priv);
-	spin_unlock(&priv->tx_lock);	
+	spin_unlock_irqrestore(&priv->tx_lock, flags);	
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -1441,6 +1444,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
 	netif_wake_queue(priv->dev);
 }
 
+static int stmmac_rx(struct stmmac_priv *priv, int limit);
+
 /**
  * stmmac_dma_interrupt - DMA ISR
  * @priv: driver private structure
@@ -1452,10 +1457,17 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 {
 	int status;
 	int rxfifosz = priv->plat->rx_fifo_size;
+	unsigned long flags;
 
 	status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats);
 	if (likely((status & handle_rx)) || (status & handle_tx)) {
+		int r;
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		r = stmmac_rx(priv, 999);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);		
+#if 0
 		if (likely(napi_schedule_prep(&priv->napi))) {
 			//pr_err("napi: schedule\n");
 			stmmac_disable_dma_irq(priv);
@@ -1463,7 +1475,8 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 		} else
 			pr_err("napi: schedule failed\n");
 #endif
+		stmmac_tx_clean(priv);
 	}
 	if (unlikely(status & tx_hard_error_bump_tc)) {
 		/* Try to bump up the dma threshold on this failure */
@@ -1638,7 +1651,7 @@ static void stmmac_tx_timer(unsigned long data)
 {
 	struct stmmac_priv *priv = (struct stmmac_priv *)data;
 
-	stmmac_tx_clean(priv);
+	//stmmac_tx_clean(priv);
 }
 
 /**
@@ -1990,6 +2003,8 @@ static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int
 	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
 		mod_timer(&priv->txtimer,
 			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
+		priv->hw->desc->set_tx_ic(desc);
+		priv->xstats.tx_set_ic_bit++;
 	} else {
 		priv->tx_count_frames = 0;
 		priv->hw->desc->set_tx_ic(desc);
@@ -2038,8 +2053,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct dma_desc *desc, *first, *mss_desc = NULL;
 	u8 proto_hdr_len;
 	int i;
+	unsigned long flags;
 
-	spin_lock(&priv->tx_lock);
+	spin_lock_irqsave(&priv->tx_lock, flags);
 
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
@@ -2052,7 +2068,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2168,11 +2184,11 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2197,6 +2213,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct dma_desc *desc, *first;
 	unsigned int enh_desc;
 	unsigned int des;
+	unsigned int flags;
 
 	/* Manage oversized TCP frames for GMAC4 device */
 	if (skb_is_gso(skb) && priv->tso) {
@@ -2204,7 +2221,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
+	spin_lock_irqsave(&priv->tx_lock, flags);
 
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
 		if (!netif_queue_stopped(dev)) {
@@ -2212,7 +2229,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2347,11 +2364,11 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2634,7 +2651,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit)
 			else
 				skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-			napi_gro_receive(&priv->napi, skb);
+			//napi_gro_receive(&priv->napi, skb);
+			netif_rx(skb);
 
 			priv->dev->stats.rx_packets++;
 			priv->dev->stats.rx_bytes += frame_len;
@@ -2662,6 +2680,7 @@ static int stmmac_poll(struct napi_struct *napi, int budget)
 	struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi);
 	int work_done;
 
+	BUG();
 	priv->xstats.napi_poll++;
 	stmmac_tx_clean(priv);
 



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related

* Re: Initial thoughts on TXDP
From: Hannes Frederic Sowa @ 2016-12-01 22:55 UTC (permalink / raw)
  To: Sowmini Varadhan, Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201201324.GJ24547@oracle.com>

On 01.12.2016 21:13, Sowmini Varadhan wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
> 
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"?  Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
>  http://www.slideshare.net/shemminger/dpdk-performance
>
>>> and one other critical difference from the hot-potato-forwarding
>>> model (the sort of OVS model that DPDK etc might aruguably be a fit for)
>>> does not apply: in order to figure out the ethernet and IP headers
>>> in the response correctly at all times (in the face of things like VRRP,
>>> gw changes, gw's mac addr changes etc) the application should really
>>> be listening on NETLINK sockets for modifications to the networking
>>> state - again points to needing a select() socket set where you can
>>> have both the I/O fds and the netlink socket,
>>>
>> I would think that that is management would not be implemented in a
>> fast path processing thread for an application.
> 
> sure, but my point was that *XDP and other stack-bypass methods needs 
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)

Busypoll on steroids is what windows does by mapping the user space
"doorbell" into a vDSO and let user space loop on that maybe with
MWAIT/MONITOR. The interesting thing is that you can map other events to
this notification event, too. It sounds like a usable idea to me and
reassembles what we already do with futexes.

Bye,
Hannes

^ permalink raw reply

* [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Mahesh Bandewar @ 2016-12-01 22:56 UTC (permalink / raw)
  To: netdev, Eric Dumazet, David Miller; +Cc: Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

If initial broadcast probe(s) is/are lost, the neigh entry wont have
valid address of the neighbour. In a situation like this, the fall
back should be to send a broadcast probe, however the code logic
continues sending ucast probes to 00:00:00:00:00:00. The default value
of ucast probes is 3 so system usually recovers after three such probes
but if the value configured is larger it takes those many probes
(a probe is sent every second in default config) / seconds to recover
making machine not-available on the network.

This patch just ensures that the unicast address is not NULL otherwise
falls back to sending broadcast probe.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 net/ipv4/arp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 89a8cac4726a..56fb33d5ed31 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	__be32 saddr = 0;
 	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
+	u8 null_dev_hw_addr[MAX_ADDR_LEN];
 	struct net_device *dev = neigh->dev;
 	__be32 target = *(__be32 *)neigh->primary_key;
 	int probes = atomic_read(&neigh->probes);
@@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 
 	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
 	if (probes < 0) {
+		memset(&null_dev_hw_addr, 0, dev->addr_len);
 		if (!(neigh->nud_state & NUD_VALID))
 			pr_debug("trying to ucast probe in NUD_INVALID\n");
 		neigh_ha_snapshot(dst_ha, neigh, dev);
-		dst_hw = dst_ha;
+		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
+			dst_hw = dst_ha;
 	} else {
 		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
 		if (probes < 0) {
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH 1/3] netns: publish net_generic correctly
From: Alexey Dobriyan @ 2016-12-02  1:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.

	memcpy
	rcu_assign_pointer
	kfree_rcu
	ng->ptr[id - 1] = data;

This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)

How it didn't trigger for so long?
Well, you need quite specific set of conditions:

*) race window opens once per pernet subsystem addition
   (read: modprobe or boot)

*) not every pernet subsystem is eligible (need ->id and ->size)

*) not every pernet subsystem is vulnerable (need incorrect or absense
   of ordering of register_pernet_sybsys() and actually using net_generic())

*) to hide the bug even more, default is to preallocate 13 pointers which
   is actually quite a lot. You need IPv6, netfilter, bridging etc together
   loaded to trigger reallocation in the first place. Trimmed down
   config are OK.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 net/core/net_namespace.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -64,9 +64,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	ng = old_ng;
-	if (old_ng->len >= id)
-		goto assign;
+	if (old_ng->len >= id) {
+		old_ng->ptr[id - 1] = data;
+		return 0;
+	}
 
 	ng = net_alloc_generic();
 	if (ng == NULL)
@@ -84,11 +85,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 */
 
 	memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*));
+	ng->ptr[id - 1] = data;
 
 	rcu_assign_pointer(net->gen, ng);
 	kfree_rcu(old_ng, rcu);
-assign:
-	ng->ptr[id - 1] = data;
 	return 0;
 }
 

^ permalink raw reply

* [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
From: Alexey Dobriyan @ 2016-12-02  1:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

This is precursor to fixing "[id - 1]" bloat inside net_generic().

Name "s" is chosen to complement name "u" often used for dummy unions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netns/generic.h |    6 ++++--
 net/core/net_namespace.c    |    8 ++++----
 2 files changed, 8 insertions(+), 6 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,8 +25,10 @@
  */
 
 struct net_generic {
-	unsigned int len;
-	struct rcu_head rcu;
+	struct {
+		unsigned int len;
+		struct rcu_head rcu;
+	} s;
 
 	void *ptr[0];
 };
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -50,7 +50,7 @@ static struct net_generic *net_alloc_generic(void)
 
 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
-		ng->len = max_gen_ptrs;
+		ng->s.len = max_gen_ptrs;
 
 	return ng;
 }
@@ -64,7 +64,7 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	if (old_ng->len >= id) {
+	if (old_ng->s.len >= id) {
 		old_ng->ptr[id - 1] = data;
 		return 0;
 	}
@@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 * the old copy for kfree after a grace period.
 	 */
 
-	memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*));
+	memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*));
 	ng->ptr[id - 1] = data;
 
 	rcu_assign_pointer(net->gen, ng);
-	kfree_rcu(old_ng, rcu);
+	kfree_rcu(old_ng, s.rcu);
 	return 0;
 }
 

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 23:14 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <ea2a8856-59d8-1b5f-9428-b16efa8f6979@stressinduktion.org>

On Thu, Dec 01, 2016 at 10:57:52PM +0100, Hannes Frederic Sowa wrote:
> On 30.11.2016 19:22, Ido Schimmel wrote:
> > On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
> >> On 30.11.2016 17:32, Ido Schimmel wrote:
> >>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
> >>>> On 30.11.2016 11:09, Jiri Pirko wrote:
> >>>>> From: Ido Schimmel <idosch@mellanox.com>
> >>>>>
> >>>>> Make sure the device has a complete view of the FIB tables by invoking
> >>>>> their dump during module init.
> >>>>>
> >>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>>>> ---
> >>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
> >>>>>  1 file changed, 23 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> index 14bed1d..d176047 100644
> >>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> >>>>>  	return NOTIFY_DONE;
> >>>>>  }
> >>>>>  
> >>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> >>>>> +{
> >>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> >>>>> +
> >>>>> +	/* Flush pending FIB notifications and then flush the device's
> >>>>> +	 * table before requesting another dump. Do that with RTNL held,
> >>>>> +	 * as FIB notification block is already registered.
> >>>>> +	 */
> >>>>> +	mlxsw_core_flush_owq();
> >>>>> +	rtnl_lock();
> >>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> >>>>> +	rtnl_unlock();
> >>>>> +}
> >>>>> +
> >>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >>>>>  {
> >>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
> >>>>>  	int err;
> >>>>>  
> >>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> >>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >>>>>  
> >>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
> >>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
> >>>>
> >>>> Sorry to pick in here again:
> >>>>
> >>>> There is a race here. You need to protect the registration of the fib
> >>>> notifier as well by the sequence counter. Updates here are not ordered
> >>>> in relation to this code below.
> >>>
> >>> You mean updates that can be received after you registered the notifier
> >>> and until the dump started? I'm aware of that and that's OK. This
> >>> listener should be able to handle duplicates.
> >>
> >> I am not concerned about duplicates, but about ordering deletes and
> >> getting an add from the RCU code you will add the node to hw while it is
> >> deleted in the software path. You probably will ignore the delete
> >> because nothing is installed in hw and later add the node which was
> >> actually deleted but just reordered which happend on another CPU, no?
> > 
> > Are you referring to reordering in the workqueue? We already covered
> > this using an ordered workqueue, which has one context of execution
> > system-wide.
> 
> Ups, sorry, I missed that mail. Probably read it on the mobile phone and
> it became invisible for me later on. Busy day... ;)

Yet another reason not to read emails on your phone ;)

> The reordering in the workqueue seems fine to me and also still necessary.

Correct.

> Basically, if you delete a node right now the kernel might simply do a
> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
> or synchronization with the reader side. Thus you might get a callback
> from the notifier for a delete event on the one CPU and you end up
> queueing this fib entry after the delete queue, because the RCU walk
> isn't protected by any means.
> 
> Looking closer at this series again, I overlooked the fact that you
> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
> orders fetching of fib_seq and thus the RCU dumping after any concurrent
> executing fib table update, also the mutex_lock and unlock provide
> proper acquire and release fences, so the CPU indeed sees the effect of
> a RCU_INIT_POINTER update done on another CPU, because they pair with
> the rtnl_unlock which might happen on the other CPU.

Yep, Exactly. I had a feeling this is the issue you were referring to,
but then you were the one to suggest the use of RTNL, so I was quite
confused.

> My question is if this is a bit of luck and if we should make this
> explicit by putting the registration itself under the protection of the
> sequence counter. I favor the additional protection, e.g. if we some day
> actually we optimize the fib_seq code? Otherwise we might probably
> document this fact. :)

Well, some listeners don't require a dump, but only registration
(rocker) and in the future we might only need a dump (e.g., port being
moved to a different net namespace). So I'm not sure if bundling both
together is a good idea.

Maybe we can keep register_fib_notifier() as-is and add 'bool register'
to fib_notifier_dump() so that when set, 'nb' is also registered after
RCU walk, but before we check if the dump is consistent (unregistered if
inconsistent)?

> >>> I've a follow up patchset that introduces a new event in switchdev
> >>> notification chain called SWITCHDEV_SYNC, which is sent when port
> >>> netdevs are enslaved / released  from a master device (points in time
> >>> where kernel<->device can get out of sync). It will invoke
> >>> re-propagation of configuration from different parts of the stack
> >>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> >>> in duplicates.
> >>
> >> Okay, understood. I wonder how we can protect against accidentally abort
> >> calls actually. E.g. if I start to inject routes into my routing domain
> >> how can I make sure the box doesn't die after I try to insert enough
> >> routes. Do we need to touch quagga etc?
> > 
> > The whole point of moving abort mechanism to the driver is that the
> > system won't die, but instead routing will be done in the kernel. If you
> > respect hardware limitations, then there's no reason for abort mechanism
> > to kick in.
> 
> Quick follow-up question: How can I quickly find out the hw limitations
> via the kernel api?

That's a good question. Currently, you can't. However, we already have a
mechanism in place to read device's capabilities from the firmware and
we can (and should) expose some of them to the user. The best API for
that would be devlink, as it can represent the entire device as opposed
to only a port netdev like other tools.

We're also working on making the pipeline more visible to the user, so
that it would be easier for users to understand and debug their
networks. I believe a colleague of mine (Matty) presented this during
the last netdev conference.

^ permalink raw reply

* [PATCH] netlink: 2-clause nla_ok()
From: Alexey Dobriyan @ 2016-12-02  0:59 UTC (permalink / raw)
  To: davem; +Cc: netdev

nla_ok() consists of 3 clauses:

	1) int rem >= (int)sizeof(struct nlattr)

	2) u16 nla_len >= sizeof(struct nlattr)

	3) u16 nla_len <= int rem

The statement is that clause (1) is redundant.

What it does is ensuring that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.

However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.

NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.

Obligatory space savings report: -1.6 KB

	$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
	add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
	function                                     old     new   delta
	validate_scan_freqs                          142     155     +13
	tcf_em_tree_validate                         867     879     +12
	dcbnl_ieee_del                               328     338     +10
	netlbl_cipsov4_add_common.isra               218     215      -3
		...
	ovs_nla_put_actions                          888     806     -82
	netlbl_cipsov4_add_std                      1648    1566     -82
	nl80211_parse_sched_scan                    2889    2780    -109
	ip_tun_from_nlattr                          3086    2945    -141

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netlink.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -698,8 +698,7 @@ static inline int nla_len(const struct nlattr *nla)
  */
 static inline int nla_ok(const struct nlattr *nla, int remaining)
 {
-	return remaining >= (int) sizeof(*nla) &&
-	       nla->nla_len >= sizeof(*nla) &&
+	return nla->nla_len >= sizeof(*nla) &&
 	       nla->nla_len <= remaining;
 }
 

^ permalink raw reply

* Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Eric Dumazet @ 2016-12-01 23:15 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: netdev, Eric Dumazet, David Miller, Mahesh Bandewar
In-Reply-To: <1480632994-12128-1-git-send-email-mahesh@bandewar.net>

On Thu, 2016-12-01 at 14:56 -0800, Mahesh Bandewar wrote:
> From: Mahesh Bandewar <maheshb@google.com>
> 
> If initial broadcast probe(s) is/are lost, the neigh entry wont have
> valid address of the neighbour. In a situation like this, the fall
> back should be to send a broadcast probe, however the code logic
> continues sending ucast probes to 00:00:00:00:00:00. The default value
> of ucast probes is 3 so system usually recovers after three such probes
> but if the value configured is larger it takes those many probes
> (a probe is sent every second in default config) / seconds to recover
> making machine not-available on the network.
> 
> This patch just ensures that the unicast address is not NULL otherwise
> falls back to sending broadcast probe.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
>  net/ipv4/arp.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index 89a8cac4726a..56fb33d5ed31 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	__be32 saddr = 0;
>  	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
> +	u8 null_dev_hw_addr[MAX_ADDR_LEN];
>  	struct net_device *dev = neigh->dev;
>  	__be32 target = *(__be32 *)neigh->primary_key;
>  	int probes = atomic_read(&neigh->probes);
> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  
>  	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
>  	if (probes < 0) {
> +		memset(&null_dev_hw_addr, 0, dev->addr_len);
>  		if (!(neigh->nud_state & NUD_VALID))
>  			pr_debug("trying to ucast probe in NUD_INVALID\n");
>  		neigh_ha_snapshot(dst_ha, neigh, dev);
> -		dst_hw = dst_ha;
> +		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
> +			dst_hw = dst_ha;
>  	} else {
>  		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
>  		if (probes < 0) {

Why is is an IPv4 specific issue ?
What about IPv6 ?


I would try something in neighbour code, maybe :

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg)
 			neigh_dbg(2, "neigh %p is probed\n", neigh);
 			neigh->nud_state = NUD_PROBE;
 			neigh->updated = jiffies;
-			atomic_set(&neigh->probes, 0);
+			atomic_set(&neigh->probes,
+				   (neigh->output == neigh_blackhole) ?
+					NEIGH_VAR(neigh->parms, UCAST_PROBES) :
+					0);
 			notify = 1;
 			next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME);
 		}

Thanks.

^ permalink raw reply related

* [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: Alexey Dobriyan @ 2016-12-02  1:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

		...
	ptr = ng->ptr[id - 1];
		...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
	function                                     old     new   delta
	tipc_nametbl_insert_publ                    1250    1270     +20
	nlmclnt_lookup_host                          686     703     +17
	nfsd4_encode_fattr                          5930    5941     +11
	nfs_get_client                              1050    1061     +11
	register_pernet_operations                   333     342      +9
	tcf_mirred_init                              843     849      +6
	tcf_bpf_init                                1143    1149      +6
	gss_setup_upcall                             990     994      +4
	idmap_name_to_id                             432     434      +2
	ops_init                                     274     275      +1
	nfsd_inject_forget_client                    259     260      +1
	nfs4_alloc_client                            612     613      +1
	tunnel_key_walker                            164     163      -1

		...

	tipc_bcbase_select_primary                   392     360     -32
	mac80211_hwsim_new_radio                    2808    2767     -41
	ipip6_tunnel_ioctl                          2228    2186     -42
	tipc_bcast_rcv                               715     672     -43
	tipc_link_build_proto_msg                   1140    1089     -51
	nfsd4_lock                                  3851    3796     -55
	tipc_mon_rcv                                1012     956     -56
	Total: Before=156643951, After=156639743, chg -0.00%


Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netns/generic.h |   16 +++++++++-------
 net/core/net_namespace.c    |   20 ++++++++++++--------
 2 files changed, 21 insertions(+), 15 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,12 +25,14 @@
  */
 
 struct net_generic {
-	struct {
-		unsigned int len;
-		struct rcu_head rcu;
-	} s;
-
-	void *ptr[0];
+	union {
+		struct {
+			unsigned int len;
+			struct rcu_head rcu;
+		} s;
+
+		void *ptr[0];
+	};
 };
 
 static inline void *net_generic(const struct net *net, unsigned int id)
@@ -40,7 +42,7 @@ static inline void *net_generic(const struct net *net, unsigned int id)
 
 	rcu_read_lock();
 	ng = rcu_dereference(net->gen);
-	ptr = ng->ptr[id - 1];
+	ptr = ng->ptr[id];
 	rcu_read_unlock();
 
 	return ptr;
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
 
+#define MIN_PERNET_OPS_ID	\
+	((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
+
 #define INITIAL_NET_GEN_PTRS	13 /* +1 for len +2 for rcu_head */
 
 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
@@ -46,7 +49,7 @@ static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
-	size_t generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
+	unsigned int generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
 
 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
@@ -60,12 +63,12 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	struct net_generic *ng, *old_ng;
 
 	BUG_ON(!mutex_is_locked(&net_mutex));
-	BUG_ON(id == 0);
+	BUG_ON(id < MIN_PERNET_OPS_ID);
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	if (old_ng->s.len >= id) {
-		old_ng->ptr[id - 1] = data;
+	if (old_ng->s.len > id) {
+		old_ng->ptr[id] = data;
 		return 0;
 	}
 
@@ -84,8 +87,9 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 * the old copy for kfree after a grace period.
 	 */
 
-	memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*));
-	ng->ptr[id - 1] = data;
+	memcpy(&ng->ptr[MIN_PERNET_OPS_ID], &old_ng->ptr[MIN_PERNET_OPS_ID],
+	       (old_ng->s.len - MIN_PERNET_OPS_ID) * sizeof(void *));
+	ng->ptr[id] = data;
 
 	rcu_assign_pointer(net->gen, ng);
 	kfree_rcu(old_ng, s.rcu);
@@ -874,7 +878,7 @@ static int register_pernet_operations(struct list_head *list,
 
 	if (ops->id) {
 again:
-		error = ida_get_new_above(&net_generic_ids, 1, ops->id);
+		error = ida_get_new_above(&net_generic_ids, MIN_PERNET_OPS_ID, ops->id);
 		if (error < 0) {
 			if (error == -EAGAIN) {
 				ida_pre_get(&net_generic_ids, GFP_KERNEL);
@@ -882,7 +886,7 @@ static int register_pernet_operations(struct list_head *list,
 			}
 			return error;
 		}
-		max_gen_ptrs = max(max_gen_ptrs, *ops->id);
+		max_gen_ptrs = max(max_gen_ptrs, *ops->id + 1);
 	}
 	error = __register_pernet_operations(list, ops);
 	if (error) {

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 23:27 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161201231445.vtx7mjxuxusar3mu@splinter>

On 02.12.2016 00:14, Ido Schimmel wrote:

[...]

>> Basically, if you delete a node right now the kernel might simply do a
>> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
>> or synchronization with the reader side. Thus you might get a callback
>> from the notifier for a delete event on the one CPU and you end up
>> queueing this fib entry after the delete queue, because the RCU walk
>> isn't protected by any means.
>>
>> Looking closer at this series again, I overlooked the fact that you
>> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
>> orders fetching of fib_seq and thus the RCU dumping after any concurrent
>> executing fib table update, also the mutex_lock and unlock provide
>> proper acquire and release fences, so the CPU indeed sees the effect of
>> a RCU_INIT_POINTER update done on another CPU, because they pair with
>> the rtnl_unlock which might happen on the other CPU.
> 
> Yep, Exactly. I had a feeling this is the issue you were referring to,
> but then you were the one to suggest the use of RTNL, so I was quite
> confused.

At that time I actually had in mind that the fib_register would happen
under the sequence lock, so I didn't look closely to the memory barrier
pairings. I kinda still consider this to be a happy accident. ;)

>> My question is if this is a bit of luck and if we should make this
>> explicit by putting the registration itself under the protection of the
>> sequence counter. I favor the additional protection, e.g. if we some day
>> actually we optimize the fib_seq code? Otherwise we might probably
>> document this fact. :)
> 
> Well, some listeners don't require a dump, but only registration
> (rocker) and in the future we might only need a dump (e.g., port being
> moved to a different net namespace). So I'm not sure if bundling both
> together is a good idea.
> 
> Maybe we can keep register_fib_notifier() as-is and add 'bool register'
> to fib_notifier_dump() so that when set, 'nb' is also registered after
> RCU walk, but before we check if the dump is consistent (unregistered if
> inconsistent)?

I really like that. Would you mind adding this?

[...]

>> Quick follow-up question: How can I quickly find out the hw limitations
>> via the kernel api?
> 
> That's a good question. Currently, you can't. However, we already have a
> mechanism in place to read device's capabilities from the firmware and
> we can (and should) expose some of them to the user. The best API for
> that would be devlink, as it can represent the entire device as opposed
> to only a port netdev like other tools.
> 
> We're also working on making the pipeline more visible to the user, so
> that it would be easier for users to understand and debug their
> networks. I believe a colleague of mine (Matty) presented this during
> the last netdev conference.

Thanks, I will look it up!

Bye,
Hannes

^ permalink raw reply

* [PATCH 0/7]  net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko

This series intended to add support for placing CPDMA descriptors into DDR by
introducing new DT property "descs_pool_size" to specify buffer descriptor's
pool size. The "descs_pool_size" defines total number of CPDMA
CPPI descriptors to be used for both ingress/egress packets
processing. If not specified - the default value 256 will be used
which will allow to place descriptor's pool into the internal CPPI
RAM.

This allows significantly to reduce UDP packets drop rate
for bandwidth >301 Mbits/sec (am57x).  

Before enabling this feature, the am437x SoC has to be fixed as it's proved
that it's not working when CPDMA descriptors placed in DDR.
So, the patch 1 fixes this issue.

Grygorii Strashko (7):
  net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr
  net: ethernet: ti: cpdma: fix desc re-queuing
  net: ethernet: ti: cpdma: minimize number of parameters in
    cpdma_desc_pool_create/destroy()
  net: ethernet: ti: cpdma: use devm_ioremap
  Documentation: DT: net: cpsw: allow to specify descriptors pool size
  net: ethernet: ti: cpsw: add support for descs_pool_size dt property
  Documentation: DT: net: cpsw: remove no_bd_ram property

 Documentation/devicetree/bindings/net/cpsw.txt |  8 ++-
 arch/arm/boot/dts/am33xx.dtsi                  |  1 -
 arch/arm/boot/dts/am4372.dtsi                  |  1 -
 arch/arm/boot/dts/dm814x.dtsi                  |  1 -
 arch/arm/boot/dts/dra7.dtsi                    |  1 -
 drivers/net/ethernet/ti/cpsw.c                 |  5 ++
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c        | 90 +++++++++++++++-----------
 drivers/net/ethernet/ti/davinci_cpdma.h        |  1 +
 9 files changed, 63 insertions(+), 46 deletions(-)

-- 
2.10.1

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox