Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [WIP] net+mlx4: auto doorbell
From: Saeed Mahameed @ 2016-11-30 16:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan
In-Reply-To: <1480520661.18162.177.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 5:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
>> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>> >
>> >
>> >> Not sure it this has been tried before, but the doorbell avoidance could
>> >> be done by the driver itself, because it knows a TX completion will come
>> >> shortly (well... if softirqs are not delayed too much !)
>> >>
>> >> Doorbell would be forced only if :
>> >>
>> >> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> >> OR
>> >> ( too many [1] packets were put in TX ring buffer, no point deferring
>> >> more)
>> >>
>> >> Start the pump, but once it is started, let the doorbells being done by
>> >> TX completion.
>> >>
>> >> ndo_start_xmit and TX completion handler would have to maintain a shared
>> >> state describing if packets were ready but doorbell deferred.
>> >>
>> >>
>> >> Note that TX completion means "if at least one packet was drained",
>> >> otherwise busy polling, constantly calling napi->poll() would force a
>> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
>> >>
>> >> But then, maybe busy poll would like to force a doorbell...
>> >>
>> >> I could try these ideas on mlx4 shortly.
>> >>
>> >>
>> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>> >
>> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
>> > not used.
>>
>> Hi Eric, Nice Idea indeed and we need something like this,
>> today we almost don't exploit the TX bulking at all.
>>
>> But please see below, i am not sure different contexts should share
>> the doorbell ringing, it is really risky.
>>
>> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>> >  include/linux/netdevice.h                    |    1
>> >  net/core/net-sysfs.c                         |   18 +++
>> >  5 files changed, 83 insertions(+), 32 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > index 6562f78b07f4..fbea83218fc0 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >
>> >         if (polled) {
>> >                 if (doorbell_pending)
>> > -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
>> > +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>> >
>> >                 mlx4_cq_set_ci(&cq->mcq);
>> >                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > index 4b597dca5c52..affebb435679 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->size = size;
>> >         ring->size_mask = size - 1;
>> >         ring->sp_stride = stride;
>> > -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
>> > +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>> >
>> >         tmp = size * sizeof(struct mlx4_en_tx_info);
>> >         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
>> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->sp_cqn = cq;
>> >         ring->prod = 0;
>> >         ring->cons = 0xffffffff;
>> > +       ring->ncons = 0;
>> >         ring->last_nr_txbb = 1;
>> >         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>> >         memset(ring->buf, 0, ring->buf_size);
>> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>> >                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>> >  }
>> >
>> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
>> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>> >  {
>> > -       return ring->prod - ring->cons > ring->full_size;
>> > +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>> >  }
>> >
>> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
>> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >
>> >         /* Skip last polled descriptor */
>> >         ring->cons += ring->last_nr_txbb;
>> > +       ring->ncons += ring->last_nr_txbb;
>> >         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>> >                  ring->cons, ring->prod);
>> >
>> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >                                                 !!(ring->cons & ring->size), 0,
>> >                                                 0 /* Non-NAPI caller */);
>> >                 ring->cons += ring->last_nr_txbb;
>> > +               ring->ncons += ring->last_nr_txbb;
>> >                 cnt++;
>> >         }
>> >
>> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >         return cnt;
>> >  }
>> >
>> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
>> > +                          struct mlx4_en_tx_ring *ring)
>> > +{
>> > +
>> > +       if (dev->doorbell_opt & 1) {
>> > +               u32 oval = READ_ONCE(ring->prod_bell);
>> > +               u32 nval = READ_ONCE(ring->prod);
>> > +
>> > +               if (oval == nval)
>> > +                       return;
>> > +
>> > +               /* I can not tell yet if a cmpxchg() is needed or not */
>> > +               if (dev->doorbell_opt & 2)
>> > +                       WRITE_ONCE(ring->prod_bell, nval);
>> > +               else
>> > +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
>> > +                               return;
>> > +       }
>> > +       /* Since there is no iowrite*_native() that writes the
>> > +        * value as is, without byteswapping - using the one
>> > +        * the doesn't do byteswapping in the relevant arch
>> > +        * endianness.
>> > +        */
>> > +#if defined(__LITTLE_ENDIAN)
>> > +       iowrite32(
>> > +#else
>> > +       iowrite32be(
>> > +#endif
>> > +                 ring->doorbell_qpn,
>> > +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > +}
>> > +
>> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >                                   struct mlx4_en_cq *cq, int napi_budget)
>> >  {
>> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >         wmb();
>> >
>> >         /* we want to dirty this cache line once */
>> > -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
>> > -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
>> > +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
>> > +       ring_cons += txbbs_skipped;
>> > +       WRITE_ONCE(ring->cons, ring_cons);
>> > +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
>> > +
>> > +       if (dev->doorbell_opt)
>> > +               mlx4_en_xmit_doorbell(dev, ring);
>> >
>> >         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>> >                 return done < budget;
>> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>> >         __iowrite64_copy(dst, src, bytecnt / 8);
>> >  }
>> >
>> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
>> > -{
>> > -       wmb();
>>
>> you missed/removed this "wmb()" in the new function, why ? where did
>> you compensate for it ?
>
> I removed it because I had a cmpxchg() there if the barrier was needed.
>

Cool, so the answer is yes, the barrier is needed in order for the HW
to see the last step of
mlx4_en_tx_write_desc where we write the ownership bit (which means
this descriptor is valid for HW processing).
tx_desc->ctrl.owner_opcode = op_own;

ringing the doorbell without this wmb might cause the HW to miss that
last packet.

> My patch is a WIP, where you can set the bit 2 to ask to replace the
> cmpxchg() by a simple write, only for performance testing/comparisons.
>
>
>>
>> > -       /* Since there is no iowrite*_native() that writes the
>> > -        * value as is, without byteswapping - using the one
>> > -        * the doesn't do byteswapping in the relevant arch
>> > -        * endianness.
>> > -        */
>> > -#if defined(__LITTLE_ENDIAN)
>> > -       iowrite32(
>> > -#else
>> > -       iowrite32be(
>> > -#endif
>> > -                 ring->doorbell_qpn,
>> > -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > -}
>> >
>> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                                   struct mlx4_en_tx_desc *tx_desc,
>> >                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>> >                                   int desc_size, int bf_index,
>> >                                   __be32 op_own, bool bf_ok,
>> > -                                 bool send_doorbell)
>> > +                                 bool send_doorbell,
>> > +                                 const struct net_device *dev, int nr_txbb)
>> >  {
>> >         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>> >
>> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >
>> >                 wmb();
>> >
>> > +               ring->prod += nr_txbb;
>> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>> >                              desc_size);
>> >
>> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                  */
>> >                 dma_wmb();
>> >                 tx_desc->ctrl.owner_opcode = op_own;
>> > +               ring->prod += nr_txbb;
>> >                 if (send_doorbell)
>> > -                       mlx4_en_xmit_doorbell(ring);
>> > +                       mlx4_en_xmit_doorbell(dev, ring);
>> >                 else
>> >                         ring->xmit_more++;
>> >         }
>> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>> >         }
>> >
>> > -       ring->prod += nr_txbb;
>> > -
>> >         /* If we used a bounce buffer then copy descriptor back into place */
>> >         if (unlikely(bounce))
>> >                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >         }
>> >         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>> >
>> > +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
>> > +        * will happen shortly.
>> > +        */
>> > +       if (send_doorbell &&
>> > +           dev->doorbell_opt &&
>> > +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>>
>> Aelexi already expressed his worries about synchronization, and i
>> think here (in this exact line) sits the problem,
>> what about if exactly at this point the TX completion handler just
>> finished and rang the last doorbell,
>> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
>> the last doorbell from the CQ handler missed it.
>> even if you wrote the TX descriptor before the doorbell decision here,
>> you will need a memory barrier to make sure the HW sees
>> the new packet, which was typically done before ringing the doorbell.
>
>
> My patch is a WIP, meaning it is not completed ;)
>
> Surely we can find a non racy way to handle this.

The question is, can it be done without locking ? Maybe ring the
doorbell every N msecs  just in case.

>
>
>> All in all, this is risky business :),  the right way to go is to
>> force the upper layer to use xmit-more and delay doorbells/use bulking
>> but from the same context
>> (xmit routine).  For example see Achiad's suggestion (attached in
>> Jesper's response), he used stop queue to force the stack to queue up
>> packets (TX bulking)
>> which would set xmit-more and will use the next completion to release
>> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
>
>
>
> Well, you depend on having a higher level queue like a qdisc.
>
> Some users do not use a qdisc.
> If you stop the queue, they no longer can send anything -> drops.
>

In this case, i think they should implement their own bulking (pktgen
is not a good example)
but XDP can predict if it has more packets to xmit  as long as all of
them fall in the same NAPI cycle.
Others should try and do the same.

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-11-30 16:32 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <eb7ddecd-d6b4-dfce-5990-9e933227c862@stressinduktion.org>

Hi Hannes,

On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
> On 30.11.2016 11:09, Jiri Pirko wrote:
> > From: Ido Schimmel <idosch@mellanox.com>
> > 
> > Make sure the device has a complete view of the FIB tables by invoking
> > their dump during module init.
> > 
> > Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> > ---
> >  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > index 14bed1d..d176047 100644
> > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> >  	return NOTIFY_DONE;
> >  }
> >  
> > +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> > +{
> > +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> > +
> > +	/* Flush pending FIB notifications and then flush the device's
> > +	 * table before requesting another dump. Do that with RTNL held,
> > +	 * as FIB notification block is already registered.
> > +	 */
> > +	mlxsw_core_flush_owq();
> > +	rtnl_lock();
> > +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> > +	rtnl_unlock();
> > +}
> > +
> >  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >  {
> > +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
> >  	int err;
> >  
> >  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> > @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >  
> >  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
> >  	register_fib_notifier(&mlxsw_sp->fib_nb);
> 
> Sorry to pick in here again:
> 
> There is a race here. You need to protect the registration of the fib
> notifier as well by the sequence counter. Updates here are not ordered
> in relation to this code below.

You mean updates that can be received after you registered the notifier
and until the dump started? I'm aware of that and that's OK. This
listener should be able to handle duplicates.

I've a follow up patchset that introduces a new event in switchdev
notification chain called SWITCHDEV_SYNC, which is sent when port
netdevs are enslaved / released  from a master device (points in time
where kernel<->device can get out of sync). It will invoke
re-propagation of configuration from different parts of the stack
(e.g. bridge driver, 8021q driver, fib/neigh code), which can result
in duplicates.

> I think just move the register notification into the fib_notifier_dump
> function, rename it to fib_notifier_init and use it here:

I separated the two on purpose. For example, rocker only needs to
register notifier, but doesn't need the dump.

> 
> > +	if (!fib_notifier_dump(&mlxsw_sp->fib_nb, &init_net, cb)) {
> > +		err = -EBUSY;
> > +		goto err_fib_notifier_dump;
> > +	}
> > +
> >  	return 0;
> 
> Thanks,
> Hannes
> 

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Joakim Tjernlund @ 2016-11-30 16:35 UTC (permalink / raw)
  To: andrew@lunn.ch; +Cc: netdev@vger.kernel.org
In-Reply-To: <20161130152503.GE21645@lunn.ch>

On Wed, 2016-11-30 at 16:25 +0100, Andrew Lunn wrote:
> On Wed, Nov 30, 2016 at 02:30:43PM +0000, Joakim Tjernlund wrote:
> > On Wed, 2016-11-30 at 14:52 +0100, Andrew Lunn wrote:
> > > On Wed, Nov 30, 2016 at 08:50:34AM +0000, Joakim Tjernlund wrote:
> > > > I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> > > > We are looking att adding a faily large switch(over PCIe) to our board and from what I can tell
> > > > switchdev is the new way to do it but DSA is still there. Is it possible to just list
> > > > how they differ?
> > > 
> > > Hi Joakim
> > 
> > Hi Andrew, thanks for answering
> > 
> > > 
> > > If the interface you use to send frames from the host to the switch is
> > > PCIe, you probably want to use switchdev directly.
> > 
> > OK, we will have a few ethernet I/F's connected too but I these should be used
> > as normal interfaces just interfacing a switch.
> 
> That does not make much sense.
> 
> Maybe time to backtrack a bit. The Linux concept for switch/router
> chips is that they are just hardware accelerators for what Linux can
> already do in software. Each port of the switch is just a normal Linux
> interface.  ip link show will list each port. ip addr add can be used
> to add an IP address to the interface.  You want to switch frames
> between two ports: Create a linux bridge and put the interfaces into
> it. Via switchdev you get a call into the hardware to accelerate
> this. If the hardware cannot accelerate it, it is done in software as
> normal.  Want to combine two ports into a trunk: Add a team interface
> and make the port interfaces slaves of the team interface. Via
> switchdev, you ask the hardware to accelerate this. If it cannot, it
> is done in software.
> 
> So back your connecting a few host interfaces to the switch. This is
> logically putting a cable between two interfaces on the same host. You
> are making a loopback. Why do that? Sure it is possible, but it is an
> odd architecture.

This is an embedded system with several boards in a subrack.
Each board has eth I/F connected to a switch to communicate with each other.
One of the board will also house the actual switch device and manage the switch.
Then the normal app just communicates over the physical eth I/F like any other board
in the system. There is a "manage switch app" which brings the switch up and partition
phys VLANs etc. (each phys I/F would be a a separate domain so no loop)

I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe
instead to save eth MACS but the above is safer should there be some problems/limitations in
swicthdev plus switchdev does not exist in u-boot so it would be a lot of effort to
get a working eth I/F inside u-boot.
I can still can still create a bridge I/F etc. should I need to.
Does the above make sense to you ?
 
> 
> > And switchdev can do all this over PCIe instead? Can you have a
> > switch tree in switchdev too?
> 
> Mellonex says so, but i don't think they have actually implemented it.

Not impl. any of DSAs features? What can you do with a Mellonex switch then?

 Jocke

^ permalink raw reply

* Re: [PATCH net-next 8/8] net/mlx5e: Support adding ingress tc rule when egress device flag is set
From: John Fastabend @ 2016-11-30 16:36 UTC (permalink / raw)
  To: Hadar Hen Zion, David S. Miller
  Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
	Roi Dayan
In-Reply-To: <1480516895-29545-9-git-send-email-hadarh@mellanox.com>

On 16-11-30 06:41 AM, Hadar Hen Zion wrote:
> When ndo_setup_tc is called with an egress_dev flag set, it means that
> the ndo call was executed on the mirred action (egress) device and not
> on the ingress device.
> 
> In order to support this kind of ndo_setup_tc call, and insert the
> correct decap rule to the hardware, the uplink device on the same eswitch
> should be found.
> 
> Currently, we use this resolution between the mirred device and the
> uplink on the same eswitch to offload vxlan shared device decap rules.
> 
> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
> ---

Hi Hadar,

I started to dig through these patches and the last series here,

   Re: [PATCH net-next 00/13] Mellanox 100G SRIOV offloads tunnel_key
set/release

Can you explain how these two are related? I'm guessing in that first
series the actual redirect action to a tunnel device was being ignore?

Does this series clean up that bit of software/hardware alignment.

Thanks,
John


>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 0868677..8503788 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -289,6 +289,14 @@ static int mlx5e_rep_ndo_setup_tc(struct net_device *dev, u32 handle,
>  	if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS))
>  		return -EOPNOTSUPP;
>  
> +	if (tc->egress_dev) {
> +		struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
> +		struct net_device *uplink_dev = mlx5_eswitch_get_uplink_netdev(esw);
> +
> +		return uplink_dev->netdev_ops->ndo_setup_tc(uplink_dev, handle,
> +							    proto, tc);
> +	}
> +
>  	switch (tc->type) {
>  	case TC_SETUP_CLSFLOWER:
>  		switch (tc->cls_flower->command) {
> 

^ permalink raw reply

* Re: [PATCH net 5/7] net: ethernet: stmmac: dwmac-meson8b: fix probe error path
From: Kevin Hilman @ 2016-11-30 16:43 UTC (permalink / raw)
  To: Johan Hovold
  Cc: David S. Miller, Giuseppe Cavallaro, Alexandre Torgue,
	Joachim Eastwood, Carlo Caione, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel
In-Reply-To: <1480516195-27696-6-git-send-email-johan@kernel.org>

Johan Hovold <johan@kernel.org> writes:

> Make sure to disable clocks before returning on late probe errors.
>
> Fixes: 566e82516253 ("net: stmmac: add a glue driver for the Amlogic
> Meson 8b / GXBB DWMAC")
> Signed-off-by: Johan Hovold <johan@kernel.org>

Acked-by: Kevin Hilman <khilman@baylibre.com>

^ permalink raw reply

* Re: qed, qedi patchset submission
From: Martin K. Petersen @ 2016-11-30 16:45 UTC (permalink / raw)
  To: Arun Easi; +Cc: David Miller, Martin K. Petersen, linux-scsi, netdev
In-Reply-To: <alpine.LRH.2.00.1611291452030.28058@mvluser05.qlc.com>

>>>>> "Arun" == Arun Easi <arun.easi@cavium.com> writes:

Arun,

Arun> So far, we have been posting qedi changes split into functional
Arun> blocks, for review, but was not bisectable. With Martin ok to our
Arun> request to squash all patches while committing to tree, we were
Arun> wondering if we should post the qedi patches squashed, with all
Arun> the Reviewed-by added, or continue to post as before?

I guess it depends how things can be split up in a bisectable fashion.

If the net/ pieces can be completely separated from the scsi/ pieces
maybe it would be best to have two patches?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: Lorenzo Colitti @ 2016-11-30 16:46 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <1480442207-43618-1-git-send-email-lorenzo@google.com>

On Tue, Nov 29, 2016 at 9:56 AM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
> protocols.") made __build_flow_key call sock_net(sk) to determine
> the network namespace of the passed-in socket. This crashes if sk
> is NULL.

Since I missed this in the patch description: this is targeted to
net-next (the code it fixes is not in net yet). Also:

Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.")

^ permalink raw reply

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-11-30 16:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480521514.18162.191.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 5:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:08 +0100, Jesper Dangaard Brouer wrote:
>> On Fri, 25 Nov 2016 07:46:20 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > From: Eric Dumazet <edumazet@google.com>
>>
>> Ended up-in net-next as:
>>
>>  commit 40931b85113dad7881d49e8759e5ad41d30a5e6c
>>  Author: Eric Dumazet <edumazet@google.com>
>>  Date:   Fri Nov 25 07:46:20 2016 -0800
>>
>>     mlx4: give precise rx/tx bytes/packets counters
>>
>>     mlx4 stats are chaotic because a deferred work queue is responsible
>>     to update them every 250 ms.
>>
>> Likely after this patch I get this crash (below), when rebooting my machine.
>> Looks like a device removal order thing.
>> Tested with net-next at commit 93ba22225504.
>>
>> [...]
>> [ 1967.248453] mlx5_core 0000:02:00.1: Shutdown was called
>> [ 1967.854556] mlx5_core 0000:02:00.0: Shutdown was called
>> [ 1968.443015] e1000e: EEE TX LPI TIMER: 00000011
>> [ 1968.484676] sd 3:0:0:0: [sda] Synchronizing SCSI cache
>> [ 1968.528354] mlx4_core 0000:01:00.0: mlx4_shutdown was called
>> [ 1968.534054] mlx4_en: mlx4p1: Close port called
>> [ 1968.571156] mlx4_en 0000:01:00.0: removed PHC
>> [ 1968.575677] mlx4_en: mlx4p2: Close port called
>> [ 1969.506602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000d08
>> [ 1969.514530] IP: [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.522963] PGD 0 [ 1969.524803]
>> [ 1969.526332] Oops: 0000 [#1] PREEMPT SMP
>> [ 1969.530201] Modules linked in: coretemp kvm_intel kvm irqbypass intel_cstate mxm_wmi i2c_i801 intel_rapl_perf i2c_smbus sg pcspkr i2c_core shpchp nfsd wmi video acpi_pad auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables mlx4_en e1000e mlx5_core ptp serio_raw sd_mod mlx4_core pps_core devlink hid_generic
>> [ 1969.559616] CPU: 3 PID: 3104 Comm: kworker/3:1 Not tainted 4.9.0-rc6-net-next3-01390-g93ba22225504 #12
>> [ 1969.568984] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
>> [ 1969.578877] Workqueue: events linkwatch_event
>> [ 1969.583285] task: ffff8803f42a0000 task.stack: ffff88040b2d0000
>> [ 1969.589238] RIP: 0010:[<ffffffffa0127ca4>]  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.600102] RSP: 0018:ffff88040b2d3bd8  EFLAGS: 00010282
>> [ 1969.605442] RAX: ffff8803f432efc8 RBX: ffff8803f4320000 RCX: 0000000000000000
>> [ 1969.612604] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8803f4320000
>> [ 1969.619772] RBP: ffff88040b2d3bd8 R08: 000000000000000c R09: ffff8803f432f000
>> [ 1969.626938] R10: 0000000000000000 R11: ffff88040d64ac00 R12: ffff8803e5aff8dc
>> [ 1969.634104] R13: ffff8803f4320a28 R14: ffff8803e5aff800 R15: 0000000000000000
>> [ 1969.641273] FS:  0000000000000000(0000) GS:ffff88041fac0000(0000) knlGS:0000000000000000
>> [ 1969.649422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1969.655197] CR2: 0000000000000d08 CR3: 0000000001c07000 CR4: 00000000001406e0
>> [ 1969.662366] Stack:
>> [ 1969.664412]  ffff88040b2d3be8 ffffffffa0127f8e ffff88040b2d3c10 ffffffffa012a23b
>> [ 1969.671948]  ffff8803e5aff8dc ffff8803f4320000 ffff8803f4320000 ffff88040b2d3c30
>> [ 1969.679478]  ffffffff8160ae29 ffff8803e5aff8d8 ffff8804088ff300 ffff88040b2d3c58
>> [ 1969.687001] Call Trace:
>> [ 1969.689484]  [<ffffffffa0127f8e>] mlx4_en_fold_software_stats+0x1e/0x20 [mlx4_en]
>> [ 1969.697026]  [<ffffffffa012a23b>] mlx4_en_get_stats64+0x2b/0x50 [mlx4_en]
>> [ 1969.703844]  [<ffffffff8160ae29>] dev_get_stats+0x39/0xa0
>> [ 1969.709274]  [<ffffffff81622470>] rtnl_fill_stats+0x40/0x130
>> [ 1969.714968]  [<ffffffff8162631b>] rtnl_fill_ifinfo+0x55b/0x1010
>> [ 1969.720921]  [<ffffffff816285d3>] rtmsg_ifinfo_build_skb+0x73/0xd0
>> [ 1969.727136]  [<ffffffff81628646>] rtmsg_ifinfo.part.25+0x16/0x50
>> [ 1969.733176]  [<ffffffff81628698>] rtmsg_ifinfo+0x18/0x20
>> [ 1969.738522]  [<ffffffff8160e947>] netdev_state_change+0x47/0x50
>> [ 1969.744478]  [<ffffffff81629018>] linkwatch_do_dev+0x38/0x50
>> [ 1969.750170]  [<ffffffff81629257>] __linkwatch_run_queue+0xe7/0x160
>> [ 1969.756385]  [<ffffffff816292f5>] linkwatch_event+0x25/0x30
>> [ 1969.761991]  [<ffffffff8107b3cb>] process_one_work+0x15b/0x460
>> [ 1969.767857]  [<ffffffff8107b71e>] worker_thread+0x4e/0x480
>> [ 1969.773378]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
>> [ 1969.779420]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
>> [ 1969.785460]  [<ffffffff810811ea>] kthread+0xca/0xe0
>> [ 1969.790372]  [<ffffffff81081120>] ? kthread_worker_fn+0x120/0x120
>> [ 1969.796495]  [<ffffffff817302d2>] ret_from_fork+0x22/0x30
>> [ 1969.801924] Code: 00 00 55 48 89 e5 85 d2 0f 84 90 00 00 00 83 ea 01 31 c9 31 f6 48 8d 87 c0 ef 00 00 4c 8d 8c d7 c8 ef 00 00 48 8b 10 48 83 c0 08 <4c> 8b 82 08 0d 00 00 48 8b 92 00 0d 00 00 4c 01 c6 48 01 d1 4c
>> [ 1969.821969] RIP  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.830486]  RSP <ffff88040b2d3bd8>
>> [ 1969.834002] CR2: 0000000000000d08
>> [ 1969.837440] ---[ end trace 80b9fbc1e7baed9b ]---
>> [ 1969.842102] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 1969.848520] Kernel Offset: disabled
>> [ 1969.852050] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>> (END)
>
> Hi Jesper.
>
> Thanks for the report.
>
> Then we have a bug in the driver, deleting some memory too soon.

No ! it always been this way, the cached stats are always there (never deleted).
we just stop caching once the device is down, nothing is deleted too soon.

>
> If we depend on having proper stats at device dismantle, we need to keep

we had/still have the proper stats they are the ones that
mlx4_en_fold_software_stats is trying to cache into  (they always
exist),
but the ones that you are trying to read from (the mlx4 rings) are gone !

This bug is totally new and as i warned, this is another symptom of
the real root cause (can't sleep while reading stats).

Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
always iterate over all of them to query stats ?
what if you have one ring/none/1K ? how would you know how many to query ?

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-11-30 16:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>

On 30.11.2016 17:32, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>> From: Ido Schimmel <idosch@mellanox.com>
>>>
>>> Make sure the device has a complete view of the FIB tables by invoking
>>> their dump during module init.
>>>
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> index 14bed1d..d176047 100644
>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>  	return NOTIFY_DONE;
>>>  }
>>>  
>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>> +{
>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>> +
>>> +	/* Flush pending FIB notifications and then flush the device's
>>> +	 * table before requesting another dump. Do that with RTNL held,
>>> +	 * as FIB notification block is already registered.
>>> +	 */
>>> +	mlxsw_core_flush_owq();
>>> +	rtnl_lock();
>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>> +	rtnl_unlock();
>>> +}
>>> +
>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  {
>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>  	int err;
>>>  
>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  
>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>
>> Sorry to pick in here again:
>>
>> There is a race here. You need to protect the registration of the fib
>> notifier as well by the sequence counter. Updates here are not ordered
>> in relation to this code below.
> 
> You mean updates that can be received after you registered the notifier
> and until the dump started? I'm aware of that and that's OK. This
> listener should be able to handle duplicates.

I am not concerned about duplicates, but about ordering deletes and
getting an add from the RCU code you will add the node to hw while it is
deleted in the software path. You probably will ignore the delete
because nothing is installed in hw and later add the node which was
actually deleted but just reordered which happend on another CPU, no?

> I've a follow up patchset that introduces a new event in switchdev
> notification chain called SWITCHDEV_SYNC, which is sent when port
> netdevs are enslaved / released  from a master device (points in time
> where kernel<->device can get out of sync). It will invoke
> re-propagation of configuration from different parts of the stack
> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> in duplicates.

Okay, understood. I wonder how we can protect against accidentally abort
calls actually. E.g. if I start to inject routes into my routing domain
how can I make sure the box doesn't die after I try to insert enough
routes. Do we need to touch quagga etc?

Thanks,
Hannes

^ permalink raw reply

* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: John Fastabend @ 2016-11-30 16:50 UTC (permalink / raw)
  To: Jakub Kicinski, Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161130143031.2fc64ab4@jkicinski-Precision-T1700>

On 16-11-30 06:30 AM, Jakub Kicinski wrote:
> [add MST]
> 

Thanks sorry MST. I did a cut'n'paste of an old list of CC's and missed
you were not on the list.

[...]

>> +	memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
>> +	while (--num_buf) {
>> +		unsigned int buflen;
>> +		unsigned long ctx;
>> +		void *buf;
>> +		int off;
>> +
>> +		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &buflen);
>> +		if (unlikely(!ctx))
>> +			goto err_buf;
>> +
>> +		buf = mergeable_ctx_to_buf_address(ctx);
>> +		p = virt_to_head_page(buf);
>> +		off = buf - page_address(p);
>> +
>> +		memcpy(page_address(page) + page_off,
>> +		       page_address(p) + off, buflen);
>> +		page_off += buflen;
> 
> Could malicious user potentially submit a frame bigger than MTU?

Well presumably if the MTU is greater than PAGE_SIZE the xdp program
would not have been loaded. And the malicious user in this case would
have to be qemu which seems like everything is already lost if qemu
is trying to attack its VM.

But this is a good point because it looks like there is nothing in
virtio or qemu that drops frames with MTU greater than the virtio
configured setting. Maybe Michael can confirm this or I'll poke at it
more. I think qemu should drop these frames in general.

So I think adding a guard here is sensible I'll go ahead and do that.
Also the MTU guard at set_xdp time needs to account for header length.

Thanks nice catch.

> 
>> +	}
>> +
>> +	*len = page_off;
>> +	return page;
>> +err_buf:
>> +	__free_pages(page, 0);
>> +	return NULL;
>> +}
>> +
>>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  					 struct virtnet_info *vi,
>>  					 struct receive_queue *rq,
>> @@ -469,21 +519,37 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	rcu_read_lock();
>>  	xdp_prog = rcu_dereference(rq->xdp_prog);
>>  	if (xdp_prog) {
>> +		struct page *xdp_page;
>>  		u32 act;
>>  
>>  		if (num_buf > 1) {
>>  			bpf_warn_invalid_xdp_buffer();
>> -			goto err_xdp;
>> +
>> +			/* linearize data for XDP */
>> +			xdp_page = xdp_linearize_page(rq, num_buf,
>> +						      page, offset, &len);
>> +			if (!xdp_page)
>> +				goto err_xdp;
>> +			offset = len;
>> +		} else {
>> +			xdp_page = page;
>>  		}
>>  
>> -		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
>> +		act = do_xdp_prog(vi, xdp_prog, xdp_page, offset, len);
>>  		switch (act) {
>>  		case XDP_PASS:
>> +			if (unlikely(xdp_page != page))
>> +				__free_pages(xdp_page, 0);
>>  			break;
>>  		case XDP_TX:
>> +			if (unlikely(xdp_page != page))
>> +				goto err_xdp;
>> +			rcu_read_unlock();
> 
> Only if there is a reason for v4 - this unlock could go to the previous
> patch.
> 

Sure will do this.

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Andrew Lunn @ 2016-11-30 16:55 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev@vger.kernel.org
In-Reply-To: <1480523716.3563.144.camel@infinera.com>

> This is an embedded system with several boards in a subrack.
> Each board has eth I/F connected to a switch to communicate with each other.
> One of the board will also house the actual switch device and manage the switch.
> Then the normal app just communicates over the physical eth I/F like any other board
> in the system. There is a "manage switch app" which brings the switch up and partition
> phys VLANs etc. (each phys I/F would be a a separate domain so no loop)

So you are planning on throwing away the "manage switch app", and just
use standard linux networking commands? That is what switchdev is all
about really, throwing away the vendor SDK for the switch, making a
switch just a bunch on interfaces on the host which you manage as
normal interfaces.

> I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe

No need to create this interface. It will exist if you go the
switchdev route.

> > > And switchdev can do all this over PCIe instead? Can you have a
> > > switch tree in switchdev too?
> > 
> > Mellonex says so, but i don't think they have actually implemented it.
> 
> Not impl. any of DSAs features? What can you do with a Mellonex switch then?

They don't have a tree of switches, as far as i know. Just a single
switch. But DSA does support a tree of switches, that is what the D in
DSA means, distributed. And there are a couple of boards which have 2
to 4 switches in a tree.

I think this is partially down to market segments. Mellonex market is
top of rack switches. High port count, very high bandwidth. DSA is
more wireless access points, set top boxes, generally up to 7 ports of
1Gbps and a few custom embedded products which need more ports, so
build a tree of switches.

      Andrew

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: John Fastabend @ 2016-11-30 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Thomas Graf, davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130053735.GB31581@ast-mbp.thefacebook.com>

On 16-11-29 09:37 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
>> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>>>> Registers new BPF program types which correspond to the LWT hooks:
>>>>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>>>>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>>>>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>>>
>>>> The separate program types are required to differentiate between the
>>>> capabilities each LWT hook allows:
>>>>
>>>>  * Programs attached to dst_input() or dst_output() are restricted and
>>>>    may only read the data of an skb. This prevent modification and
>>>>    possible invalidation of already validated packet headers on receive
>>>>    and the construction of illegal headers while the IP headers are
>>>>    still being assembled.
>>>>
>>>>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>>>>    content as well as prepending an L2 header via a newly introduced
>>>>    helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>>>>    after the IP header has been assembled completely.
>>>>
>>>> All BPF programs receive an skb with L3 headers attached and may return
>>>> one of the following error codes:
>>>>
>>>>  BPF_OK - Continue routing as per nexthop
>>>>  BPF_DROP - Drop skb and return EPERM
>>>>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>>>>                 (Only valid in lwtunnel_xmit() context)
>>>>
>>>> The return codes are binary compatible with their TC_ACT_
>>>> relatives to ease compatibility.
>>>>
>>>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
>>> ...
>>>> +#define LWT_BPF_MAX_HEADROOM 128
>>>
>>> why 128?
>>> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>>>
>>
>> hopefully not too off-topic but for XDP I would like to see this get
> 
> definitely off-topic. lwt->headroom is existing concept. Too late
> to do anything about it.
> 
>> passed down with the program. It would be more generic and drivers could
>> configure the headroom on demand and more importantly verify that a
>> program pushing data is not going to fail at runtime.
> 
> For xdp I think it will be problematic, since we'd have to check for
> that value at prog array access to make sure tailcalls are not broken.
> Mix and match won't be possible.
> So what does 'configure the headroom on demand' buys us?
> Isn't it much easier to tell all drivers "always reserve this much" ?
> We burn the page anyway.
> If it's configurable per driver, then we'd need an api
> to retrieve it. Yet the program author doesn't care what the value is.
> If program needs to do udp encap, it will try do it. No matter what.
> If xdp_adjust_head() helper fails, the program will likely decide
> to drop the packet. In some cases it may decide to punt to stack
> for further processing, but for high performance dataplane code
> it's highly unlikely.
> If it's configurable to something that is not cache line boundary
> hw dma performance may suffer and so on.
> So I see only cons in such 'configurable headroom' and propose
> to have fixed 256 bytes headroom for XDP
> which is enough for any sensible encap and metadata.
> 

OK I'm convinced let it be fixed at some conservative value.

^ permalink raw reply

* Re: BUG() can be hit in tcp_collapse()
From: Vladis Dronov @ 2016-11-30 17:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, stable, Marco Grassi
In-Reply-To: <1716309808.12143903.1478869689618.JavaMail.zimbra@redhat.com>

Hello, Eric, Marco, all,

This is JFYI and a follow-up message.

A further investigation was made to find out the Linux kernel commit which has
introduced the flaw. It appeared that previous Linux kernel versions are vulnerable,
down to v3.6-rc1. This fact was hidden by 'net.ipv4.tcp_fastopen' set to 0 by default,
and now it is easier to notice since kernel v3.12 due to commit 0d41cca490 where the
default was changed to 1. With 'net.ipv4.tcp_fastopen' set to 1, previous Linux
kernels (including RHEL-7 ones) are also vulnerable.

The bug is here since tcp-fastopen feature was introduced in kernel v3.6-rc1, the first
commit when the reproducer starts to panic the kernel with net.ipv4.tcp_fastopen=1 set
is cf60af03ca, which is a part of commit sequence 2100c8d2d9..67da22d23f introducing
net-tcp-fastopen feature:

$ git bisect bad cf60af03ca4e71134206809ea892e49b92a88896
cf60af03ca4e71134206809ea892e49b92a88896 is the first bad commit
commit cf60af03ca4e71134206809ea892e49b92a88896
Author: Yuchung Cheng <ycheng@google.com>
Date:   Thu Jul 19 06:43:09 2012 +0000

So, ideally, the upstream commit ac6e780070 which fixes the bug should have
"Fixes: cf60af03ca" statement, unfortunately, this investigation was not completed at
the time the patch was accepted upstream. And unfortunately I do not see other way
to add this information except making notes in a comment in the related code, which
seems weird.

Best regards,
Vladis Dronov | Red Hat, Inc. | Product Security Engineer

^ permalink raw reply

* Re: [PATCH v2] vxlan: fix a potential issue when create a new vxlan fdb entry.
From: David Miller @ 2016-11-30 17:03 UTC (permalink / raw)
  To: yanhaishuang; +Cc: jbenc, hannes, pshelar, netdev, linux-kernel
In-Reply-To: <1480384776-8252-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Tue, 29 Nov 2016 09:59:36 +0800

> vxlan_fdb_append may return error, so add the proper check,
> otherwise it will cause memory leak.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> 
> Changes in v2:
>   - Unnecessary to initialize rc to zero.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 8/8] net/mlx5e: Support adding ingress tc rule when egress device flag is set
From: Or Gerlitz @ 2016-11-30 17:10 UTC (permalink / raw)
  To: John Fastabend
  Cc: Hadar Hen Zion, David S. Miller, Linux Netdev List,
	Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz, Roi Dayan
In-Reply-To: <583F0007.3090507@gmail.com>

On Wed, Nov 30, 2016 at 6:36 PM, John Fastabend
<john.fastabend@gmail.com> wrote:

> I started to dig through these patches and the last series here,
>
>    Re: [PATCH net-next 00/13] Mellanox 100G SRIOV offloads tunnel_key
> set/release

> Can you explain how these two are related? I'm guessing in that first
> series the actual redirect action to a tunnel device was being ignore?

> Does this series clean up that bit of software/hardware alignment.

If only the 1st series is applied, the kernel pieces are functional
but with user-space
manually putting the decap (tunnel key release) TC rule on the uplink device
(tc tool -> UAPI --> tc core --> flower --> mlx5), but under real life
use-case, the rule
is put on shared-tunnel device, so this series adds on the prev one.

Or.

^ permalink raw reply

* Re: qed, qedi patchset submission
From: Arun Easi @ 2016-11-30 17:15 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: David Miller, linux-scsi, netdev
In-Reply-To: <yq1oa0wj4qy.fsf@sermon.lab.mkp.net>

Thanks for the response, Martin.

On Wed, 30 Nov 2016, 8:45am, Martin K. Petersen wrote:

> >>>>> "Arun" == Arun Easi <arun.easi@cavium.com> writes:
> 
> Arun,
> 
> Arun> So far, we have been posting qedi changes split into functional
> Arun> blocks, for review, but was not bisectable. With Martin ok to our
> Arun> request to squash all patches while committing to tree, we were
> Arun> wondering if we should post the qedi patches squashed, with all
> Arun> the Reviewed-by added, or continue to post as before?
> 
> I guess it depends how things can be split up in a bisectable fashion.

These were the patches that was sent:
  1 qed: Add support for hardware offloaded iSCSI.
  2 qed: Add iSCSI out of order packet handling.
  3 qedi: Add QLogic FastLinQ offload iSCSI driver framework.
  4 qedi: Add LL2 iSCSI interface for offload iSCSI.
  5 qedi: Add support for iSCSI session management.
  6 qedi: Add support for data path.

> 
> If the net/ pieces can be completely separated from the scsi/ pieces
> maybe it would be best to have two patches?
> 

Yes, those pieces can be completely separated; there are actually 3 parts, 
2 that goes to net/ and 1 to scsi/.

In the list above, 1 & 2 goes to net/ and are bisectable. 3 through 6 are 
scsi/ pieces, which are the ones requested for collapsing.

We will post these pieces for V3, then:

  1 [PATCH v3 net-next 1/3] qed: Add support for hardware offloaded iSCSI.
  2 [PATCH v3 net-next 2/3] qed: Add iSCSI out of order packet handling.
  3 [PATCH v3 3/3] qedi: Add QLogic FastLinQ offload iSCSI driver framework.

Regards,
-Arun

^ permalink raw reply

* Re: [Patch net-next] audit: remove useless synchronize_net()
From: Cong Wang @ 2016-11-30 17:20 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Linux Kernel Network Developers, linux-audit, Paul Moore
In-Reply-To: <20161130091643.GA32562@madcap2.tricolour.ca>

On Wed, Nov 30, 2016 at 1:16 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2016-11-29 09:14, Cong Wang wrote:
>> netlink kernel socket is protected by refcount, not RCU.
>> Its rcv path is neither protected by RCU. So the synchronize_net()
>> is just pointless.
>
> If I understand correctly, xfrm_user_net_exit() usage of
> RCU_INIT_POINTER() and synchronize_net() is similarly pointless?  Also
> net/phonet/socket.c?  I probably modelled things based on the former...

Possibly, but xfrm case is slightly different, it has two copies of the pointer
to the netlink socket, also it uses exit_batch(). I need to double check.

Take a look at better examples, fib_front, genetlink, rtnetlink.

^ permalink raw reply

* Re: [PATCH net-next V2 1/7] net/mlx5e: Implement Fragmented Work Queue (WQ)
From: Sebastian Ott @ 2016-11-30 17:26 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Tariq Toukan, Or Gerlitz, Roi Dayan
In-Reply-To: <1480521583-12755-2-git-send-email-saeedm@mellanox.com>

Hi,

On Wed, 30 Nov 2016, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@mellanox.com>
> 
> Add new type of struct mlx5_frag_buf which is used to allocate fragmented
> buffers rather than contiguous, and make the Completion Queues (CQs) use
> it as they are big (default of 2MB per CQ in Striding RQ).
> 
> This fixes the failures of type:
> "mlx5e_open_locked: mlx5e_open_channels failed, -12"
> due to dma_zalloc_coherent insufficient contiguous coherent memory to
> satisfy the driver's request when the user tries to setup more or larger
> rings.

Thanks for that patch! I can confirm that this fixes the lage allocation
issue.

Regards,
Sebastian

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-11-30 17:28 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan
In-Reply-To: <CALzJLG9iCWkk2WTGrkFhZ61Sx5R-JDCxANhLQZPO8E47QGwOmQ@mail.gmail.com>

On Wed, 2016-11-30 at 18:27 +0200, Saeed Mahameed wrote:

> 
> In this case, i think they should implement their own bulking (pktgen
> is not a good example)
> but XDP can predict if it has more packets to xmit  as long as all of
> them fall in the same NAPI cycle.


> Others should try and do the same.

We can not trust user space to signal us when it is the good time to
perform a doorbell.

trafgen is using af_packet with qdisc bypass for example.

^ permalink raw reply

* Re: [PATCH 1/6] net: ethernet: ti: netcp: add support of cpts
From: Grygorii Strashko @ 2016-11-30 17:31 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA, Mugunthan V N,
	Sekhar Nori, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, Rob Herring,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Murali Karicheri, Wingman Kwok
In-Reply-To: <20161130094441.GB28680-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>



On 11/30/2016 03:44 AM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:04:23PM -0600, Grygorii Strashko wrote:
>> @@ -678,6 +744,9 @@ struct gbe_priv {
>>  	int				num_et_stats;
>>  	/*  Lock for updating the hwstats */
>>  	spinlock_t			hw_stats_lock;
>> +
>> +	int                             cpts_registered;
> 
> The usage of this counter is racy.
> 
>> +	struct cpts                     *cpts;
>>  };
> 
> This ++ and -- business ...
> 
>> +static void gbe_register_cpts(struct gbe_priv *gbe_dev)
>> +{
>> +	if (!gbe_dev->cpts)
>> +		return;
>> +
>> +	if (gbe_dev->cpts_registered > 0)
>> +		goto done;
>> +
>> +	if (cpts_register(gbe_dev->cpts)) {
>> +		dev_err(gbe_dev->dev, "error registering cpts device\n");
>> +		return;
>> +	}
>> +
>> +done:
>> +	++gbe_dev->cpts_registered;
>> +}
>> +
>> +static void gbe_unregister_cpts(struct gbe_priv *gbe_dev)
>> +{
>> +	if (!gbe_dev->cpts || (gbe_dev->cpts_registered <= 0))
>> +		return;
>> +
>> +	if (--gbe_dev->cpts_registered)
>> +		return;
>> +
>> +	cpts_unregister(gbe_dev->cpts);
>> +}
> 
> is invoked from your open() and close() methods, but those methods
> are not serialized among multiple ports.
> 

ok. Seems my assumption that ndo_open/ndo_close serialized by rtnl_lock is incorrect. Right?
net_device_ops.ndo_open ->
 netcp_ndo_open
 gbe_open
 gbe_register_cpts

-- 
regards,
-grygorii
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: BUG() can be hit in tcp_collapse()
From: Eric Dumazet @ 2016-11-30 17:33 UTC (permalink / raw)
  To: Vladis Dronov; +Cc: netdev, stable, Marco Grassi
In-Reply-To: <1418136049.827916.1480525217226.JavaMail.zimbra@redhat.com>

On Wed, 2016-11-30 at 12:00 -0500, Vladis Dronov wrote:
> Hello, Eric, Marco, all,
> 
> This is JFYI and a follow-up message.
> 
> A further investigation was made to find out the Linux kernel commit which has
> introduced the flaw. It appeared that previous Linux kernel versions are vulnerable,
> down to v3.6-rc1. This fact was hidden by 'net.ipv4.tcp_fastopen' set to 0 by default,
> and now it is easier to notice since kernel v3.12 due to commit 0d41cca490 where the
> default was changed to 1. With 'net.ipv4.tcp_fastopen' set to 1, previous Linux
> kernels (including RHEL-7 ones) are also vulnerable.
> 
> The bug is here since tcp-fastopen feature was introduced in kernel v3.6-rc1, the first
> commit when the reproducer starts to panic the kernel with net.ipv4.tcp_fastopen=1 set
> is cf60af03ca, which is a part of commit sequence 2100c8d2d9..67da22d23f introducing
> net-tcp-fastopen feature:
> 
> $ git bisect bad cf60af03ca4e71134206809ea892e49b92a88896
> cf60af03ca4e71134206809ea892e49b92a88896 is the first bad commit
> commit cf60af03ca4e71134206809ea892e49b92a88896
> Author: Yuchung Cheng <ycheng@google.com>
> Date:   Thu Jul 19 06:43:09 2012 +0000
> 
> So, ideally, the upstream commit ac6e780070 which fixes the bug should have
> "Fixes: cf60af03ca" statement, unfortunately, this investigation was not completed at
> the time the patch was accepted upstream. And unfortunately I do not see other way
> to add this information except making notes in a comment in the related code, which
> seems weird.

Well, the crash can happen way before Yuchung patch.

It is a 0-day bug.

^ permalink raw reply

* Re: [PATCH 2/6] net: ethernet: ti: cpts: add support for ext rftclk selection
From: Grygorii Strashko @ 2016-11-30 17:35 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA, Mugunthan V N,
	Sekhar Nori, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, Rob Herring,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Murali Karicheri, Wingman Kwok
In-Reply-To: <20161130095632.GC28680-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>



On 11/30/2016 03:56 AM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:04:24PM -0600, Grygorii Strashko wrote:
>> Some CPTS instances, which can be found on KeyStone 2 1/10G Ethernet
>> Switch Subsystems, can control an external multiplexer that selects
>> one of up to 32 clocks for time sync reference (RFTCLK). This feature
>> can be configured through CPTS_RFTCLK_SEL register (offset: x08).
>>
>> Hence, introduce optional DT cpts_rftclk_sel poperty wich, if present,
>> will specify CPTS reference clock. The cpts_rftclk_sel should be
>> omitted in DT if HW doesn't support this feature. The external fixed
>> rate clocks can be defined in board files as "fixed-clock".
> 
> Can't you implement this using the clock tree, rather than an ad-hoc
> DT property?
> 

I've thought about this, but decided to move forward with this impl
 which is pretty simple. I will try.


-- 
regards,
-grygorii
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-11-30 17:35 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <CALzJLG8pcuNo4uD3+dVms1gHT9=WfZAM1DDp_J10NgBZ7YZnEA@mail.gmail.com>

On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:

> we had/still have the proper stats they are the ones that
> mlx4_en_fold_software_stats is trying to cache into  (they always
> exist),
> but the ones that you are trying to read from (the mlx4 rings) are gone !
> 
> This bug is totally new and as i warned, this is another symptom of
> the real root cause (can't sleep while reading stats).
> 
> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
> always iterate over all of them to query stats ?
> what if you have one ring/none/1K ? how would you know how many to query ?

I am suggesting I will fix the bug I introduced.

Do not panic.

^ permalink raw reply

* Re: [PATCH net 1/2] Set skb->protocol properly before calling dst_output()
From: David Miller @ 2016-11-30 17:38 UTC (permalink / raw)
  To: elicooper; +Cc: netdev, eric.dumazet
In-Reply-To: <20161129023529.17645-1-elicooper@gmx.com>

From: Eli Cooper <elicooper@gmx.com>
Date: Tue, 29 Nov 2016 10:35:28 +0800

> When xfrm is applied to TSO/GSO packets, it follows this path:
> 
>     xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> where skb_gso_segment() relies on skb->protocol to function properly.
> 
> This patch sets skb->protocol properly before dst_output() is called,
> fixing a bug where GSO packets sent through a sit or ipip6 tunnel are
> dropped when xfrm is involved.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Eli Cooper <elicooper@gmx.com>
> ---
>  net/ipv4/ip_output.c   | 4 +++-
>  net/ipv6/output_core.c | 4 +++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 105908d..0180e44 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -117,8 +117,10 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
>  	int err;
>  
>  	err = __ip_local_out(net, sk, skb);
> -	if (likely(err == 1))
> +	if (likely(err == 1)) {
> +		skb->protocol = htons(ETH_P_IP);
>  		err = dst_output(net, sk, skb);
> +	}
>  

__ip_local_out() potentially does a dst_output() call too via the
netfilter hook, so you definitely need to place the skb->protocol
assignment before that netfilter hook.

^ permalink raw reply

* Re: [net-next] macvtap: replace printk with netdev_err
From: David Miller @ 2016-11-30 17:41 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev, jasowang
In-Reply-To: <1480389992-69609-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Tue, 29 Nov 2016 11:26:32 +0800

> This patch replaces printk() with netdev_err() for macvtap device.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox