Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Cong Wang @ 2017-09-27 20:23 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: Linux Kernel Network Developers, intel-wired-lan,
	Jamal Hadi Salim, Jiri Pirko, andre.guedes, ivan.briano,
	jesus.sanchez-palencia, boon.leong.ong, richardcochran, henrik
In-Reply-To: <20170926233916.11774-3-vinicius.gomes@intel.com>

On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
> +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
> +{
> +       struct cbs_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +
> +       if (!opt)
> +               return -EINVAL;
> +
> +       /* FIXME: this means that we can only install this qdisc
> +        * "under" mqprio. Do we need a more generic way to retrieve
> +        * the queue, or do we pass the netdev_queue to the driver?
> +        */
> +       q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
> +
> +       return cbs_change(sch, opt);
> +}

Yeah it is ugly to assume its parent is mqprio, at least you should
error out if it is not the case.

I am not sure how we can solve this elegantly, perhaps you should
extend mqprio rather than add a new one?

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Andrew Lunn @ 2017-09-27 20:18 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Vivien Didelot, David Miller, netdev
In-Reply-To: <465be6d8-67d0-d176-1252-abb222bf0528@gmail.com>

On Wed, Sep 27, 2017 at 12:33:29PM -0700, Florian Fainelli wrote:
> On 09/27/2017 12:19 PM, Vivien Didelot wrote:
> > Hi Andrew, Florian,
> > 
> > Andrew Lunn <andrew@lunn.ch> writes:
> > 
> >> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
> >> enabled. Any change to enable hardware flooding needs careful testing
> >> for lots of different configurations. This is another reason i don't
> >> want to do it at the DSA level, until we have a good understanding
> >> what it means in each individual driver.
> > 
> > Then if we are worried about how broadcast flooding is handled on
> > different switches, we can provide a new .flood_broadcast(ds, vid)
> > switch operation for the drivers to implement.
> 
> We don't really have a good visibility on the number of switches
> requiring special configuration for broadcast addresses nor how this
> would have to happen so it would be a tad difficult to define an
> appropriate API with a single user.

Yes, i agree with this. We should wait before adding a generic
solution. I want to wait until a few drivers do whatever is needed for
hardware broadcast. We can then see what is common, and what is
different, find an API to suit and do some refactoring.

	   Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/6] net: dsa: {e}dsa: set offload_fwd_mark on received packets
From: Andrew Lunn @ 2017-09-27 20:11 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: David Miller, Vivien Didelot, netdev
In-Reply-To: <1c4ba973-91c0-ae81-1c4b-d0281f5c7517@gmail.com>

On Wed, Sep 27, 2017 at 12:46:35PM -0700, Florian Fainelli wrote:
> On 09/26/2017 03:25 PM, Andrew Lunn wrote:
> > The software bridge needs to know if a packet has already been bridged
> > by hardware offload to ports in the same hardware offload, in order
> > that it does not re-flood them, causing duplicates. This is
> > particularly true for broadcast and multicast traffic which the host
> > has requested.
> > 
> > By setting offload_fwd_mark in the skb the bridge will only flood to
> > ports in other offloads and other netifs. Set this flag in the DSA and
> > EDSA tag driver.
> 
> Is not there some kind of forwarding code/reason code being provided in
> the EDSA/DSA tag that tell you why this packet was sent to the CPU in
> the first place?

Hi Florian

There are some codes, but nothing specific to broadcast, or ATU
misses. I'm also trying to keep the code generic so it could be a
template for other drivers. Many of the tagging schemes don't provide
a reason code. So i want that any frame that comes from the switch has
no need to go back to the switch. KISS.

> What is the impact on non-broadcast traffic, e.g: multicast and unicast?

The bridge uses this flag when flooding. unicast traffic from the
switch should not need flooding. Either it is known in the switch and
hence won't be forwarded to the host, or it is unknown in the switch,
so it probably is on some other interface.

My testing with multicast has not shown issues. The switch pushes down
mdb entries, which causes frames to be replicated out ports. So again,
there should not be a need to pass the frame back to the switch. But
it is possible i missed a corner case or two...

> Nit: I don't really have a solution on how to order patches, but until
> the next 4 patches get in, I suppose we temporarily have broadcast
> flooding by the bridge "broken"? Ordering in the opposite way would
> probably result in an equally bad situation so...

Yes, it is an issue. I could put this patch last. We then get
duplication of broadcast...

Which is the lesser of two evils?

      Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/6] net: dsa: {e}dsa: set offload_fwd_mark on received packets
From: Florian Fainelli @ 2017-09-27 19:46 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: Vivien Didelot, netdev
In-Reply-To: <1506464764-12699-3-git-send-email-andrew@lunn.ch>

On 09/26/2017 03:25 PM, Andrew Lunn wrote:
> The software bridge needs to know if a packet has already been bridged
> by hardware offload to ports in the same hardware offload, in order
> that it does not re-flood them, causing duplicates. This is
> particularly true for broadcast and multicast traffic which the host
> has requested.
> 
> By setting offload_fwd_mark in the skb the bridge will only flood to
> ports in other offloads and other netifs. Set this flag in the DSA and
> EDSA tag driver.

Is not there some kind of forwarding code/reason code being provided in
the EDSA/DSA tag that tell you why this packet was sent to the CPU in
the first place?

What is the impact on non-broadcast traffic, e.g: multicast and unicast?

Nit: I don't really have a solution on how to order patches, but until
the next 4 patches get in, I suppose we temporarily have broadcast
flooding by the bridge "broken"? Ordering in the opposite way would
probably result in an equally bad situation so...

> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> ---
> 
> v2
> --
> For the moment, do this in the tag drivers, not the generic code.
> Once we get more test results from other switches, maybe move it back
> again.
> ---
>  net/dsa/tag_dsa.c  | 1 +
>  net/dsa/tag_edsa.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> index fbf9ca954773..ea6ada9d5016 100644
> --- a/net/dsa/tag_dsa.c
> +++ b/net/dsa/tag_dsa.c
> @@ -154,6 +154,7 @@ static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev,
>  	}
>  
>  	skb->dev = ds->ports[source_port].netdev;
> +	skb->offload_fwd_mark = 1;
>  
>  	return skb;
>  }
> diff --git a/net/dsa/tag_edsa.c b/net/dsa/tag_edsa.c
> index 76367ba1b2e2..a961b22a7018 100644
> --- a/net/dsa/tag_edsa.c
> +++ b/net/dsa/tag_edsa.c
> @@ -173,6 +173,7 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
>  	}
>  
>  	skb->dev = ds->ports[source_port].netdev;
> +	skb->offload_fwd_mark = 1;
>  
>  	return skb;
>  }
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Florian Fainelli @ 2017-09-27 19:33 UTC (permalink / raw)
  To: Vivien Didelot, Andrew Lunn; +Cc: David Miller, netdev
In-Reply-To: <8760c4t0df.fsf@weeman.i-did-not-set--mail-host-address--so-tickle-me>

On 09/27/2017 12:19 PM, Vivien Didelot wrote:
> Hi Andrew, Florian,
> 
> Andrew Lunn <andrew@lunn.ch> writes:
> 
>> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
>> enabled. Any change to enable hardware flooding needs careful testing
>> for lots of different configurations. This is another reason i don't
>> want to do it at the DSA level, until we have a good understanding
>> what it means in each individual driver.
> 
> Then if we are worried about how broadcast flooding is handled on
> different switches, we can provide a new .flood_broadcast(ds, vid)
> switch operation for the drivers to implement.

We don't really have a good visibility on the number of switches
requiring special configuration for broadcast addresses nor how this
would have to happen so it would be a tad difficult to define an
appropriate API with a single user.

In general, single user "generic" facilities tend to be biased towards
their particular problem space (c.f: devlink) so a generic interface to
call into HW specific details does not usually sell well...
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Vivien Didelot @ 2017-09-27 19:19 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli; +Cc: David Miller, netdev
In-Reply-To: <20170927184636.GD12394@lunn.ch>

Hi Andrew, Florian,

Andrew Lunn <andrew@lunn.ch> writes:

> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
> enabled. Any change to enable hardware flooding needs careful testing
> for lots of different configurations. This is another reason i don't
> want to do it at the DSA level, until we have a good understanding
> what it means in each individual driver.

Then if we are worried about how broadcast flooding is handled on
different switches, we can provide a new .flood_broadcast(ds, vid)
switch operation for the drivers to implement.

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Vivien Didelot @ 2017-09-27 18:59 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, netdev
In-Reply-To: <20170927183650.GC12394@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>> Adding the broadcast address to an Ethernet switch's FDB is pretty
>> generic and mv88e6xxx mustn't be the only driver doing this.
>
> Actually, it is. All the others seem to do this in hardware without
> needing an FDB. Since mv88e6xxx is the only one requiring it, it has
> to be done in the mv88e6xxx driver.

Adding the broadcast address from the DSA layer wouldn't hurt and make
things pretty obvious. This would also avoid drivers to get
unnecessarily complex. A .port_vlan_add implementation must remain
simple and mustn't do more than adding a VLAN entry.

Don't forget that we want the DSA drivers to be dump and have the core
logic of Ethernet switch handling resides in DSA core itself.

If some switch chips can flood broadcast without an FDB entry, good for
them, they can skip it. We will have the same issue for special L2
Multicast destination addresses, some switches have special bits to
consider them as management, some others don't and require to load the
ATU with them.

Regarding Marvell, what value do you have for the global FloodBC bit
(Global 2, offset 0x05)?

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available
From: Martin KaFai Lau @ 2017-09-27 19:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Wei Wang, Paolo Abeni, Linux Kernel Network Developers,
	David S. Miller, Hannes Frederic Sowa
In-Reply-To: <CANn89iJ0t8Vpukak3WEOA1x4M4hk-Z9hDkcSqG9+7qLsbuJQZg@mail.gmail.com>

On Wed, Sep 27, 2017 at 06:03:33PM +0000, Eric Dumazet wrote:
> >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> 
> >
> > Hi Paolo,
> >
> > Eric and I discussed about this issue recently as well :).
> >
> > What about the following change:
> >
> > diff --git a/include/net/dst.h b/include/net/dst.h
> > index 93568bd0a352..33e1d86bcef6 100644
> > --- a/include/net/dst.h
> > +++ b/include/net/dst.h
> > @@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
> >  static inline void dst_use(struct dst_entry *dst, unsigned long time)
> >  {
> >         dst_hold(dst);
> > -       dst->__use++;
> > -       dst->lastuse = time;
> > +       if (dst->lastuse != time) {
> > +               dst->__use++;
> > +               dst->lastuse = time;
> > +       }
> >  }
> >
> >  static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
> >  {
> > -       dst->__use++;
> > -       dst->lastuse = time;
> > +       if (dst->lastuse != time) {
> > +               dst->__use++;
> > +               dst->lastuse = time;
> > +       }
> >  }
> >
> >  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 26cc9f483b6d..e195f093add3 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
> > struct fib6_table *table,
> >
> >                 struct rt6_info *pcpu_rt;
> >
> > -               rt->dst.lastuse = jiffies;
> > -               rt->dst.__use++;
> > +               dst_use_noref(rt, jiffies);
> >                 pcpu_rt = rt6_get_pcpu_route(rt);
> >
> >                 if (pcpu_rt) {
> >
> >
> > This way we always only update dst->__use and dst->lastuse at most
> > once per jiffy. And we don't really need to update pcpu and then do
> > the copy over from pcpu_rt to rt operation.
> >
> > Another thing is that I don't really see any places making use of
> > dst->__use. So maybe we can also get rid of this dst->__use field?
> >
> > Thanks.
> > Wei
> 
> Paolo, given we are very close to send Wei awesome work about IPv6
> routing cache,
> could we ask you to wait few days before doing the same work from your side ?
> 
> Main issue is the rwlock, and we are converting it to full RCU.
+1

We can get a better picture of other optimizations once
the rwlock is removed.

> 
> You are sending patches that are making our job very difficult IMO.
> 
> We chose to mimic the change done in neighbour code years ago
> ( 0ed8ddf4045fcfcac36bad753dc4046118c603ec )
> 
> Thanks.

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Andrew Lunn @ 2017-09-27 18:46 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Vivien Didelot, David Miller, netdev
In-Reply-To: <9733da02-0b69-33f0-de8a-63bc5cae6bb4@gmail.com>

> What if I don't have CONFIG_BRIDGE_VLAN_FILTERING enabled, what happens
> in that case, would not this result in not programming the broadcast
> address?

Hi Florian

It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
enabled. Any change to enable hardware flooding needs careful testing
for lots of different configurations. This is another reason i don't
want to do it at the DSA level, until we have a good understanding
what it means in each individual driver.

     Andrew

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Florian Fainelli @ 2017-09-27 18:37 UTC (permalink / raw)
  To: Vivien Didelot, Andrew Lunn, David Miller; +Cc: netdev
In-Reply-To: <8737786lua.fsf@weeman.i-did-not-set--mail-host-address--so-tickle-me>

On 09/27/2017 11:24 AM, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn <andrew@lunn.ch> writes:
> 
>> +static int mv88e6xxx_port_add_broadcast(struct mv88e6xxx_chip *chip, int port,
>> +					u16 vid)
>> +{
>> +	const char broadcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
>> +
>> +	return mv88e6xxx_port_db_load_purge(
>> +		chip, port, broadcast, vid,
>> +		MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
> 
> Please don't do this. This is not a valid coding style and has already
> shown to be a bad example for other DSA drivers copying mv88e6xxx.
> 
> Simply declare u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC above...
> 
>> +}
>> +
>> +static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
>> +{
>> +	int port;
>> +	int err;
>> +
>> +	for (port = 0; port < mv88e6xxx_num_ports(chip); ++port) {
>> +		err = mv88e6xxx_port_add_broadcast(chip, port, vid);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>>  				    u16 vid, u8 member)
>>  {
>> @@ -1247,7 +1271,11 @@ static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>>  
>>  	vlan.member[port] = member;
>>  
>> -	return mv88e6xxx_vtu_loadpurge(chip, &vlan);
>> +	err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
>> +	if (err)
>> +		return err;
>> +
>> +	return mv88e6xxx_broadcast_setup(chip, vid);
>>  }
>>  
>>  static void mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port,
>> @@ -2025,6 +2053,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
>>  	if (err)
>>  		goto unlock;
>>  
>> +	err = mv88e6xxx_broadcast_setup(chip, 0);
>> +	if (err)
>> +		goto unlock;
>> +
>>  	err = mv88e6xxx_pot_setup(chip);
>>  	if (err)
>>  		goto unlock;
> 
> 
> Adding the broadcast address to an Ethernet switch's FDB is pretty
> generic and mv88e6xxx mustn't be the only driver doing this.

I have not had time to test Andrew's IGMP patch set on bcm_sf2/b53, but
while I agree that adding a broadcast address using a FDB entry is
generic in premise, we don't know yet whether other switch drivers need
that or not, so until we do, it seems like Andrew's approach is
appropriate here by keeping this local to mv88e6xxx.

> 
> They do not have to care about the broadcast address, this is just a
> standard MDB entry for them. This must be moved up to the DSA layer.
> 
> Adding the broadcast address in dsa_port_vlan_add and dsa_port_enable
> like this should be enough: http://ix.io/AmS
> 

What if I don't have CONFIG_BRIDGE_VLAN_FILTERING enabled, what happens
in that case, would not this result in not programming the broadcast
address?
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Andrew Lunn @ 2017-09-27 18:36 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: David Miller, netdev
In-Reply-To: <8737786lua.fsf@weeman.i-did-not-set--mail-host-address--so-tickle-me>

> Adding the broadcast address to an Ethernet switch's FDB is pretty
> generic and mv88e6xxx mustn't be the only driver doing this.

Actually, it is. All the others seem to do this in hardware without
needing an FDB. Since mv88e6xxx is the only one requiring it, it has
to be done in the mv88e6xxx driver.

   Andrew

^ permalink raw reply

* Re: [patch net-next v3 00/12] mlxsw: Add support for offloading IPv4 multicast routes
From: David Miller @ 2017-09-27 18:33 UTC (permalink / raw)
  To: jiri; +Cc: netdev, yotamg, idosch, mlxsw, nikolay, andrew, linyunsheng
In-Reply-To: <20170927062322.5476-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Wed, 27 Sep 2017 08:23:10 +0200

> This patch-set introduces offloading of the kernel IPv4 multicast router
> logic in the Spectrum driver.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/6] mv88e6xxx broadcast flooding in hardware
From: Vivien Didelot @ 2017-09-27 18:28 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506464764-12699-1-git-send-email-andrew@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> This patchset makes the mv88e6xxx driver perform flooding in hardware,
> rather than let the software bridge perform the flooding. This is a
> prerequisite for IGMP snooping on the bridge interface.
>
> In order to make hardware broadcasting work, a few other issues need
> fixing or improving. SWITCHDEV_ATTR_ID_PORT_PARENT_ID is broken, which
> is apparent when testing on the ZII devel board with multiple
> switches.
>
> Some of these patches are taken from a previous RFC patchset of IGMP
> support. Hence the v2 comments...

mlxsw and others return BR_FLOOD and BR_MCAST_FLOOD in
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT, is this something DSA needs
here?


Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware
From: Vivien Didelot @ 2017-09-27 18:24 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506464764-12699-7-git-send-email-andrew@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> +static int mv88e6xxx_port_add_broadcast(struct mv88e6xxx_chip *chip, int port,
> +					u16 vid)
> +{
> +	const char broadcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
> +
> +	return mv88e6xxx_port_db_load_purge(
> +		chip, port, broadcast, vid,
> +		MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);

Please don't do this. This is not a valid coding style and has already
shown to be a bad example for other DSA drivers copying mv88e6xxx.

Simply declare u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC above...

> +}
> +
> +static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
> +{
> +	int port;
> +	int err;
> +
> +	for (port = 0; port < mv88e6xxx_num_ports(chip); ++port) {
> +		err = mv88e6xxx_port_add_broadcast(chip, port, vid);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
>  static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>  				    u16 vid, u8 member)
>  {
> @@ -1247,7 +1271,11 @@ static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>  
>  	vlan.member[port] = member;
>  
> -	return mv88e6xxx_vtu_loadpurge(chip, &vlan);
> +	err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
> +	if (err)
> +		return err;
> +
> +	return mv88e6xxx_broadcast_setup(chip, vid);
>  }
>  
>  static void mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port,
> @@ -2025,6 +2053,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
>  	if (err)
>  		goto unlock;
>  
> +	err = mv88e6xxx_broadcast_setup(chip, 0);
> +	if (err)
> +		goto unlock;
> +
>  	err = mv88e6xxx_pot_setup(chip);
>  	if (err)
>  		goto unlock;


Adding the broadcast address to an Ethernet switch's FDB is pretty
generic and mv88e6xxx mustn't be the only driver doing this.

They do not have to care about the broadcast address, this is just a
standard MDB entry for them. This must be moved up to the DSA layer.

Adding the broadcast address in dsa_port_vlan_add and dsa_port_enable
like this should be enough: http://ix.io/AmS


Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available
From: Cong Wang @ 2017-09-27 18:25 UTC (permalink / raw)
  To: Wei Wang
  Cc: Paolo Abeni, Eric Dumazet, Martin KaFai Lau,
	Linux Kernel Network Developers, David S. Miller,
	Hannes Frederic Sowa
In-Reply-To: <CAEA6p_DZxGesjymubWoWx7hQjaZ7Vdchh=cdOW9tAr4F7QvzXg@mail.gmail.com>

On Wed, Sep 27, 2017 at 10:44 AM, Wei Wang <weiwan@google.com> wrote:
> Another thing is that I don't really see any places making use of
> dst->__use. So maybe we can also get rid of this dst->__use field?

It is used in rtnl_put_cacheinfo():

        struct rta_cacheinfo ci = {
                .rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse),
                .rta_used = dst->__use,
                .rta_clntref = atomic_read(&(dst->__refcnt)),
                .rta_error = error,
                .rta_id =  id,
        };

^ permalink raw reply

* Re: [PATCH v2 16/16] net: Add support for networking over Thunderbolt cable
From: David Miller @ 2017-09-27 18:23 UTC (permalink / raw)
  To: mika.westerberg
  Cc: gregkh, andreas.noever, michael.jamet, yehezkel.bernat,
	amir.jer.levy, Mario.Limonciello, lukas, andriy.shevchenko,
	andrew, linux-kernel, netdev
In-Reply-To: <20170927172702.GE4630@lahna.fi.intel.com>

From: Mika Westerberg <mika.westerberg@linux.intel.com>
Date: Wed, 27 Sep 2017 20:27:02 +0300

> On Wed, Sep 27, 2017 at 09:27:09AM -0700, David Miller wrote:
>> From: Mika Westerberg <mika.westerberg@linux.intel.com>
>> Date: Wed, 27 Sep 2017 16:42:38 +0300
>> 
>> > Using build_skb() then would require to allocate larger buffer, that
>> > includes NET_SKB_PAD + SKB_DATA_ALIGN(skb_shared_info) and that exceeds
>> > page size. Is this something supported by build_skb()? It was not clear
>> > to me based on the code and other users of build_skb() but I may be
>> > missing something.
>> 
>> You need NET_SKB_PAD before and SKB_DATA_ALIGN(skb_shared_info) afterwards.
>> An order 1 page, if that's what you need, should work just fine.
> 
> I mean in order to fit a single ThunderboltIP frame, I would need to
> allocate NET_SKB_PAD+4096+SKB_DATA_ALIGN(skb_shared_info) size buffer.

Which would be an order 1 page or 8192 bytes.

> Is that still fine for build_skb()? Also can I use that with
> skb_add_rx_frag() which seem to take single page?

Again, an order 1 page should work fine.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available
From: Eric Dumazet @ 2017-09-27 18:03 UTC (permalink / raw)
  To: Wei Wang
  Cc: Paolo Abeni, Martin KaFai Lau, Linux Kernel Network Developers,
	David S. Miller, Hannes Frederic Sowa
In-Reply-To: <CAEA6p_DZxGesjymubWoWx7hQjaZ7Vdchh=cdOW9tAr4F7QvzXg@mail.gmail.com>

>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c

>
> Hi Paolo,
>
> Eric and I discussed about this issue recently as well :).
>
> What about the following change:
>
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 93568bd0a352..33e1d86bcef6 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
>  static inline void dst_use(struct dst_entry *dst, unsigned long time)
>  {
>         dst_hold(dst);
> -       dst->__use++;
> -       dst->lastuse = time;
> +       if (dst->lastuse != time) {
> +               dst->__use++;
> +               dst->lastuse = time;
> +       }
>  }
>
>  static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
>  {
> -       dst->__use++;
> -       dst->lastuse = time;
> +       if (dst->lastuse != time) {
> +               dst->__use++;
> +               dst->lastuse = time;
> +       }
>  }
>
>  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 26cc9f483b6d..e195f093add3 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
> struct fib6_table *table,
>
>                 struct rt6_info *pcpu_rt;
>
> -               rt->dst.lastuse = jiffies;
> -               rt->dst.__use++;
> +               dst_use_noref(rt, jiffies);
>                 pcpu_rt = rt6_get_pcpu_route(rt);
>
>                 if (pcpu_rt) {
>
>
> This way we always only update dst->__use and dst->lastuse at most
> once per jiffy. And we don't really need to update pcpu and then do
> the copy over from pcpu_rt to rt operation.
>
> Another thing is that I don't really see any places making use of
> dst->__use. So maybe we can also get rid of this dst->__use field?
>
> Thanks.
> Wei

Paolo, given we are very close to send Wei awesome work about IPv6
routing cache,
could we ask you to wait few days before doing the same work from your side ?

Main issue is the rwlock, and we are converting it to full RCU.

You are sending patches that are making our job very difficult IMO.

We chose to mimic the change done in neighbour code years ago
( 0ed8ddf4045fcfcac36bad753dc4046118c603ec )

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available
From: Wei Wang @ 2017-09-27 17:44 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet, Martin KaFai Lau
  Cc: Linux Kernel Network Developers, David S. Miller,
	Hannes Frederic Sowa
In-Reply-To: <9d32116e61596b0de6a584b6cf05cae3ce7abb56.1506532145.git.pabeni@redhat.com>

On Wed, Sep 27, 2017 at 10:14 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> When a host is under high ipv6 load, the updates of the ingress
> route '__use' field are a source of relevant contention: such
> field is updated for each packet and several cores can access
> concurrently the dst, even if percpu dst entries are available
> and used.
>
> The __use value is just a rough indication of the dst usage: is
> already updated concurrently from multiple CPUs without any lock,
> so we can decrease the contention leveraging the percpu dst to perform
> __use bulk updates: if a per cpu dst entry is found, we account on
> such entry and we flush the percpu counter once per jiffy.
>
> Performace gain under UDP flood is as follows:
>
> nr RX queues    before          after           delta
>                 kpps            kpps            (%)
> 2               2316            2688            16
> 3               3033            3605            18
> 4               3963            4328            9
> 5               4379            5253            19
> 6               5137            6000            16
>
> Performance gain under TCP syn flood should be measurable as well.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv6/route.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 26cc9f483b6d..e69f304de950 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1170,12 +1170,24 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
>
>                 struct rt6_info *pcpu_rt;
>
> -               rt->dst.lastuse = jiffies;
> -               rt->dst.__use++;
>                 pcpu_rt = rt6_get_pcpu_route(rt);
>
>                 if (pcpu_rt) {
> +                       unsigned long ts;
> +
>                         read_unlock_bh(&table->tb6_lock);
> +
> +                       /* do lazy updates of rt->dst->__use, at most once
> +                        * per jiffy, to avoid contention on such cacheline.
> +                        */
> +                       ts = jiffies;
> +                       pcpu_rt->dst.__use++;
> +                       if (pcpu_rt->dst.lastuse != ts) {
> +                               rt->dst.__use += pcpu_rt->dst.__use;
> +                               rt->dst.lastuse = ts;
> +                               pcpu_rt->dst.__use = 0;
> +                               pcpu_rt->dst.lastuse = ts;
> +                       }
>                 } else {
>                         /* We have to do the read_unlock first
>                          * because rt6_make_pcpu_route() may trigger
> @@ -1185,6 +1197,8 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
>                         read_unlock_bh(&table->tb6_lock);
>                         pcpu_rt = rt6_make_pcpu_route(rt);
>                         dst_release(&rt->dst);
> +                       rt->dst.lastuse = jiffies;
> +                       rt->dst.__use++;
>                 }
>
>                 trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6);
> --
> 2.13.5
>

Hi Paolo,

Eric and I discussed about this issue recently as well :).

What about the following change:

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..33e1d86bcef6 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
 static inline void dst_use(struct dst_entry *dst, unsigned long time)
 {
        dst_hold(dst);
-       dst->__use++;
-       dst->lastuse = time;
+       if (dst->lastuse != time) {
+               dst->__use++;
+               dst->lastuse = time;
+       }
 }

 static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-       dst->__use++;
-       dst->lastuse = time;
+       if (dst->lastuse != time) {
+               dst->__use++;
+               dst->lastuse = time;
+       }
 }

 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..e195f093add3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
struct fib6_table *table,

                struct rt6_info *pcpu_rt;

-               rt->dst.lastuse = jiffies;
-               rt->dst.__use++;
+               dst_use_noref(rt, jiffies);
                pcpu_rt = rt6_get_pcpu_route(rt);

                if (pcpu_rt) {


This way we always only update dst->__use and dst->lastuse at most
once per jiffy. And we don't really need to update pcpu and then do
the copy over from pcpu_rt to rt operation.

Another thing is that I don't really see any places making use of
dst->__use. So maybe we can also get rid of this dst->__use field?

Thanks.
Wei

^ permalink raw reply related

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Alexei Starovoitov @ 2017-09-27 17:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Daniel Borkmann, davem, peter.waskiewicz.jr,
	jakub.kicinski, netdev, Andy Gospodarek
In-Reply-To: <20170927165457.4265bfc3@redhat.com>

On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 27 Sep 2017 06:35:40 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >   
> > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > >> [...]  
> > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > >>> actually works extremely well).    
> > >>
> > >> Meaning you let all the XDP_PASS packets get processed on a
> > >> different CPU, so you can reserve the whole CPU just for
> > >> prefiltering, right?   
> > > 
> > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > The trick is using the map-flush point, to transfer packets in bulk to
> > > the remote CPU (single call IPC is too slow), but at the same time
> > > flush single packets if NAPI didn't see a bulk.
> > >   
> > >> Do you have some numbers to share at this point, just curious when
> > >> you mention it works extremely well.  
> > > 
> > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > 
> > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > 
> > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > the 11.9Mpps DDoS attack is hitting.
> > > 
> > > Running XDP/eBPF prog_num:4
> > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > XDP-RX          0       12,030,471     11,966,982  0          
> > > XDP-RX          total   12,030,471     11,966,982 
> > > cpumap-enqueue    0:2   63,488         0           0          
> > > cpumap-enqueue  sum:2   63,488         0           0          
> > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > cpumap_kthread  total   63,488         0           0          
> > > redirect_err    total   0              0          
> > > 
> > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate         
> > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > 
> > > 16384  87380  1024     1024    10.00    12735.97   
> > > 16384  87380 
> > > 
> > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > >   
> > 
> > Just curious could you also try this with RPS enabled (or does this have
> > RPS enabled). RPS should effectively do the same thing but higher in the
> > stack. I'm curious what the delta would be. Might be another interesting
> > case and fairly easy to setup if you already have the above scripts.
> 
> Yes, I'm essentially competing with RSP, thus such a comparison is very
> relevant...
> 
> This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let
> other 4 CPUS process packet.
> 
> Summary of RPS (Receive Packet Steering) performance:
>  * End result is 6.3 Mpps max performance
>  * netperf TCP_CRR is 1 trans/sec.
>  * Each RX-RPS CPU stall at ~3.2Mpps.
> 
> The full test report below with setup:
> 
> The mask needed::
> 
>  perl -e 'printf "%b\n",0x3C'
>  111100
> 
> RPS setup::
> 
>  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> 
>  for N in $(seq 0 5) ; do \
>    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
>    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
>    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
>  done
> 
> Reduce RX queues to two ::
> 
>  ethtool -L ixgbe1 combined 2
> 
> IRQ align to CPU numbers::
> 
>  $ ~/setup01.sh
>  Not root, running with sudo
>   --- Disable Ethernet flow-control ---
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>   --- Align IRQs ---
>  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
>  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
>  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> 
> $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> 
> Generator is sending: 12,715,782 tx_packets /sec
> 
>  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
>     -d 172.16.0.2 -t8
> 
> $ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives                    6346544            0.0
> IpInDelivers                    6346544            0.0
> IpOutRequests                   1020               0.0
> IcmpOutMsgs                     1020               0.0
> IcmpOutDestUnreachs             1020               0.0
> IcmpMsgOutType3                 1020               0.0
> UdpNoPorts                      6346898            0.0
> IpExtInOctets                   291964714          0.0
> IpExtOutOctets                  73440              0.0
> IpExtInNoECTPkts                6347063            0.0
> 
> $ mpstat -P ALL -u -I SCPU -I SUM
> 
> Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> 
> Average:     CPU    intr/s
> Average:     all 152067.22
> Average:       0  50064.73
> Average:       1  50089.35
> Average:       2  45095.17
> Average:       3  44875.04
> Average:       4  44906.32
> Average:       5  45152.08
> 
> Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> 
> From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> 
> Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> 
> Notice, each RX-CPU can only process 3.1Mpps.
> 
> RPS RX-CPU(0):
> 
>  # Overhead  CPU  Symbol
>  # ........  ...  .......................................
>  #
>     11.72%  000  [k] ixgbe_poll
>     11.29%  000  [k] _raw_spin_lock
>     10.35%  000  [k] dev_gro_receive
>      8.36%  000  [k] __build_skb
>      7.35%  000  [k] __skb_get_hash
>      6.22%  000  [k] enqueue_to_backlog
>      5.89%  000  [k] __skb_flow_dissect
>      4.43%  000  [k] inet_gro_receive
>      4.19%  000  [k] ___slab_alloc
>      3.90%  000  [k] queued_spin_lock_slowpath
>      3.85%  000  [k] kmem_cache_alloc
>      3.06%  000  [k] build_skb
>      2.66%  000  [k] get_rps_cpu
>      2.57%  000  [k] napi_gro_receive
>      2.34%  000  [k] eth_type_trans
>      1.81%  000  [k] __cmpxchg_double_slab.isra.61
>      1.47%  000  [k] ixgbe_alloc_rx_buffers
>      1.43%  000  [k] get_partial_node.isra.81
>      0.84%  000  [k] swiotlb_sync_single
>      0.74%  000  [k] udp4_gro_receive
>      0.73%  000  [k] netif_receive_skb_internal
>      0.72%  000  [k] udp_gro_receive
>      0.63%  000  [k] skb_gro_reset_offset
>      0.49%  000  [k] __skb_flow_get_ports
>      0.48%  000  [k] llist_add_batch
>      0.36%  000  [k] swiotlb_sync_single_for_cpu
>      0.34%  000  [k] __slab_alloc
> 
> 
> Remote RPS-CPU(3) getting packets::
> 
>  # Overhead  CPU  Symbol
>  # ........  ...  ..............................................
>  #
>     33.02%  003  [k] poll_idle
>     10.99%  003  [k] __netif_receive_skb_core
>     10.45%  003  [k] page_frag_free
>      8.49%  003  [k] ip_rcv
>      4.19%  003  [k] fib_table_lookup
>      2.84%  003  [k] __udp4_lib_rcv
>      2.81%  003  [k] __slab_free
>      2.23%  003  [k] __udp4_lib_lookup
>      2.09%  003  [k] ip_route_input_rcu
>      2.07%  003  [k] kmem_cache_free
>      2.06%  003  [k] udp_v4_early_demux
>      1.73%  003  [k] ip_rcv_finish

Very interesting data.
So above perf report compares to xdp-redirect-cpu this one:
Perf top on a CPU(3) that have to alloc and free SKBs etc.

# Overhead  CPU  Symbol
# ........  ...  .......................................
#
    15.51%  003  [k] fib_table_lookup
     8.91%  003  [k] cpu_map_kthread_run
     8.04%  003  [k] build_skb
     7.88%  003  [k] page_frag_free
     5.13%  003  [k] kmem_cache_alloc
     4.76%  003  [k] ip_route_input_rcu
     4.59%  003  [k] kmem_cache_free
     4.02%  003  [k] __udp4_lib_rcv
     3.20%  003  [k] fib_validate_source
     3.02%  003  [k] __netif_receive_skb_core
     3.02%  003  [k] udp_v4_early_demux
     2.90%  003  [k] ip_rcv
     2.80%  003  [k] ip_rcv_finish

right?
and in RPS case the consumer cpu is 33% idle whereas in redirect-cpu
you can load it up all the way.
Am I interpreting all this correctly that with RPS cpu0 cannot
distributed the packets to other cpus fast enough and that's
a bottleneck?
whereas in redirect-cpu you're doing early packet distribution
before skb alloc?
So in other words with redirect-cpu all consumer cpus are doing
skb alloc and in RPS cpu0 is allocating skbs for all ?
and that's where 6M->12M performance gain comes from?

^ permalink raw reply

* [PATCH V3] r8152:  add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-27 17:28 UTC (permalink / raw)
  To: Hayes Wang, Oliver Neukum
  Cc: linux-usb, David S . Miller, LKML, netdev, Grant Grundler

This linksys dongle by default comes up in cdc_ether mode.
This patch allows r8152 to claim the device:
   Bus 002 Device 002: ID 13b1:0041 Linksys

Signed-off-by: Grant Grundler <grundler@chromium.org>
---
 drivers/net/usb/cdc_ether.c | 10 ++++++++++
 drivers/net/usb/r8152.c     |  2 ++
 2 files changed, 12 insertions(+)

V3: for backwards compat, add #ifdef CONFIG_USB_RTL8152 around
    the cdc_ether blacklist entry so the cdc_ether driver can
    still claim the device if r8152 driver isn't configured.

V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8ab281b478f2..446dcc0f1f70 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -546,6 +546,7 @@ static const struct driver_info wwan_info = {
 #define DELL_VENDOR_ID		0x413C
 #define REALTEK_VENDOR_ID	0x0bda
 #define SAMSUNG_VENDOR_ID	0x04e8
+#define LINKSYS_VENDOR_ID	0x13b1
 #define LENOVO_VENDOR_ID	0x17ef
 #define NVIDIA_VENDOR_ID	0x0955
 #define HP_VENDOR_ID		0x03f0
@@ -737,6 +738,15 @@ static const struct usb_device_id	products[] = {
 	.driver_info = 0,
 },
 
+#ifdef CONFIG_USB_RTL8152
+/* Linksys USB3GIGV1 Ethernet Adapter */
+{
+	USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, USB_CLASS_COMM,
+			USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+	.driver_info = 0,
+},
+#endif
+
 /* ThinkPad USB-C Dock (based on Realtek RTL8153) */
 {
 	USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3062, USB_CLASS_COMM,
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ceb78e2ea4f0..941ece08ba78 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -613,6 +613,7 @@ enum rtl8152_flags {
 #define VENDOR_ID_MICROSOFT		0x045e
 #define VENDOR_ID_SAMSUNG		0x04e8
 #define VENDOR_ID_LENOVO		0x17ef
+#define VENDOR_ID_LINKSYS		0x13b1
 #define VENDOR_ID_NVIDIA		0x0955
 
 #define MCU_TYPE_PLA			0x0100
@@ -5316,6 +5317,7 @@ static const struct usb_device_id rtl8152_table[] = {
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x720c)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7214)},
+	{REALTEK_USB_DEVICE(VENDOR_ID_LINKSYS, 0x0041)},
 	{REALTEK_USB_DEVICE(VENDOR_ID_NVIDIA,  0x09ff)},
 	{}
 };
-- 
2.14.2.822.g60be5d43e6-goog

^ permalink raw reply related

* Re: [PATCH v2 16/16] net: Add support for networking over Thunderbolt cable
From: Mika Westerberg @ 2017-09-27 17:27 UTC (permalink / raw)
  To: David Miller
  Cc: gregkh, andreas.noever, michael.jamet, yehezkel.bernat,
	amir.jer.levy, Mario.Limonciello, lukas, andriy.shevchenko,
	andrew, linux-kernel, netdev
In-Reply-To: <20170927.092709.1826177647592316221.davem@davemloft.net>

On Wed, Sep 27, 2017 at 09:27:09AM -0700, David Miller wrote:
> From: Mika Westerberg <mika.westerberg@linux.intel.com>
> Date: Wed, 27 Sep 2017 16:42:38 +0300
> 
> > Using build_skb() then would require to allocate larger buffer, that
> > includes NET_SKB_PAD + SKB_DATA_ALIGN(skb_shared_info) and that exceeds
> > page size. Is this something supported by build_skb()? It was not clear
> > to me based on the code and other users of build_skb() but I may be
> > missing something.
> 
> You need NET_SKB_PAD before and SKB_DATA_ALIGN(skb_shared_info) afterwards.
> An order 1 page, if that's what you need, should work just fine.

I mean in order to fit a single ThunderboltIP frame, I would need to
allocate NET_SKB_PAD+4096+SKB_DATA_ALIGN(skb_shared_info) size buffer.
Is that still fine for build_skb()? Also can I use that with
skb_add_rx_frag() which seem to take single page?

ThunderboltIP protocol basically takes advantage of TSO/LRO but it
actually does not do any segmentation. Instead it just splits the 64kB
large package into smaller 4k frames (which each include 12 byte header)
and pushes those over the Thunderbolt medium. The receiver side then
does the opposite.

Thanks and sorry for dummy questions. I'm just not too familiar with
the networking subsystem (yet).

^ permalink raw reply

* [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available
From: Paolo Abeni @ 2017-09-27 17:14 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Hannes Frederic Sowa, Wei Wang

When a host is under high ipv6 load, the updates of the ingress
route '__use' field are a source of relevant contention: such
field is updated for each packet and several cores can access
concurrently the dst, even if percpu dst entries are available
and used.

The __use value is just a rough indication of the dst usage: is
already updated concurrently from multiple CPUs without any lock,
so we can decrease the contention leveraging the percpu dst to perform
__use bulk updates: if a per cpu dst entry is found, we account on
such entry and we flush the percpu counter once per jiffy.

Performace gain under UDP flood is as follows:

nr RX queues	before		after		delta
		kpps		kpps		(%)
2		2316		2688		16
3		3033		3605		18
4		3963		4328		9
5		4379		5253		19
6		5137		6000		16

Performance gain under TCP syn flood should be measurable as well.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv6/route.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..e69f304de950 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1170,12 +1170,24 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,

 		struct rt6_info *pcpu_rt;

-		rt->dst.lastuse = jiffies;
-		rt->dst.__use++;
 		pcpu_rt = rt6_get_pcpu_route(rt);

 		if (pcpu_rt) {
+			unsigned long ts;
+
 			read_unlock_bh(&table->tb6_lock);
+
+			/* do lazy updates of rt->dst->__use, at most once
+			 * per jiffy, to avoid contention on such cacheline.
+			 */
+			ts = jiffies;
+			pcpu_rt->dst.__use++;
+			if (pcpu_rt->dst.lastuse != ts) {
+				rt->dst.__use += pcpu_rt->dst.__use;
+				rt->dst.lastuse = ts;
+				pcpu_rt->dst.__use = 0;
+				pcpu_rt->dst.lastuse = ts;
+			}
 		} else {
 			/* We have to do the read_unlock first
 			 * because rt6_make_pcpu_route() may trigger
@@ -1185,6 +1197,8 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 			read_unlock_bh(&table->tb6_lock);
 			pcpu_rt = rt6_make_pcpu_route(rt);
 			dst_release(&rt->dst);
+			rt->dst.lastuse = jiffies;
+			rt->dst.__use++;
 		}

 		trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6);
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH 2/2] uapi: add a compatibility layer between linux/uio.h and glibc
From: Lee Duncan @ 2017-09-27 16:46 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Dmitry V. Levin, David Miller, Lee Duncan
In-Reply-To: <20170221201914.GA28360@altlinux.org>

Ping? I never saw any response to this.

On Wed, 22 Feb 2017 05:29:47 +0300, Dmitry V Levin wrote:
> Do not define struct iovec in linux/uio.h when <sys/uio.h> or <fcntl.h>
> is already included and provides these definitions.
> 
> This fixes the following compilation error when <sys/uio.h> or <fcntl.h>
> is included before <linux/uio.h>:
> 
> /usr/include/linux/uio.h:16:8: error: redefinition of 'struct iovec'
> 
> Signed-off-by: Dmitry V. Levin <ldv@...linux.org>
> ---
>  include/uapi/linux/libc-compat.h | 10 ++++++++++
>  include/uapi/linux/uio.h         |  3 +++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
> index 481e3b1..9b88586 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -205,6 +205,13 @@
>  #define __UAPI_DEF_TIMEZONE		1
>  #endif
>  
> +/* Coordinate with glibc bits/uio.h header. */
> +#if defined(_SYS_UIO_H) || defined(_FCNTL_H)
> +#define __UAPI_DEF_IOVEC		0
> +#else
> +#define __UAPI_DEF_IOVEC		1
> +#endif
> +
>  /* Definitions for xattr.h */
>  #if defined(_SYS_XATTR_H)
>  #define __UAPI_DEF_XATTR		0
> @@ -261,6 +268,9 @@
>  #define __UAPI_DEF_TIMEVAL			1
>  #define __UAPI_DEF_TIMEZONE			1
>  
> +/* Definitions for uio.h */
> +#define __UAPI_DEF_IOVEC		1
> +
>  /* Definitions for xattr.h */
>  #define __UAPI_DEF_XATTR		1
>  
> diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
> index 2731d56..e6e12cf 100644
> --- a/include/uapi/linux/uio.h
> +++ b/include/uapi/linux/uio.h
> @@ -9,15 +9,18 @@
>  #ifndef _UAPI__LINUX_UIO_H
>  #define _UAPI__LINUX_UIO_H
>  
> +#include <linux/libc-compat.h>
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  
>  
> +#if __UAPI_DEF_IOVEC
>  struct iovec
>  {
>  	void __user *iov_base;	/* BSD uses caddr_t (1003.1g requires void *) */
>  	__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
>  };
> +#endif /* __UAPI_DEF_IOVEC */
>  
>  /*
>   *	UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
> -- 
> ldv

-- 
Lee Duncan
SUSE Labs

^ permalink raw reply

* RE: [PATCH net v2] net: dsa: mv88e6xxx: lock mutex when freeing IRQs
From: David Laight @ 2017-09-27 16:40 UTC (permalink / raw)
  To: 'Andrew Lunn'
  Cc: 'Vivien Didelot', netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel@savoirfairelinux.com,
	David S. Miller, Florian Fainelli
In-Reply-To: <20170927130711.GD13516@lunn.ch>

From: Andrew Lunn
> Sent: 27 September 2017 14:07
> To: David Laight
> On Wed, Sep 27, 2017 at 09:06:01AM +0000, David Laight wrote:
> > From: Vivien Didelot
> > > Sent: 26 September 2017 19:57
> > > mv88e6xxx_g2_irq_free locks the registers mutex, but not
> > > mv88e6xxx_g1_irq_free, which results in a stack trace from
> > > assert_reg_lock when unloading the mv88e6xxx module. Fix this.
> > >
> > > Fixes: 3460a5770ce9 ("net: dsa: mv88e6xxx: Mask g1 interrupts and free interrupt")
> > > Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> > > ---
> > >  drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> > > index c6678aa9b4ef..e7ff7483d2fb 100644
> > > --- a/drivers/net/dsa/mv88e6xxx/chip.c
> > > +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> > > @@ -3947,7 +3947,9 @@ static void mv88e6xxx_remove(struct mdio_device *mdiodev)
> > >  	if (chip->irq > 0) {
> > >  		if (chip->info->g2_irqs > 0)
> > >  			mv88e6xxx_g2_irq_free(chip);
> > > +		mutex_lock(&chip->reg_lock);
> > >  		mv88e6xxx_g1_irq_free(chip);
> > > +		mutex_unlock(&chip->reg_lock);
> >
> > Isn't the irq_free code likely to have to sleep waiting for any
> > ISR to complete??
> 
> Hi David
> 
> Possibly. But this is a mutex, not a spinlock. So sleeping is O.K.
> Or am i missing something?

Looks like I was missing some coffee :-)

	David

^ permalink raw reply

* Re: [PATCH V2] r8152: add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-27 16:39 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Doug Anderson, Grant Grundler, David S . Miller,
	Greg Kroah-Hartman, Hayes Wang, LKML, linux-usb, netdev
In-Reply-To: <1506496556.28482.3.camel@suse.com>

On Wed, Sep 27, 2017 at 12:15 AM, Oliver Neukum <oneukum@suse.com> wrote:
> Am Dienstag, den 26.09.2017, 08:19 -0700 schrieb Doug Anderson:
>>
>> I know that for at least some of the adapters in the CDC Ethernet
>> blacklist it was claimed that the CDC Ethernet support in the adapter
>> was kinda broken anyway so the blacklist made sense.  ...but for the
>> Linksys Gigabit adapter the CDC Ethernet driver seems to work OK, it's
>> just not quite as full featured / efficient as the R8152 driver.
>>
>> Is that not a concern?  I guess you could tell people in this
>> situation that they simply need to enable the R8152 driver to get
>> continued support for their Ethernet adapter?
>
> Hi,
>
> yes, it is a valid concern. An #ifdef will be needed.

Good idea - I will post V3 shortly.

I'm assuming you mean to add #ifdef CONFIG_USB_RTL8152 around the
blacklist entry in cdc_ether driver.

cheers,
grant

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox