Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v5 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-30  1:07 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161130005906.GB29591@ast-mbp.thefacebook.com>

On 11/29/16 5:59 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 05:43:08PM -0700, David Ahern wrote:
>> On 11/29/16 1:01 PM, Alexei Starovoitov wrote:
>>> Could you also expose sk_protcol and sk_type as read only fields?
>>
>> Those are bitfields in struct sock, so can't use offsetof or sizeof. Any existing use cases that try to load a bitfield in a bpf that I can look at?
> 
> pkt_type, vlan are also bitfileds in skb. Please see convert_skb_access()
> There is a bit of ugliness due to __BIG_ENDIAN_BITFIELD though..
> 

Given the added complexity I'd prefer to defer this second use case to a follow on patch set. This one introduces the infra for sockets and I don't see anything needing to change with it to add the read of 3 more sock elements. Agree?

^ permalink raw reply

* Re: [PATCH net-next v3 0/4] Fix OdroidC2 Gigabit Tx link issue
From: David Miller @ 2016-11-30  1:13 UTC (permalink / raw)
  To: f.fainelli-Re5JQEeQqe8AvxtiuMwx3w
  Cc: jbrunet-rdvid1DuHRBWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, carlo-KA+7E9HrN00dnm+yROfE0A,
	khilman-rdvid1DuHRBWk0Htik3J/w, peppe.cavallaro-qxv4g6HH51o,
	alexandre.torgue-qxv4g6HH51o,
	martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg,
	neolynx-Re5JQEeQqe8AvxtiuMwx3w, andrew-g2DYL2Zd6BY,
	narmstrong-rdvid1DuHRBWk0Htik3J/w,
	linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <e14a3b0c-dc34-be14-48b3-518a0ad0c080-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Tue, 29 Nov 2016 16:43:20 -0800

> On 11/29/2016 04:38 PM, David Miller wrote:
>> From: Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
>> Date: Mon, 28 Nov 2016 10:46:45 +0100
>> 
>>> This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
>>> The platform seems to enter LPI on the Rx path too often while performing
>>> relatively high TX transfer. This eventually break the link (both Tx and
>>> Rx), and require to bring the interface down and up again to get the Rx
>>> path working again.
>>>
>>> The root cause of this issue is not fully understood yet but disabling EEE
>>> advertisement on the PHY prevent this feature to be negotiated.
>>> With this change, the link is stable and reliable, with the expected
>>> throughput performance.
>>>
>>> The patchset adds options in the generic phy driver to disable EEE
>>> advertisement, through device tree. The way it is done is very similar
>>> to the handling of the max-speed property.
>> 
>> Patches 1-3 applied to net-next, thanks.
> 
> Meh, there was a v4 submitted shortly after, and I objected to the whole
> idea of using that kind of Device Tree properties to disable EEE, we can
> send reverts though..

Sorry, I lost this in all the discussion, I can revert.

Just send me a revert of the entire merge commit
a152c91889556df17ca6d8ea134fb2cb4ac9f893 with a short
description of why and I'll apply it.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v3 0/4] Fix OdroidC2 Gigabit Tx link issue
From: Florian Fainelli @ 2016-11-30  1:15 UTC (permalink / raw)
  To: David Miller
  Cc: jbrunet-rdvid1DuHRBWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, carlo-KA+7E9HrN00dnm+yROfE0A,
	khilman-rdvid1DuHRBWk0Htik3J/w, peppe.cavallaro-qxv4g6HH51o,
	alexandre.torgue-qxv4g6HH51o,
	martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg,
	neolynx-Re5JQEeQqe8AvxtiuMwx3w, andrew-g2DYL2Zd6BY,
	narmstrong-rdvid1DuHRBWk0Htik3J/w,
	linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161129.201331.2207317476589573523.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On 11/29/2016 05:13 PM, David Miller wrote:
> From: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Date: Tue, 29 Nov 2016 16:43:20 -0800
> 
>> On 11/29/2016 04:38 PM, David Miller wrote:
>>> From: Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
>>> Date: Mon, 28 Nov 2016 10:46:45 +0100
>>>
>>>> This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
>>>> The platform seems to enter LPI on the Rx path too often while performing
>>>> relatively high TX transfer. This eventually break the link (both Tx and
>>>> Rx), and require to bring the interface down and up again to get the Rx
>>>> path working again.
>>>>
>>>> The root cause of this issue is not fully understood yet but disabling EEE
>>>> advertisement on the PHY prevent this feature to be negotiated.
>>>> With this change, the link is stable and reliable, with the expected
>>>> throughput performance.
>>>>
>>>> The patchset adds options in the generic phy driver to disable EEE
>>>> advertisement, through device tree. The way it is done is very similar
>>>> to the handling of the max-speed property.
>>>
>>> Patches 1-3 applied to net-next, thanks.
>>
>> Meh, there was a v4 submitted shortly after, and I objected to the whole
>> idea of using that kind of Device Tree properties to disable EEE, we can
>> send reverts though..
> 
> Sorry, I lost this in all the discussion, I can revert.

Yeah, I can understand why, these freaking PHYs tend to generate a lot
of noise and discussion...

> 
> Just send me a revert of the entire merge commit
> a152c91889556df17ca6d8ea134fb2cb4ac9f893 with a short
> description of why and I'll apply it.

OK, I will talk with Jerome first to make sure that we are in agreement
with the solution to deploy to fix the OdroidC2 problem first.

Thanks!
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [net-next] neigh: remove duplicate check for same neigh
From: 张胜举 @ 2016-11-30  1:27 UTC (permalink / raw)
  To: 'David Ahern', netdev
In-Reply-To: <d641245a-8442-9238-a507-6f94184354a0@cumulusnetworks.com>

> -----Original Message-----
> From: David Ahern [mailto:dsa@cumulusnetworks.com]
> Sent: Wednesday, November 30, 2016 12:23 AM
> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
> netdev@vger.kernel.org
> Subject: Re: [net-next] neigh: remove duplicate check for same neigh
> 
> On 11/29/16 1:22 AM, Zhang Shengju wrote:
> > @@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table
> *tbl, struct sk_buff *skb,
> >  	rcu_read_lock_bh();
> >  	nht = rcu_dereference_bh(tbl->nht);
> >
> > -	for (h = s_h; h < (1 << nht->hash_shift); h++) {
> > -		if (h > s_h)
> > -			s_idx = 0;
> > +	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
> >  		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
> >  		     n != NULL;
> > -		     n = rcu_dereference_bh(n->next)) {
> > -			if (!net_eq(dev_net(n->dev), net))
> > -				continue;
> > -			if (neigh_ifindex_filtered(n->dev, filter_idx))
> > +		     n = rcu_dereference_bh(n->next), idx++) {
> 
> incrementing idx here ...
> 
> > +			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
> >  				continue;
> > -			if (neigh_master_filtered(n->dev,
filter_master_idx))
> > +			if (neigh_dump_filtered(n->dev, filter_idx,
> > +						filter_master_idx))
> >  				continue;
> > -			if (idx < s_idx)
> > -				goto next;
> >  			if (neigh_fill_info(skb, n, NETLINK_CB(cb-
> >skb).portid,
> >  					    cb->nlh->nlmsg_seq,
> >  					    RTM_NEWNEIGH,
> 
> ... causes a missed entry when an error happens here
> 
> idx is only incremented when a neigh has been successfully added to the
skb.

If an error happens, we just goto 'out' and exit this loop, this will skip
the 'idx++'.  
So no missed entry, right?

> 
> 
> Your intention is to save a few cycles by making idx an absolute index
within
> the hash bucket and jumping to the next entry to be dumped without
> evaluating any filters per entry.
> 
> So why not keep it simple and do this:

This has less changes, but I preferred to add a new function to do the
filter logic, this is the same as your code  @rtnl_dump_ifinfo().

And for the 'idx', it should be increased when neigh entry is filter out or
is successfully added to the skb. Isn't it straight-forward to put it in for
loop?

> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c index
> 2ae929f9bd06..2e49a85e696a 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -2291,14 +2291,12 @@ static int neigh_dump_table(struct neigh_table
> *tbl, struct sk_buff *skb,
>                 for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx =
0;
>                      n != NULL;
>                      n = rcu_dereference_bh(n->next)) {
> -                       if (!net_eq(dev_net(n->dev), net))
> -                               continue;
> -                       if (neigh_ifindex_filtered(n->dev, filter_idx))
> -                               continue;
> -                       if (neigh_master_filtered(n->dev,
filter_master_idx))
> -                               continue;
>                         if (idx < s_idx)
>                                 goto next;
> +                       if (!net_eq(dev_net(n->dev), net) ||
> +                           neigh_ifindex_filtered(n->dev, filter_idx) ||
> +                           neigh_master_filtered(n->dev,
filter_master_idx))
> +                               goto next;
>                         if (neigh_fill_info(skb, n,
NETLINK_CB(cb->skb).portid,
>                                             cb->nlh->nlmsg_seq,
>                                             RTM_NEWNEIGH,

^ permalink raw reply

* Re: [PATCH v2 1/1] net: macb: ensure ordering write to re-enable RX smoothly
From: David Miller @ 2016-11-30  1:34 UTC (permalink / raw)
  To: zumeng.chen; +Cc: nicolas.ferre, netdev, linux-kernel
In-Reply-To: <1480341300-17384-1-git-send-email-zumeng.chen@windriver.com>

From: Zumeng Chen <zumeng.chen@windriver.com>
Date: Mon, 28 Nov 2016 21:55:00 +0800

> When a hardware issue happened as described by inline comments, the register
> write pattern looks like the following:
> 
> <write ~MACB_BIT(RE)>
> + wmb();
> <write MACB_BIT(RE)>
> 
> There might be a memory barrier between these two write operations, so add wmb
> to ensure an flip from 0 to 1 for NCR.
> 
> Signed-off-by: Zumeng Chen <zumeng.chen@windriver.com>
> ---
> 
> V2 changes:
> 
> Add the same wmb for at91ether as well based on reviewer's suggestion.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2] net: phy: Fix use after free in phy_detach()
From: David Miller @ 2016-11-30  1:35 UTC (permalink / raw)
  To: geert+renesas
  Cc: f.fainelli, josh.cartwright, nathan.sullivan, zach.brown,
	woojung.huh, netdev, linux-renesas-soc, linux-kernel
In-Reply-To: <1480342711-26407-1-git-send-email-geert+renesas@glider.be>

From: Geert Uytterhoeven <geert+renesas@glider.be>
Date: Mon, 28 Nov 2016 15:18:31 +0100

> If device_release_driver(&phydev->mdio.dev) is called, it releases all
> resources belonging to the PHY device. Hence the subsequent call to
> phy_led_triggers_unregister() will access already freed memory when
> unregistering the LEDs.
> 
> Move the call to phy_led_triggers_unregister() before the possible call
> to device_release_driver() to fix this.
> 
> Fixes: 2e0bc452f4721520 ("net: phy: leds: add support for led triggers on phy link state change")
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> This is v2 of "[RFC] net: phy: Fix double free in phy_detach()".
> 
> v2:
>   - Dropped RFC,
>   - Reworded, as commit a7dac9f9c1695d74 ("phy: fix error case of
>     phy_led_triggers_(un)register") fixed the double free, and thus the
>     warning I was seeing during "poweroff" on sh73a0/kzm9g,
>   - Verified use after free using CONFIG_DEBUG_DEVRES, log_devres = 1,
>     and additional debug code printing the address of
>     phy->phy_led_triggers. Adding poisoning of freed memory to
>     devres_log() will cause a crash.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2] mlxsw: switchib: add MLXSW_PCI dependency
From: David Miller @ 2016-11-30  1:36 UTC (permalink / raw)
  To: arnd; +Cc: jiri, idosch, vadimp, cera, eladr, netdev, linux-kernel
In-Reply-To: <20161128142622.2758614-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Mon, 28 Nov 2016 15:26:04 +0100

> The newly added switchib driver fails to link if MLXSW_PCI=m:
> 
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function^Cmlxsw_sib_module_exit':
> switchib.c:(.exit.text+0x8): undefined reference to `mlxsw_pci_driver_unregister'
> switchib.c:(.exit.text+0x10): undefined reference to `mlxsw_pci_driver_unregister'
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_init':
> switchib.c:(.init.text+0x28): undefined reference to `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x38): undefined reference to `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x48): undefined reference to `mlxsw_pci_driver_unregister'
> 
> The other two such sub-drivers have a dependency, so add the same one
> here. In theory we could allow this driver if MLXSW_PCI is disabled,
> but it's probably not worth it.
> 
> Fixes: d1ba52638456 ("mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver")
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> v2: add Fixes tag

Applied, thanks Arnd.

^ permalink raw reply

* Re: [PATCH net] net/dccp: fix use-after-free in dccp_invalid_packet
From: David Miller @ 2016-11-30  1:38 UTC (permalink / raw)
  To: eric.dumazet
  Cc: andreyknvl, gerrit, dccp, netdev, linux-kernel, dvyukov, kcc,
	edumazet, syzkaller
In-Reply-To: <1480343209.18162.50.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 06:26:49 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
> in dccp_invalid_packet() or risk use after free.
> 
> Bug found by Andrey Konovalov using syzkaller.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [patch net] sched: cls_flower: remove from hashtable only in case skip sw flag is not set
From: David Miller @ 2016-11-30  1:45 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jhs, idosch, eladr, ogerlitz, hadarh, amir
In-Reply-To: <1480344013-4812-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Mon, 28 Nov 2016 15:40:13 +0100

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Be symmetric to hashtable insert and remove filter from hashtable only
> in case skip sw flag is not set.
> 
> Fixes: e69985c67c33 ("net/sched: cls_flower: Introduce support in SKIP SW flag")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH 1/1] GSO: Reload iph after pskb_may_pull
From: David Miller @ 2016-11-30  1:46 UTC (permalink / raw)
  To: acme; +Cc: eric.dumazet, aduyck, andreyknvl, netdev
In-Reply-To: <20161128153658.GB4778@kernel.org>

From: Arnaldo Carvalho de Melo <acme@kernel.org>
Date: Mon, 28 Nov 2016 12:36:58 -0300

> As it may get stale and lead to use after free.
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Alexander Duyck <aduyck@mirantis.com>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID")
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Applied and queued up for -stable, thanks Arnaldo!

^ permalink raw reply

* Re: [patch net-next 0/4] mlxsw: couple of enhancements and fixes
From: David Miller @ 2016-11-30  1:49 UTC (permalink / raw)
  To: jiri; +Cc: netdev, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
In-Reply-To: <1480352486-18518-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Mon, 28 Nov 2016 18:01:22 +0100

> Couple of enhancements and fixes from Ido.

Series applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: remove excessive logging on MTU change
From: David Miller @ 2016-11-30  1:50 UTC (permalink / raw)
  To: vkuznets; +Cc: netdev, linux-kernel, kys, haiyangz
In-Reply-To: <20161128172544.2491-1-vkuznets@redhat.com>

From: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Mon, 28 Nov 2016 18:25:44 +0100

> When we change MTU or the number of channels on a netvsc device we get the
> following logged:
> 
>  hv_netvsc bf5edba8...: net device safe to remove
>  hv_netvsc: hv_netvsc channel opened successfully
>  hv_netvsc bf5edba8...: Send section size: 6144, Section count:2560
>  hv_netvsc bf5edba8...: Device MAC 00:15:5d:1e:91:12 link state up
> 
> This information is useful as debug at most.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: [net-next] neigh: remove duplicate check for same neigh
From: David Ahern @ 2016-11-30  1:56 UTC (permalink / raw)
  To: 张胜举, netdev
In-Reply-To: <000201d24aa8$f63c6500$e2b52f00$@cmss.chinamobile.com>

On 11/29/16 6:27 PM, 张胜举 wrote:
> This has less changes, but I preferred to add a new function to do the
> filter logic, this is the same as your code  @rtnl_dump_ifinfo().

I'd prefer it to stay in line as it is now. You can save the cycles of filtering on already viewed entries without an overhaul of this code.

^ permalink raw reply

* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: John Fastabend @ 2016-11-30  2:50 UTC (permalink / raw)
  To: Alexei Starovoitov, Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, john.r.fastabend, netdev,
	bblanco, brouer
In-Reply-To: <20161130003756.GD28238@ast-mbp.thefacebook.com>

On 16-11-29 04:37 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 12:11:33PM -0800, John Fastabend wrote:
>> virtio_net XDP support expects receive buffers to be contiguous.
>> If this is not the case we enable a slowpath to allow connectivity
>> to continue but at a significan performance overhead associated with
>> linearizing data. To make it painfully aware to users that XDP is
>> running in a degraded mode we throw an xdp buffer error.
>>
>> To linearize packets we allocate a page and copy the segments of
>> the data, including the header, into it. After this the page can be
>> handled by XDP code flow as normal.
>>
>> Then depending on the return code the page is either freed or sent
>> to the XDP xmit path. There is no attempt to optimize this path.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ...
>> +/* The conditions to enable XDP should preclude the underlying device from
>> + * sending packets across multiple buffers (num_buf > 1). However per spec
>> + * it does not appear to be illegal to do so but rather just against convention.
>> + * So in order to avoid making a system unresponsive the packets are pushed
>> + * into a page and the XDP program is run. This will be extremely slow and we
>> + * push a warning to the user to fix this as soon as possible. Fixing this may
>> + * require resolving the underlying hardware to determine why multiple buffers
>> + * are being received or simply loading the XDP program in the ingress stack
>> + * after the skb is built because there is no advantage to running it here
>> + * anymore.
>> + */
> ...
>>  		if (num_buf > 1) {
>>  			bpf_warn_invalid_xdp_buffer();
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is the warn once call. I made it a defined bpf warning so that we
can easily identify if needed and it can be used by other drivers as
well.

>> -			goto err_xdp;
>> +
>> +			/* linearize data for XDP */
>> +			xdp_page = xdp_linearize_page(rq, num_buf,
>> +						      page, offset, &len);
>> +			if (!xdp_page)
>> +				goto err_xdp;
> 
> in case when we're 'lucky' the performance will silently be bad.

Were you specifically worried about the alloc failing here and no
indication? I was thinking a statistic counter could be added as a
follow on series to catch this and other such cases in non XDP paths
if needed.

> Can we do warn_once here? so at least something in dmesg points out
> that performance is not as expected. Am I reading it correctly that
> you had to do a special kernel hack to trigger this situation and
> in all normal cases it's not the case?
> 

Correct the only way to produce this with upstream qemu and Linux is to
write a kernel hack to build these types of buffers. AFAIK I caught all
the cases where it could happen otherwise in the setup xdp prog call and
required user to configure device driver so they can not happen e.g.
disable LRO, set MTU < PAGE_SIZE, etc.

If I missed anything, I don't see it now, I would call it a bug and fix
it. Again AFAIK everything should be covered though. As Micheal pointed
out earlier although unexpected its not strictly against virtio spec to
build such packets so we should handle it at least in a degraded mode
even though it is not expected in any qemu cases.

.John

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: John Fastabend @ 2016-11-30  2:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130001504.GA28238@ast-mbp.thefacebook.com>

On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>> Registers new BPF program types which correspond to the LWT hooks:
>>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>
>> The separate program types are required to differentiate between the
>> capabilities each LWT hook allows:
>>
>>  * Programs attached to dst_input() or dst_output() are restricted and
>>    may only read the data of an skb. This prevent modification and
>>    possible invalidation of already validated packet headers on receive
>>    and the construction of illegal headers while the IP headers are
>>    still being assembled.
>>
>>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>>    content as well as prepending an L2 header via a newly introduced
>>    helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>>    after the IP header has been assembled completely.
>>
>> All BPF programs receive an skb with L3 headers attached and may return
>> one of the following error codes:
>>
>>  BPF_OK - Continue routing as per nexthop
>>  BPF_DROP - Drop skb and return EPERM
>>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>>                 (Only valid in lwtunnel_xmit() context)
>>
>> The return codes are binary compatible with their TC_ACT_
>> relatives to ease compatibility.
>>
>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ...
>> +#define LWT_BPF_MAX_HEADROOM 128
> 
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> 

hopefully not too off-topic but for XDP I would like to see this get
passed down with the program. It would be more generic and drivers could
configure the headroom on demand and more importantly verify that a
program pushing data is not going to fail at runtime.

.John

^ permalink raw reply

* Re: The ubufs->refcount maybe be subtracted twice when tun_get_user failed
From: Jason Wang @ 2016-11-30  2:53 UTC (permalink / raw)
  To: wangyunjian, mst@redhat.com, netdev@vger.kernel.org; +Cc: caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A5826@szxeml561-mbx.china.huawei.com>



On 2016年11月29日 21:27, wangyunjian wrote:
> Sorry, I didn't describe it clearly. In fact, the second subtraction happens in the function handle_tx,
> when tun_get_user fails and zcopy_used is ture. Fllowing the steps:

I get your meaning. Thanks for the reporting. Will post patches (since 
macvtap has similar issue).


>
> static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                                    void *msg_control, struct iov_iter *from,
>                                    int noblock) {
>             ...
>
>             if (zerocopy)
>                       err = zerocopy_sg_from_iter(skb, from);
>             else {
>                       err = skb_copy_datagram_from_iter(skb, 0, from, len);
>                       if (!err && msg_control) {
>                                struct ubuf_info *uarg = msg_control;
>                                uarg->callback(uarg, false);                       --> step 1, the ubufs->refcount is subtracted frist.
>                       }
>             }
>
>             if (err) {
>                       this_cpu_inc(tun->pcpu_stats->rx_dropped);
>                       kfree_skb(skb);
>                       return -EFAULT;
>             }
>
>             err = virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun));
>             if (err) {
>                       this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
>                       kfree_skb(skb);
>                       return -EINVAL;                                         -->step 2, return err.
>             }
> }
>
> static void handle_tx(struct vhost_net *net)
> {
> 	...
> 	/* TODO: Check specific error and bomb out unless ENOBUFS? */
> 	err = sock->ops->sendmsg(sock, &msg, len);
> 	if (unlikely(err < 0)) {
> 		if (zcopy_used) {
> 			vhost_net_ubuf_put(ubufs);                                        --> step 3, the ubufs->refcount will be subtracted twice, when sendmsg execution err.
> 			nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
> 				% UIO_MAXIOV;
> 		}
> 		vhost_discard_vq_desc(vq, 1);
> 		break;
> 	}
> 	...
> }
>
> -----Original Message-----

^ permalink raw reply

* [PATCH v4 3/5] net: asix: Fix AX88772x resume failures
From: ASIX_Allan [Office] @ 2016-11-30  3:03 UTC (permalink / raw)
  To: freddy-knRN6Y/kmf1NUHwG+Fw1Kw,
	Dean_Jenkins-nmGgyN9QBj3QT0dZR+AlfA,
	Mark_Craske-nmGgyN9QBj3QT0dZR+AlfA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	robert.foss-ZGY8ohtN/8qB+jHODAdFcQ,
	ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A,
	vpalatin-F7+t8E8rja9g9hUCZPvPmw,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	grundler-F7+t8E8rja9g9hUCZPvPmw,
	changchias-Re5JQEeQqe8AvxtiuMwx3w, allan-knRN6Y/kmf1NUHwG+Fw1Kw,
	andrew-g2DYL2Zd6BY, tremyfr-Re5JQEeQqe8AvxtiuMwx3w,
	colin.king-Z7WLFzj8eWMS+FvcfC7Uqw,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	vpalatin-hpIqsD4AKlfQT0dZR+AlfA

The change fixes AX88772x resume failure by
- Restore incorrect AX88772A PHY registers when resetting
- Need to stop MAC operation when suspending
- Need to restart MII when restoring PHY

Signed-off-by: Allan Chou <allan-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org>
Signed-off-by: Robert Foss <robert.foss-ZGY8ohtN/8qB+jHODAdFcQ@public.gmane.org>
Tested-by: Robert Foss <robert.foss-ZGY8ohtN/8qB+jHODAdFcQ@public.gmane.org>
Tested-by: Jon Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
Tested-by: Allan Chou <allan-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org>

---
drivers/net/usb/asix_devices.c | 47
+++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index ebeb730..083dc2e 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -35,6 +35,15 @@
 
 #define	PHY_MODE_RTL8211CL	0x000C
 
+#define AX88772A_PHY14H		0x14
+#define AX88772A_PHY14H_DEFAULT 0x442C
+
+#define AX88772A_PHY15H		0x15
+#define AX88772A_PHY15H_DEFAULT 0x03C8
+
+#define AX88772A_PHY16H		0x16
+#define AX88772A_PHY16H_DEFAULT 0x4044
+
 struct ax88172_int_data {
 	__le16 res1;
 	u8 link;
@@ -424,7 +433,7 @@ static int ax88772a_hw_reset(struct usbnet *dev, int
in_pm)  {
 	struct asix_data *data = (struct asix_data *)&dev->data;
 	int ret, embd_phy;
-	u16 rx_ctl;
+	u16 rx_ctl, phy14h, phy15h, phy16h;
 	u8 chipcode = 0;
 
 	ret = asix_write_gpio(dev, AX_GPIO_RSE, 5, in_pm); @@ -482,6 +491,32
@@ static int ax88772a_hw_reset(struct usbnet *dev, int in_pm)
 				   ret);
 			goto out;
 		}
+	} else if ((chipcode & AX_CHIPCODE_MASK) == AX_AX88772A_CHIPCODE) {
+		/* Check if the PHY registers have default settings */
+		phy14h = asix_mdio_read_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY14H);
+		phy15h = asix_mdio_read_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY15H);
+		phy16h = asix_mdio_read_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY16H);
+
+		netdev_dbg(dev->net,
+			   "772a_hw_reset: MR20=0x%x MR21=0x%x MR22=0x%x\n",
+			   phy14h, phy15h, phy16h);
+
+		/* Restore PHY registers default setting if not */
+		if (phy14h != AX88772A_PHY14H_DEFAULT)
+			asix_mdio_write_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY14H,
+					     AX88772A_PHY14H_DEFAULT);
+		if (phy15h != AX88772A_PHY15H_DEFAULT)
+			asix_mdio_write_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY15H,
+					     AX88772A_PHY15H_DEFAULT);
+		if (phy16h != AX88772A_PHY16H_DEFAULT)
+			asix_mdio_write_nopm(dev->net, dev->mii.phy_id,
+					     AX88772A_PHY16H,
+					     AX88772A_PHY16H_DEFAULT);
 	}
 
 	ret = asix_write_cmd(dev, AX_CMD_WRITE_IPG0, @@ -543,6 +578,15 @@
static const struct net_device_ops ax88772_netdev_ops = {  static void
ax88772_suspend(struct usbnet *dev)  {
 	struct asix_common_private *priv = dev->driver_priv;
+	u16 medium;
+
+	/* Stop MAC operation */
+	medium = asix_read_medium_status(dev, 1);
+	medium &= ~AX_MEDIUM_RE;
+	asix_write_medium_mode(dev, medium, 1);
+
+	netdev_dbg(dev->net, "ax88772_suspend: medium=0x%04x\n",
+		   asix_read_medium_status(dev, 1));
 
 	/* Preserve BMCR for restoring */
 	priv->presvd_phy_bmcr =
@@ -577,6 +621,7 @@ static void ax88772_restore_phy(struct usbnet *dev)
 		asix_mdio_write_nopm(dev->net, dev->mii.phy_id, MII_BMCR,
 				     priv->presvd_phy_bmcr);
 
+		mii_nway_restart(&dev->mii);
 		priv->presvd_phy_advertise = 0;
 		priv->presvd_phy_bmcr = 0;
 	}
--
2.7.4


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH iproute2 1/2] devlink: Add usage help for eswitch subcommand
From: Stephen Hemminger @ 2016-11-30  3:19 UTC (permalink / raw)
  To: Roi Dayan; +Cc: netdev, Or Gerlitz
In-Reply-To: <1480245663-814-2-git-send-email-roid@mellanox.com>

On Sun, 27 Nov 2016 13:21:02 +0200
Roi Dayan <roid@mellanox.com> wrote:

> Add missing usage help for devlink dev eswitch subcommand.
> 
> Signed-off-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>

Ok. Applied both, will show in next merge

^ permalink raw reply

* [net-next v2] neigh: remove duplicate check for same neigh
From: Zhang Shengju @ 2016-11-30  3:24 UTC (permalink / raw)
  To: netdev, dsa

Currently loop index 'idx' is used as the index in the neigh list of interest.
It's increased only when the neigh is dumped. It's not the absolute index in
the list. Because there is no info to record which neigh has already be scanned
by previous loop. This will cause the filtered out neighs to be scanned mulitple
times.

This patch make idx as the absolute index in the list, it will increase no matter
whether the neigh is filtered. This will prevent the above problem.

And this is in line with other dump functions.

v2:
 - take David Ahern's advice to do simple change

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 net/core/neighbour.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f..782dd86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2291,13 +2291,10 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
 		     n != NULL;
 		     n = rcu_dereference_bh(n->next)) {
-			if (!net_eq(dev_net(n->dev), net))
-				continue;
-			if (neigh_ifindex_filtered(n->dev, filter_idx))
-				continue;
-			if (neigh_master_filtered(n->dev, filter_master_idx))
-				continue;
-			if (idx < s_idx)
+			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
+				goto next;
+			if (neigh_ifindex_filtered(n->dev, filter_idx) ||
+			    neigh_master_filtered(n->dev, filter_master_idx))
 				goto next;
 			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
@@ -2332,9 +2329,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		if (h > s_h)
 			s_idx = 0;
 		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
-			if (pneigh_net(n) != net)
-				continue;
-			if (idx < s_idx)
+			if (idx < s_idx || pneigh_net(n) != net)
 				goto next;
 			if (pneigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH iproute2 V2 1/2] tc/cls_flower: Classify packet in ip tunnels
From: Stephen Hemminger @ 2016-11-30  3:26 UTC (permalink / raw)
  To: Amir Vadai
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161128125136.3393-2-amir@vadai.me>

The overall design is fine, just a couple nits with the code.

> diff --git a/tc/f_flower.c b/tc/f_flower.c
> index 2d31d1aa832d..1cf0750b5b83 100644
> --- a/tc/f_flower.c
> +++ b/tc/f_flower.c

>  
> +static int flower_parse_key_id(char *str, int type, struct nlmsghdr *n)

str is not modified, therefore use: const char *str

> +{
> +	int ret;
> +	__be32 key_id;
> +
> +	ret = get_be32(&key_id, str, 10);
> +	if (ret)
> +		return -1;

Traditionally netlink attributes are in host order, why was flower
chosen to be different?

> +
> +	addattr32(n, MAX_MSG, type, key_id);
> +
> +	return 0;


Why lose the return value here?  Why not:

	ret = get_be32(&key_id, str, 10);
	if (!ret)
		addattr32(n, MAX_MSG, type, key_id);

> +}
> +
>  static int flower_parse_opt(struct filter_util *qu, char *handle,
>  			    int argc, char **argv, struct nlmsghdr *n)
>  {
> @@ -339,6 +359,38 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
>  				fprintf(stderr, "Illegal \"src_port\"\n");
>  				return -1;
>  			}
> +		} else if (matches(*argv, "enc_dst_ip") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_ip_addr(*argv, 0,
> +						   TCA_FLOWER_KEY_ENC_IPV4_DST,
> +						   TCA_FLOWER_KEY_ENC_IPV4_DST_MASK,
> +						   TCA_FLOWER_KEY_ENC_IPV6_DST,
> +						   TCA_FLOWER_KEY_ENC_IPV6_DST_MASK,
> +						   n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_dst_ip\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "enc_src_ip") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_ip_addr(*argv, 0,
> +						   TCA_FLOWER_KEY_ENC_IPV4_SRC,
> +						   TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK,
> +						   TCA_FLOWER_KEY_ENC_IPV6_SRC,
> +						   TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK,
> +						   n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_src_ip\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "enc_key_id") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_key_id(*argv,
> +						  TCA_FLOWER_KEY_ENC_KEY_ID, n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_key_id\"\n");
> +				return -1;
> +			}
>  		} else if (matches(*argv, "action") == 0) {
>  			NEXT_ARG();
>  			ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
> @@ -509,6 +561,14 @@ static void flower_print_port(FILE *f, char *name, __u8 ip_proto,
>  	fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
>  }
>  
> +static void flower_print_key_id(FILE *f, char *name,
> +				struct rtattr *attr)

const char *name?


> +{
> +	if (!attr)
> +		return;
> +	fprintf(f, "\n  %s %d", name, ntohl(rta_getattr_u32(attr)));
> +}
> +

Why short circuit, just change the order:

	if (attr)
		fprintf(f, "\n  %s %s", name, ntohl(rta_getattr_u32(attr));

You might also want to introduce rta_getattr_be32()

Please change, retest and resubmit both patches.

^ permalink raw reply

* Re: [PATCH net-next 5/5] udp: add recvmmsg implementation
From: Hannes Frederic Sowa @ 2016-11-30  3:47 UTC (permalink / raw)
  To: David Miller; +Cc: pabeni, netdev, edumazet, brouer, sd
In-Reply-To: <20161129.192233.392359849404330043.davem@davemloft.net>

Hello,

On Wed, Nov 30, 2016, at 01:22, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 25 Nov 2016 18:09:00 +0100
> 
> > During review we discussed on how to handle major errors in the kernel:
> > 
> > The old code and the new code still can report back success even though
> > the kernel got back an EFAULT while copying from kernel space to user
> > space (due to bad pointers).
> > 
> > I favor that we drop all packets (also the already received batches) in
> > this case and let the code report -EFAULT and increase sk_drops for all
> > dropped packets from the queue.
> > 
> > Currently sk_err is set so the next syscall would get an -EFAULT, which
> > seems very bad and can also be overwritten by incoming icmp packets, so
> > we never get a notification that we actually had a bad pointer somewhere
> > in the mmsghdr. Also delivering -EFAULT on the follow-up syscalls really
> > will make people confused that use strace.
> > 
> > If people would like to know the amount of packets dropped we can make
> > sk_drops readable by an getsockopt.
> > 
> > Thoughts?
> > 
> > Unfortunately the interface doesn't allow for better error handling.
> 
> I think this is a major problem.
> 
> If, as a side effect of batch dequeueing the SKBs from the socket,
> you cannot stop properly mid-transfer if an error occurs, well then
> you simply cannot batch like that.
> 
> You have to stop the exact byte where an error occurs mid-stream,
> return the successful amount of bytes transferred, and then return
> the error on the next recvmmsg call.
> 
> There is no other sane error reporting strategy.

Actually I think there is no sane error handling strategy at all.

SIGSEGV and EFAULT should be delivered reliable in my opinion and all
the details become very difficult suddenly.

E.g. if we recvmmsg with -EFAULT and we try to deliver the fault on the
following socket call, I am pretty certain most programs don't bother
with close() return values, so the application might simply ignore it.
Also -EFAULT is not in our repository for error codes to return.

In case of UDP we can simply drop the packets and I would be okay with
that (in some cases we actually guarantee correctly ordered packets,
even for UDP, so we can't simply queue those packets back).

Also I would very much prefer ptrace/gdb to show me the syscall where
the memory management fault happened and not the next one.

> If I get 4 frames, and the kernel can successfully copy the first
> three and get an -EFAULT on the 4th.  Dammit you better tell the
> application this so it can properly process the first 3 packets and
> then determine how it is going to error out and recover for the 4th
> one.
> 
> If we need to add prioritized sk_err stuff, or another value like
> "sk_app_err" to handle the ICMP vs. -EFAULT issue, so be it.
> 
> I know what you guys are thinking, in that you can't figure out a
> way to avoid the transactional overhead if it is necessary to
> "put back" some SKBs if one of them in the batch gets a fault.

I prefer correctness over performance all the time. :)

> That's too bad, we need a proper implementation and proper error
> reporting.  Those performance numbers are useless if we effectively
> lose error notifications.

We have those problems right now and besides deprecating the syscalls I
have no idea how to fix this reliably and would probably need a lot of
changes (besides the sk_app_err solution, which I don't really favor at
all).

The syscall should have been designed in a way that the struct mmsghdr
-> msg_len would be ssize_t, so we could return error codes per fragment
and test before starting the batch that we have proper memory, so we
don't fail in the management code. :(

Bye,
Hannes

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa @ 2016-11-30  3:52 UTC (permalink / raw)
  To: David Lebrun, netdev
In-Reply-To: <1480439718-18019-1-git-send-email-david.lebrun@uclouvain.be>

Hi,

On Tue, Nov 29, 2016, at 18:15, David Lebrun wrote:
> When multiple nexthops are available for a given route, the routing
> engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting
> value
> indexes the nexthop to select. This method causes issues when a new
> nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to
> another
> nexthop.
> 
> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate K slices (or intervals) for each
> route with multiple nexthops. The nexthops are randomly assigned to those
> slices, in a uniform manner. The number K is configurable through a
> sysctl
> net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> nexthop, the algorithm takes the flow hash and computes an index which is
> the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> simple binary AND operation (idx = hash & (K - 1)). The resulting index
> references the selected nexthop. The lookup time complexity is thus O(1).
> 
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
> 
> The number of slices for a route also fixes the maximum number of
> nexthops
> possible for that route.

In the worst case this causes 2GB (order 19) allocations (x == 31) to
happen in GFP_ATOMIC (due to write lock) context and could cause update
failures to the routing table due to fragmentation. Are you sure the
upper limit of 31 is reasonable? I would very much prefer an upper limit
of below or equal 25 for x to stay within the bounds of the slab
allocators (which is still a lot and probably causes errors!).
Unfortunately because of the nature of the sysctl you can't really
create its own cache for it. :/

Also by design, one day this should all be RCU and having that much data
outstanding worries me a bit during routing table mutation.

I am a fan of consistent hashing but I am not so sure if it belongs into
a generic ECMP implementation or into its own ipvs or netfilter module
where you specifically know how much memory to burn for it.

Also please convert the sysctl to a netlink attribute if you pursue this
because if I change the sysctl while my quagga is hammering the routing
table I would like to know which nodes allocate what amount of memory.

Bye,
Hannes

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Tom Herbert @ 2016-11-30  4:04 UTC (permalink / raw)
  To: David Lebrun; +Cc: Linux Kernel Network Developers
In-Reply-To: <1480439718-18019-1-git-send-email-david.lebrun@uclouvain.be>

On Tue, Nov 29, 2016 at 9:15 AM, David Lebrun <david.lebrun@uclouvain.be> wrote:
> When multiple nexthops are available for a given route, the routing engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting value
> indexes the nexthop to select. This method causes issues when a new nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to another
> nexthop.
>
This is a lot of code to make ECMP work better. Can you be more
specific as to what the "issues" are? Assuming this is just the
transient packet reorder that happens in one link flap I am wondering
if this complexity is justified.

Tom

> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate K slices (or intervals) for each
> route with multiple nexthops. The nexthops are randomly assigned to those
> slices, in a uniform manner. The number K is configurable through a sysctl
> net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> nexthop, the algorithm takes the flow hash and computes an index which is
> the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> simple binary AND operation (idx = hash & (K - 1)). The resulting index
> references the selected nexthop. The lookup time complexity is thus O(1).
>
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
>
> The number of slices for a route also fixes the maximum number of nexthops
> possible for that route.
>
> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
> ---
>  include/net/ip6_fib.h    |  25 +++++++
>  include/net/netns/ipv6.h |   1 +
>  net/ipv6/ip6_fib.c       | 174 ++++++++++++++++++++++++++++++++++++++++++++++-
>  net/ipv6/route.c         |  58 ++++++++--------
>  4 files changed, 229 insertions(+), 29 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index a74e2aa..29594cc 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -93,6 +93,30 @@ struct rt6key {
>
>  struct fib6_table;
>
> +/* Multipath nexthop.
> + * @nh is the actual nexthop
> + * @nslices is the number of slices assigned to this nexthop
> + */
> +struct rt6_multi_nh {
> +       struct rt6_info *nh;
> +       unsigned int    nslices;
> +};
> +
> +/* Multipath routes map.
> + * @nhs is an array of available nexthops
> + * @size is the number of elements in @nhs
> + * @nslices is the number of slices and the number of elements in @index
> + *         and is always in the form 2^x to prevent using a division for
> + *         the computation of the modulo.
> + * @index is an array mapping a slice index to a nexthop index in @nhs
> + */
> +struct rt6_multi_map {
> +       struct rt6_multi_nh     *nhs;
> +       unsigned int            size;
> +       unsigned int            nslices;
> +       __u8                    *index;
> +};
> +
>  struct rt6_info {
>         struct dst_entry                dst;
>
> @@ -113,6 +137,7 @@ struct rt6_info {
>          */
>         struct list_head                rt6i_siblings;
>         unsigned int                    rt6i_nsiblings;
> +       struct rt6_multi_map            *rt6i_nh_map;
>
>         atomic_t                        rt6i_ref;
>
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index de7745e..4967846 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -36,6 +36,7 @@ struct netns_sysctl_ipv6 {
>         int idgen_retries;
>         int idgen_delay;
>         int flowlabel_state_ranges;
> +       int ip6_rt_ecmp_slices;
>  };
>
>  struct netns_ipv6 {
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index ef54852..b6b1895 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -727,6 +727,166 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn,
>         }
>  }
>
> +static void fib6_mp_free(struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = rt->rt6i_nh_map;
> +       struct rt6_info *sibling;
> +
> +       if (nh_map) {
> +               list_for_each_entry(sibling, &rt->rt6i_siblings,
> +                                   rt6i_siblings) {
> +                       sibling->rt6i_nh_map = NULL;
> +               }
> +
> +               rt->rt6i_nh_map = NULL;
> +
> +               kfree(nh_map->nhs);
> +               kfree(nh_map->index);
> +               kfree(nh_map);
> +       }
> +}
> +
> +static int fib6_mp_extend(struct net *net, struct rt6_info *sref,
> +                         struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
> +       struct rt6_multi_nh *tmp_nhs, *old_nhs;
> +       unsigned int kslices;
> +       unsigned int i, j;
> +
> +       if (!nh_map) {
> +               __u8 *index;
> +
> +               nh_map = kmalloc(sizeof(*nh_map), GFP_ATOMIC);
> +               if (!nh_map)
> +                       return -ENOMEM;
> +
> +               nh_map->size = 1;
> +               nh_map->nslices = (1 << net->ipv6.sysctl.ip6_rt_ecmp_slices);
> +
> +               tmp_nhs = kmalloc(sizeof(*tmp_nhs), GFP_ATOMIC);
> +               if (!tmp_nhs) {
> +                       kfree(nh_map);
> +                       return -ENOMEM;
> +               }
> +
> +               tmp_nhs[0].nh = sref;
> +               tmp_nhs[0].nslices = nh_map->nslices;
> +               nh_map->nhs = tmp_nhs;
> +
> +               index = kmalloc_array(nh_map->nslices, sizeof(*index),
> +                                     GFP_ATOMIC);
> +               if (!index) {
> +                       kfree(nh_map->nhs);
> +                       kfree(nh_map);
> +                       return -ENOMEM;
> +               }
> +
> +               for (i = 0; i < nh_map->nslices; i++)
> +                       index[i] = 0;
> +
> +               nh_map->index = index;
> +               sref->rt6i_nh_map = nh_map;
> +       }
> +
> +       if (nh_map->size == nh_map->nslices)
> +               return -ENOBUFS;
> +
> +       nh_map->size++;
> +
> +       tmp_nhs = kmalloc_array(nh_map->size, sizeof(*tmp_nhs), GFP_ATOMIC);
> +       if (!tmp_nhs)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nh_map->size - 1; i++)
> +               tmp_nhs[i] = nh_map->nhs[i];
> +
> +       tmp_nhs[nh_map->size - 1].nh = rt;
> +       tmp_nhs[nh_map->size - 1].nslices = 0;
> +
> +       kslices = nh_map->nslices / nh_map->size;
> +
> +       /* Loop over nexthops and steal a random slice until we have kslices for
> +        * the new nexthop. This algorithm ensures that flows are taken as
> +        * equally as possible from current nexthops.
> +        *
> +        * At each iteration, @j is the index in tmp_nhs of the nexthop
> +        * to steal from and @idx is the index of the slice to steal
> +        * among @j's slices.
> +        */
> +       for (i = 0, j = 0; i < kslices; i++) {
> +               unsigned int idx, k = 0;
> +
> +               if (tmp_nhs[j].nslices == 1)
> +                       continue;
> +
> +               idx = (prandom_u32() % tmp_nhs[j].nslices) + 1;
> +               do {
> +                       if (nh_map->index[k++] == j)
> +                               idx--;
> +               } while (idx);
> +
> +               nh_map->index[k - 1] = nh_map->size - 1;
> +               tmp_nhs[nh_map->size - 1].nslices++;
> +               tmp_nhs[j].nslices--;
> +
> +               j = (j + 1) % (nh_map->size - 1);
> +       }
> +
> +       WARN_ON(tmp_nhs[nh_map->size - 1].nslices != kslices);
> +
> +       old_nhs = nh_map->nhs;
> +       nh_map->nhs = tmp_nhs;
> +       kfree(old_nhs);
> +
> +       rt->rt6i_nh_map = nh_map;
> +
> +       return 0;
> +}
> +
> +static int fib6_mp_shrink(struct rt6_info *sref, struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
> +       struct rt6_multi_nh *tmp_nhs, *old_nhs;
> +       unsigned int i, j, idx = 0;
> +
> +       tmp_nhs = kmalloc_array(nh_map->size - 1, sizeof(*tmp_nhs), GFP_ATOMIC);
> +       if (!tmp_nhs)
> +               return -ENOMEM;
> +
> +       for (i = 0, j = 0; i < nh_map->size; i++) {
> +               if (nh_map->nhs[i].nh != rt)
> +                       tmp_nhs[j++] = nh_map->nhs[i];
> +               else
> +                       idx = i;
> +       }
> +
> +       nh_map->size--;
> +       old_nhs = nh_map->nhs;
> +       nh_map->nhs = tmp_nhs;
> +       kfree(old_nhs);
> +
> +       rt->rt6i_nh_map = NULL;
> +
> +       /* Update the slices index. For each slice mapping to the removed
> +        * nexthop, assign another random nexthop. For each slice mapping
> +        * to a nexthop that was after the removed nexthop, decrement the
> +        * nexthop index to reflect the shift. Nexthops that were before
> +        * the removed nexthop are unaffected.
> +        */
> +       for (i = 0; i < nh_map->nslices; i++) {
> +               if (nh_map->index[i] == idx) {
> +                       j = prandom_u32() % nh_map->size;
> +                       nh_map->index[i] = j;
> +                       nh_map->nhs[j].nslices++;
> +               } else if (nh_map->index[i] > idx) {
> +                       nh_map->index[i]--;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   *     Insert routing information in a node.
>   */
> @@ -837,6 +997,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                         }
>                         sibling = sibling->dst.rt6_next;
>                 }
> +               err = fib6_mp_extend(info->nl_net, sibling, rt);
> +               if (err < 0)
> +                       return err;
>                 /* For each sibling in the list, increment the counter of
>                  * siblings. BUG() if counters does not match, list of siblings
>                  * is broken!
> @@ -900,6 +1063,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                         fn->fn_flags |= RTN_RTINFO;
>                 }
>                 nsiblings = iter->rt6i_nsiblings;
> +               if (nsiblings)
> +                       fib6_mp_free(iter);
>                 fib6_purge_rt(iter, fn, info->nl_net);
>                 rt6_release(iter);
>
> @@ -1407,13 +1572,20 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
>
>         /* Remove this entry from other siblings */
>         if (rt->rt6i_nsiblings) {
> -               struct rt6_info *sibling, *next_sibling;
> +               struct rt6_info *sibling, *next_sibling, *sref;
> +
> +               sref = list_first_entry(&rt->rt6i_siblings, struct rt6_info,
> +                                       rt6i_siblings);
>
>                 list_for_each_entry_safe(sibling, next_sibling,
>                                          &rt->rt6i_siblings, rt6i_siblings)
>                         sibling->rt6i_nsiblings--;
>                 rt->rt6i_nsiblings = 0;
>                 list_del_init(&rt->rt6i_siblings);
> +               if (!sref->rt6i_nsiblings)
> +                       fib6_mp_free(sref);
> +               else
> +                       fib6_mp_shrink(sref, rt);
>         }
>
>         /* Adjust walkers */
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b317bb1..e9f13e0 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -427,39 +427,27 @@ static bool rt6_check_expired(const struct rt6_info *rt)
>         return false;
>  }
>
> -/* Multipath route selection:
> - *   Hash based function using packet header and flowlabel.
> - * Adapted from fib_info_hashfn()
> - */
> -static int rt6_info_hash_nhsfn(unsigned int candidate_count,
> -                              const struct flowi6 *fl6)
> -{
> -       return get_hash_from_flowi6(fl6) % candidate_count;
> -}
> -
>  static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
>                                              struct flowi6 *fl6, int oif,
>                                              int strict)
>  {
> -       struct rt6_info *sibling, *next_sibling;
> -       int route_choosen;
> +       struct rt6_multi_map *nh_map = match->rt6i_nh_map;
> +       __u32 hash = get_hash_from_flowi6(fl6);
> +       unsigned int slice, idx;
> +       struct rt6_info *res;
>
> -       route_choosen = rt6_info_hash_nhsfn(match->rt6i_nsiblings + 1, fl6);
> -       /* Don't change the route, if route_choosen == 0
> -        * (siblings does not include ourself)
> -        */
> -       if (route_choosen)
> -               list_for_each_entry_safe(sibling, next_sibling,
> -                               &match->rt6i_siblings, rt6i_siblings) {
> -                       route_choosen--;
> -                       if (route_choosen == 0) {
> -                               if (rt6_score_route(sibling, oif, strict) < 0)
> -                                       break;
> -                               match = sibling;
> -                               break;
> -                       }
> -               }
> -       return match;
> +       WARN_ON(!nh_map);
> +       if (!nh_map)
> +               return match;
> +
> +       slice = hash & (nh_map->nslices - 1);
> +       idx = nh_map->index[slice];
> +       res = nh_map->nhs[idx].nh;
> +
> +       if (rt6_score_route(res, oif, strict) < 0)
> +               res = match;
> +
> +       return res;
>  }
>
>  /*
> @@ -3550,6 +3538,9 @@ int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
>         return 0;
>  }
>
> +static int one = 1;
> +static int thirtyone = 31;
> +
>  struct ctl_table ipv6_route_table_template[] = {
>         {
>                 .procname       =       "flush",
> @@ -3621,6 +3612,15 @@ struct ctl_table ipv6_route_table_template[] = {
>                 .mode           =       0644,
>                 .proc_handler   =       proc_dointvec_ms_jiffies,
>         },
> +       {
> +               .procname       =       "ecmp_slices",
> +               .data           =       &init_net.ipv6.sysctl.ip6_rt_ecmp_slices,
> +               .maxlen         =       sizeof(int),
> +               .mode           =       0644,
> +               .proc_handler   =       proc_dointvec_minmax,
> +               .extra1         =       &one,
> +               .extra2         =       &thirtyone,
> +       },
>         { }
>  };
>
> @@ -3644,6 +3644,7 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
>                 table[7].data = &net->ipv6.sysctl.ip6_rt_mtu_expires;
>                 table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
>                 table[9].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
> +               table[10].data = &net->ipv6.sysctl.ip6_rt_ecmp_slices;
>
>                 /* Don't export sysctls to unprivileged users */
>                 if (net->user_ns != &init_user_ns)
> @@ -3707,6 +3708,7 @@ static int __net_init ip6_route_net_init(struct net *net)
>         net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
>         net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
>         net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
> +       net->ipv6.sysctl.ip6_rt_ecmp_slices = 5;
>
>         net->ipv6.ip6_rt_gc_expire = 30*HZ;
>
> --
> 2.7.3
>

^ permalink raw reply

* Re: [PATCH net 00/16] net: fix fixed-link phydev leaks
From: David Miller @ 2016-11-30  4:17 UTC (permalink / raw)
  To: johan
  Cc: vbridger, f.fainelli, fugang.duan, pantelis.antoniou, vbordug,
	claudiu.manoil, leoli, thomas.petazzoni, nbd, blogic,
	matthias.bgg, sergei.shtylyov, lars.persson, mugunthanvnm,
	grygorii.strashko, robh+dt, frowand.list, andrew, vivien.didelot,
	netdev, nios2-dev, linux-kernel, linuxppc-dev, linux-mediatek,
	linux-renesas-soc, linux-omap, devicetree
In-Reply-To: <1480357509-28074-1-git-send-email-johan@kernel.org>

From: Johan Hovold <johan@kernel.org>
Date: Mon, 28 Nov 2016 19:24:53 +0100

> This series fixes failures to deregister and free fixed-link phydevs
> that have been registered using the of_phy_register_fixed_link()
> interface.
> 
> All but two drivers currently fail to do this and this series fixes most
> of them with the exception of a staging driver and the stmmac drivers
> which will be fixed by follow-on patches.
> 
> Included are also a couple of fixes for related of-node leaks.
> 
> Note that all patches except the of_mdio one have been compile-tested
> only.
> 
> Also note that the series is against net due to dependencies not yet in
> net-next.

Series applied, thanks Johan.

^ permalink raw reply

* Re: [PATCH] net: macb: Write only necessary bits in NCR in macb reset
From: Harini Katakam @ 2016-11-30  4:17 UTC (permalink / raw)
  To: David Miller
  Cc: Harini Katakam, Nicolas Ferre, netdev,
	linux-kernel@vger.kernel.org, michals@xilinx.com
In-Reply-To: <20161129.190526.1414678491662084583.davem@davemloft.net>

Hi David,

On Wed, Nov 30, 2016 at 5:35 AM, David Miller <davem@davemloft.net> wrote:
> From: Harini Katakam <harini.katakam@xilinx.com>
> Date: Mon, 28 Nov 2016 14:53:49 +0530
>
>> In macb_reset_hw, use read-modify-write to disable RX and TX.
>> This way exiting settings and reserved bits wont be disturbed.
>> Use the same method for clearing statistics as well.
>>
>> Signed-off-by: Harini Katakam <harinik@xilinx.com>
>
> This doesn't make much sense to me.
>
> Consider the two callers of this function.
>
> macb_init_hw() is going to do a non-masking write to the NCR
> register:
>
>         /* Enable TX and RX */
>         macb_writel(bp, NCR, MACB_BIT(RE) | MACB_BIT(TE) | MACB_BIT(MPE));
>
> So obviously no other writable fields matter at all for programming
> the chip properly, otherwise macb_init_hw() would "or" in the bits
> after a read of NCR.  But that's not what this code does, it
> writes "RE | TE | MPE" directly.
>
> And the other caller is macb_close() which is shutting down the
> chip so can zero out all the other bits and it can't possibly
> matter, also due to the assertion above about macb_init_hw()
> showing that only the RE, TE, and MPE bits matter for proper
> functioning of the chip.
>
> You haven't shown a issue caused by the way the code works now, so
> this patch isn't fixing a bug.  In fact, the "bit preserving" would
> even be misleading to someone reading the code.  They will ask
> themselves what bits need to be preserved, and as shown above none of
> them need to be.
>
> I'm not applying this, sorry.
>

Thanks for the detailed analysis.
I noticed an issue with respect to management port enable bit
when working on the patch series for macb mdio driver for
multi MAC - multi PHY access via same mdio bus.
MPE bit of the MAC (whose phy management register
is used) was being reset. But that is a separate
series still in review.
You are right, there is no existing bug that this
patch addresses. I will come back to this later if
required.

Regards,
Harini

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox