Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [bpf PATCH v4 3/4] bpf: sockhash fix omitted bucket lock in sock_close
From: Martin KaFai Lau @ 2018-06-25 16:40 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180625153417.3057.19275.stgit@john-Precision-Tower-5810>

On Mon, Jun 25, 2018 at 08:34:17AM -0700, John Fastabend wrote:
> First in tcp_close, reduce scope of sk_callback_lock() the lock is
> only needed for protecting maps list the ingress and cork
> lists are protected by sock lock. Having the lock in wider scope is
> harmless but may confuse the reader who may infer it is in fact
> needed.
> 
> Next, in sock_hash_delete_elem() the pattern is as follows,
> 
>   sock_hash_delete_elem()
>      [...]
>      spin_lock(bucket_lock)
>      l = lookup_elem_raw()
>      if (l)
>         hlist_del_rcu()
>         write_lock(sk_callback_lock)
>          .... destroy psock ...
>         write_unlock(sk_callback_lock)
>      spin_unlock(bucket_lock)
> 
> The ordering is necessary because we only know the {p}sock after
> dereferencing the hash table which we can't do unless we have the
> bucket lock held. Once we have the bucket lock and the psock element
> it is deleted from the hashmap to ensure any other path doing a lookup
> will fail. Finally, the refcnt is decremented and if zero the psock
> is destroyed.
> 
> In parallel with the above (or free'ing the map) a tcp close event
> may trigger tcp_close(). Which at the moment omits the bucket lock
> altogether (oops!) where the flow looks like this,
> 
>   bpf_tcp_close()
>      [...]
>      write_lock(sk_callback_lock)
>      for each psock->maps // list of maps this sock is part of
>          hlist_del_rcu(ref_hash_node);
>          .... destroy psock ...
>      write_unlock(sk_callback_lock)
> 
> Obviously, and demonstrated by syzbot, this is broken because
> we can have multiple threads deleting entries via hlist_del_rcu().
> 
> To fix this we might be tempted to wrap the hlist operation in a
> bucket lock but that would create a lock inversion problem. In
> summary to follow locking rules the psocks maps list needs the
> sk_callback_lock but we need the bucket lock to do the hlist_del_rcu.
> To resolve the lock inversion problem pop the head of the maps list
> repeatedly and remove the reference until no more are left. If a
> delete happens in parallel from the BPF API that is OK as well because
> it will do a similar action, lookup the lock in the map/hash, delete
> it from the map/hash, and dec the refcnt. We check for this case
> before doing a destroy on the psock to ensure we don't have two
> threads tearing down a psock. The new logic is as follows,
> 
>   bpf_tcp_close()
>   e = psock_map_pop(psock->maps) // done with sk_callback_lock
>   bucket_lock() // lock hash list bucket
>   l = lookup_elem_raw(head, hash, key, key_size);
>   if (l) {
>      //only get here if elmnt was not already removed
>      hlist_del_rcu()
>      ... destroy psock...
>   }
>   bucket_unlock()
> 
> And finally for all the above to work add missing sk_callback_lock
> around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
> delete and update may corrupt maps list. Then add RCU annotations and
> use rcu_dereference/rcu_assign_pointer to manage values relying on
> RCU so that the object is not free'd from sock_hash_free() while it
> is being referenced in bpf_tcp_close().
> 
> (As an aside the sk_callback_lock serves two purposes. The
>  first, is to update the sock callbacks sk_data_ready, sk_write_space,
>  etc. The second is to protect the psock 'maps' list. The 'maps' list
>  is used to (as shown above) to delete all map/hash references to a
>  sock when the sock is closed)
> 
> Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Peter Robinson @ 2018-06-25 16:41 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Eric Dumazet, netdev, linux-arm-kernel, labbott
In-Reply-To: <CALeDE9PBZWJBp8KB0mB4zoNXqscmzxWzz+LnuqRA-z4t1e9T8g@mail.gmail.com>

On Mon, Jun 25, 2018 at 2:39 PM, Peter Robinson <pbrobinson@gmail.com> wrote:
> Hi Daniel,
>
>> On 06/24/2018 11:24 AM, Peter Robinson wrote:
>>>>> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
>>>>> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
>>>>> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
>>>>> others, both LPAE/normal kernels.
>>
>> So this is arm32 right?
>
> Correct.
>
>>>>> I'm a bit out of my depth in this part of the kernel but I'm wondering
>>>>> if it's known, I couldn't find anything that looked obvious on a few
>>>>> mailing lists.
>>>>>
>>>>> Peter
>>>>
>>>> Hi Peter
>>>>
>>>> Could you provide symbolic information ?
>>>
>>> I passed in through scripts/decode_stacktrace.sh is that what you were after:
>>>
>>> [    8.673880] Internal error: Oops: a06 [#10] SMP ARM
>>> [    8.673949] ---[ end trace 049df4786ea3140a ]---
>>> [    8.678754] Modules linked in:
>>> [    8.678766] CPU: 1 PID: 206 Comm: systemd-udevd Tainted: G      D
>>>         4.18.0-0.rc1.git0.1.fc29.armv7hl+lpae #1
>>> [    8.678769] Hardware name: Allwinner sun8i Family
>>> [    8.678781] PC is at sk_filter_trim_cap ()
>>> [    8.678790] LR is at   (null)
>>> [    8.709463] pc : lr : psr: 60000013 ()
>>> [    8.715722] sp : c996bd60  ip : 00000000  fp : 00000000
>>> [    8.720939] r10: ee79dc00  r9 : c12c9f80  r8 : 00000000
>>> [    8.726157] r7 : 00000000  r6 : 00000001  r5 : f1648000  r4 : 00000000
>>> [    8.732674] r3 : 00000007  r2 : 00000000  r1 : 00000000  r0 : 00000000
>>> [    8.739193] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>>> [    8.746318] Control: 30c5387d  Table: 6e7bc880  DAC: ffe75ece
>>> [    8.752055] Process systemd-udevd (pid: 206, stack limit = 0x(ptrval))
>>> [    8.758574] Stack: (0xc996bd60 to 0xc996c000)
>>
>> Do you have BPF JIT enabled or disabled? Does it happen with disabled?
>
> Enabled, I can test with it disabled, BPF configs bits are:
> CONFIG_BPF_EVENTS=y
> # CONFIG_BPFILTER is not set
> CONFIG_BPF_JIT_ALWAYS_ON=y
> CONFIG_BPF_JIT=y
> CONFIG_BPF_STREAM_PARSER=y
> CONFIG_BPF_SYSCALL=y
> CONFIG_BPF=y
> CONFIG_CGROUP_BPF=y
> CONFIG_HAVE_EBPF_JIT=y
> CONFIG_IPV6_SEG6_BPF=y
> CONFIG_LWTUNNEL_BPF=y
> # CONFIG_NBPFAXI_DMA is not set
> CONFIG_NET_ACT_BPF=m
> CONFIG_NET_CLS_BPF=m
> CONFIG_NETFILTER_XT_MATCH_BPF=m
> # CONFIG_TEST_BPF is not set
>
>> I can see one bug, but your stack trace seems unrelated.
>>
>> Anyway, could you try with this?
>
> Build in process.
>
>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>> index 6e8b716..f6a62ae 100644
>> --- a/arch/arm/net/bpf_jit_32.c
>> +++ b/arch/arm/net/bpf_jit_32.c
>> @@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>                 /* there are 2 passes here */
>>                 bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>>
>> -       set_memory_ro((unsigned long)header, header->pages);
>> +       bpf_jit_binary_lock_ro(header);
>>         prog->bpf_func = (void *)ctx.target;
>>         prog->jited = 1;
>>         prog->jited_len = image_size;

So with that and the other fix there was no improvement, with those
and the BPF JIT disabled it works, I'm not sure if the two patches
have any effect with the JIT disabled though.

Will look at the other patches shortly, there's been some other issue
introduced between rc1 and rc2 which I have to work out before I can
test those though.

Peter

^ permalink raw reply

* Re: [PATCH] selftests: bpf: enable NET_SCHED
From: Song Liu @ 2018-06-25 16:49 UTC (permalink / raw)
  To: Anders Roxell
  Cc: Alexei Starovoitov, Daniel Borkmann, shuah, Networking, open list,
	linux-kselftest
In-Reply-To: <20180625145605.13726-1-anders.roxell@linaro.org>

On Mon, Jun 25, 2018 at 7:56 AM, Anders Roxell <anders.roxell@linaro.org> wrote:
> CONFIG_NET_SCHED wasn't enabled in arm64's defconfig only for x86.
> So bpf/test_tunnel.sh tests fails with:
> RTNETLINK answers: Operation not supported
> RTNETLINK answers: Operation not supported
> We have an error talking to the kernel, -1
> Enable NET_SCHED and more tests passes.
>
> Fixes: 3bce593ac06b ("selftests: bpf: config: add config fragments")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/testing/selftests/bpf/config | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index 1e0c547caf3c..7a6d92562dc6 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -6,6 +6,7 @@ CONFIG_TEST_BPF=m
>  CONFIG_CGROUP_BPF=y
>  CONFIG_NETDEVSIM=m
>  CONFIG_NET_CLS_ACT=y
> +CONFIG_NET_SCHED=y
>  CONFIG_NET_SCH_INGRESS=y
>  CONFIG_NET_IPIP=y
>  CONFIG_IPV6=y
> --
> 2.18.0
>

^ permalink raw reply

* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Sowmini Varadhan @ 2018-06-25 17:03 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: netdev, santosh.shilimkar, davem, rds-devel
In-Reply-To: <7f4f460079d3d78a18f7d759488048798e99c4db.1529922794.git.ka-cheong.poon@oracle.com>

On (06/25/18 03:38), Ka-Cheong Poon wrote:
> @@ -1105,8 +1105,27 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
>  			break;
>  
>  		case sizeof(*sin6): {
> -			ret = -EPROTONOSUPPORT;
> -			goto out;
> +			int addr_type;
                         :
                         :
> +			daddr = sin6->sin6_addr;
> +			dport = sin6->sin6_port;
> +			scope_id = sin6->sin6_scope_id;
> +			break;
>  		}

In rds_sendmsg, the scopeid passed to rds_conn_create_outgoing
may come from the msg_name (if msg_name is a link-local) or
may come from the rs_bound_scope_id (for connected socket, change
made in Patch 1 of the series). 

This sounds inconsistent.

If I bind to scopeid if1 and then send to fe80::1%if2 (without connect()), 
we'd create an rds_connection with dev_if set to if2. 
(first off, its a bit unexpected to be sending to fe80::1%if2 when you
are bound to a link-local on if1!)

But then, if we got back a response from fe80::1%if2, I think we would
not find a matching conn in rds_recv_incoming? 

And this is even more confusing because the fastpath in rds_sendmsg
does not take the bound_scope_id into consideration at all:
1213         if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
1214                 conn = rs->rs_conn;
1215         else {
1216                 conn = rds_conn_create_outgoing( /* .. */, scope_id)
so if I erroneously passed a msg_name on a connected rds socket, what
would happen? (see also question about rds_connect() itself, below)

Should we always use rs_bound_scope_id for creating the outgoing
rds_connection? (you may need something deterministic for this, 
like "if bound addr is linklocal, return error if daddr has a different
scopeid, else use the bound addr's scopeid", plus, "if bound addr is
not global, and daddr is link-local, we need a conn with the daddr's
scopeid")

Also, why is there no IPv6 support in rds_connect? 

(still looking through the rds-tcp changes, but wanted to get these
questions clarified first).

--Sowmini

^ permalink raw reply

* Re: [PATCH rdma-next 08/12] overflow.h: Add arithmetic shift helper
From: Jason Gunthorpe @ 2018-06-25 17:11 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Leon Romanovsky, Doug Ledford, Kees Cook, Leon Romanovsky,
	RDMA mailing list, Hadar Hen Zion, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev, linux-kernel
In-Reply-To: <CAKwiHFhgsyWYD+q+JFb2HJEphnjiiOp=o4Airv3MW031q2jx8w@mail.gmail.com>

On Mon, Jun 25, 2018 at 11:26:05AM +0200, Rasmus Villemoes wrote:

>    check_shift_overflow(a, s, d) {
>        unsigned _nbits = 8*sizeof(a);
>        typeof(a) _a = (a);
>        typeof(s) _s = (s);
>        typeof(d) _d = (d);
> 
>        *_d = ((u64)(_a) << (_s & (_nbits-1)));
>        _s >= _nbits || (_s > 0 && (_a >> (_nbits - _s -
>    is_signed_type(a))) != 0);
>    }

Those types are not quite right.. What about this?

    check_shift_overflow(a, s, d) ({
        unsigned int _nbits = 8*sizeof(d) - is_signed_type(d);
        typeof(d) _a = a;  // Shift is always performed on type 'd'
        typeof(s) _s = s;
        typeof(d) _d = d;
 
        *_d = (_a << (_s & (_nbits-1)));

	(((*_d) >> (_s & (_nbits-1)) != _a);
    })

And can we use mathamatcial invertability to prove no overlow and
bound _a ? As above.

Jason

^ permalink raw reply

* Re: [PATCH net-next] tcp: add SNMP counter for zero-window drops
From: Eric Dumazet @ 2018-06-25 17:26 UTC (permalink / raw)
  To: Yafang Shao, davem; +Cc: netdev
In-Reply-To: <1529848974-6384-1-git-send-email-laoar.shao@gmail.com>



On 06/24/2018 07:02 AM, Yafang Shao wrote:
> It will be helpful if we could display the drops due to zero window or no
> enough window space.
> So a new SNMP MIB entry is added to track this behavior.
> This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
> /proc/net/netstat in TcpExt line as TCPZeroWindowDrop.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---

Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [PATCH net] KEYS: DNS: fix parsing multiple options
From: Eric Biggers @ 2018-06-25 17:37 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, David Howells, keyrings, Wang Lei, Eric Biggers
In-Reply-To: <8195.1528992870@warthog.procyon.org.uk>

On Thu, Jun 14, 2018 at 05:14:30PM +0100, David Howells wrote:
> The fix seems to work, but the use of kstrtoul():
> 
> 	ret = kstrtoul(eq, 10, &derrno);
> 
> is incorrect since the buffer can't been modified to block out the next
> argument if there is one, so the following fails:
> 
> 	perl -e 'print "#dnserror=1#", "\x00" x 1' |
> 	keyctl padd dns_resolver desc @s
> 
> (Note this is preexisting and nothing to do with your patch).
> 
> I'm not sure how best to handle this.
> 
> Anyway, Dave, can you take Eric's patch into the net tree with:
> 
> 	Acked-by: David Howells <dhowells@redhat.com>
> 
> David

It could be handled by copying the option value to a temporary buffer.
Anyway, that can be a separate fix...

David (Miller), are you planning to take this through -net?

Thanks!

- Eric

^ permalink raw reply

* Re: [PATCH 00/14] ARM: davinci: step towards removing at24_platform_data
From: Andrew Lunn @ 2018-06-25 17:40 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Rob Herring, Grygorii Strashko, David Lechner, Ivan Khoronzhuk,
	Kevin Hilman, Greg Kroah-Hartman, Sekhar Nori, Russell King,
	linux-kernel, Bartosz Golaszewski, Lukas Wunner,
	Srinivas Kandagatla, linux-arm-kernel, netdev, Florian Fainelli,
	linux-omap, David S . Miller, Dan Carpenter
In-Reply-To: <20180625155025.12567-1-brgl@bgdev.pl>

On Mon, Jun 25, 2018 at 05:50:11PM +0200, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> 
> Since I took over maintainership of the at24 driver I've been working
> towards removing at24_platform_data in favor for device properties.
> 
> DaVinci is the only platform that's still using it - all other users
> have already been converted.
> 
> One of the obstacles in case of DaVinci is removing the setup() callback
> from the pdata struct, the only user of which are some davinci boards.

Hi Bartosz

Nice code.

I've got a platform i want to add sometime soon using the at24. I
noticed you doing the cleanup, so i avoided the setup() call. But
using it would of helped...

My platform is x86 based, so no device tree. I instantiate a number of
AT24 devices from a platform driver, and then add nvmem cells so i can
access data in these eeproms. This new code will make this simpler.

> The only board that's still using this callback is now mityomapl138.
> Unfortunately it stores more info in EEPROM than just the MAC address
> and will require some more work. Unfortunately I don't have access
> to this board so I can't test any actual solutions on a live hardware.

Depending on what i find in the EEPROM, i need to instantiate other
i2c devices. So i have the problem of knowing when the EEPROM has
actually probed and i can use the nvmem API to retrieve the contents.

What i have done so far, is registers a bus notifier on i2c_bus_type,
and look for BUS_NOTIFY_BOUND_DRIVER. I can then check if the i2c
client in the notifier is the at24 client. But when i then add more
i2c clients from inside the notifier i get lockdep splats. They are
false positives, but it does suggest it is not a good idea to do this.

So it would be good to have some sort of recommended alternative to
the setup() callback. Ideally it would be specific to a particular
at24, and safe to call other i2c functions from.

Do you have any ideas?

   Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Ka-Cheong Poon @ 2018-06-25 17:43 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: netdev, santosh.shilimkar, davem, rds-devel
In-Reply-To: <20180625170317.GA28578@oracle.com>

On 06/26/2018 01:03 AM, Sowmini Varadhan wrote:
> On (06/25/18 03:38), Ka-Cheong Poon wrote:
>> @@ -1105,8 +1105,27 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
>>   			break;
>>   
>>   		case sizeof(*sin6): {
>> -			ret = -EPROTONOSUPPORT;
>> -			goto out;
>> +			int addr_type;
>                           :
>                           :
>> +			daddr = sin6->sin6_addr;
>> +			dport = sin6->sin6_port;
>> +			scope_id = sin6->sin6_scope_id;
>> +			break;
>>   		}
> 
> In rds_sendmsg, the scopeid passed to rds_conn_create_outgoing
> may come from the msg_name (if msg_name is a link-local) or
> may come from the rs_bound_scope_id (for connected socket, change
> made in Patch 1 of the series).
> 
> This sounds inconsistent.
> 
> If I bind to scopeid if1 and then send to fe80::1%if2 (without connect()),
> we'd create an rds_connection with dev_if set to if2.
> (first off, its a bit unexpected to be sending to fe80::1%if2 when you
> are bound to a link-local on if1!)
> 
> But then, if we got back a response from fe80::1%if2, I think we would
> not find a matching conn in rds_recv_incoming?


Yes, I think if the socket is bound, it should check the scope_id
in msg_name (if not NULL) to make sure that they match.  A bound
RDS socket can send to multiple peers.  But if the bound local
address is link local, it should only be allowed to send to peers
on the same link.


> And this is even more confusing because the fastpath in rds_sendmsg
> does not take the bound_scope_id into consideration at all:
> 1213         if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
> 1214                 conn = rs->rs_conn;
> 1215         else {
> 1216                 conn = rds_conn_create_outgoing( /* .. */, scope_id)
> so if I erroneously passed a msg_name on a connected rds socket, what
> would happen? (see also question about rds_connect() itself, below)


The check added above takes care of this.  The scope_id should
match.


> Should we always use rs_bound_scope_id for creating the outgoing
> rds_connection? (you may need something deterministic for this,
> like "if bound addr is linklocal, return error if daddr has a different
> scopeid, else use the bound addr's scopeid", plus, "if bound addr is
> not global, and daddr is link-local, we need a conn with the daddr's
> scopeid")


If a socket is bound, I guess the scope_id should be used.  So
if a socket is not bound to a link local address and the socket
is used to sent to a link local peer, it should fail.


> Also, why is there no IPv6 support in rds_connect?


Oops, I missed this when I ported the internal version to the
net-next version.  Will add it back.



-- 
K. Poon
ka-cheong.poon@oracle.com

^ permalink raw reply

* Re: [PATCH 00/14] ARM: davinci: step towards removing at24_platform_data
From: Bartosz Golaszewski @ 2018-06-25 17:46 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Sekhar Nori, Kevin Hilman, Russell King, Grygorii Strashko,
	David S . Miller, Srinivas Kandagatla, Lukas Wunner, Rob Herring,
	Florian Fainelli, Dan Carpenter, Ivan Khoronzhuk, David Lechner,
	Greg Kroah-Hartman, Linux ARM, Linux Kernel Mailing List,
	linux-omap, netdev, Bartosz Golaszewski
In-Reply-To: <20180625174024.GB17417@lunn.ch>

2018-06-25 19:40 GMT+02:00 Andrew Lunn <andrew@lunn.ch>:
> On Mon, Jun 25, 2018 at 05:50:11PM +0200, Bartosz Golaszewski wrote:
>> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
>>
>> Since I took over maintainership of the at24 driver I've been working
>> towards removing at24_platform_data in favor for device properties.
>>
>> DaVinci is the only platform that's still using it - all other users
>> have already been converted.
>>
>> One of the obstacles in case of DaVinci is removing the setup() callback
>> from the pdata struct, the only user of which are some davinci boards.
>
> Hi Bartosz
>
> Nice code.
>
> I've got a platform i want to add sometime soon using the at24. I
> noticed you doing the cleanup, so i avoided the setup() call. But
> using it would of helped...
>
> My platform is x86 based, so no device tree. I instantiate a number of
> AT24 devices from a platform driver, and then add nvmem cells so i can
> access data in these eeproms. This new code will make this simpler.
>
>> The only board that's still using this callback is now mityomapl138.
>> Unfortunately it stores more info in EEPROM than just the MAC address
>> and will require some more work. Unfortunately I don't have access
>> to this board so I can't test any actual solutions on a live hardware.
>
> Depending on what i find in the EEPROM, i need to instantiate other
> i2c devices. So i have the problem of knowing when the EEPROM has
> actually probed and i can use the nvmem API to retrieve the contents.
>
> What i have done so far, is registers a bus notifier on i2c_bus_type,
> and look for BUS_NOTIFY_BOUND_DRIVER. I can then check if the i2c
> client in the notifier is the at24 client. But when i then add more
> i2c clients from inside the notifier i get lockdep splats. They are
> false positives, but it does suggest it is not a good idea to do this.
>
> So it would be good to have some sort of recommended alternative to
> the setup() callback. Ideally it would be specific to a particular
> at24, and safe to call other i2c functions from.
>
> Do you have any ideas?
>
>    Andrew

With my patch 1/14 you'll get -EPROBE_DEFER from nvmem_cell_get() if
the nvmem provider is not yet registered. Will that help in your case?

Bart

^ permalink raw reply

* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Sowmini Varadhan @ 2018-06-25 17:50 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: netdev, santosh.shilimkar, davem, rds-devel
In-Reply-To: <25e1afda-7497-7f08-815a-286cf775bc09@oracle.com>

On (06/26/18 01:43), Ka-Cheong Poon wrote:
> 
> Yes, I think if the socket is bound, it should check the scope_id
> in msg_name (if not NULL) to make sure that they match.  A bound
> RDS socket can send to multiple peers.  But if the bound local
> address is link local, it should only be allowed to send to peers
> on the same link.

agree.


> If a socket is bound, I guess the scope_id should be used.  So
> if a socket is not bound to a link local address and the socket
> is used to sent to a link local peer, it should fail.

PF_RDS sockets *MUST* alwasy be bound.  See
Documentation/networking/rds.txt:
"   Sockets must be bound before you can send or receive data.
    This is needed because binding also selects a transport and
    attaches it to the socket. Once bound, the transport assignment
    does not change."

Also, rds_sendmsg checks this (from net-next, your version
has the equivalent ipv6_addr_any etc check):

        if (daddr == 0 || rs->rs_bound_addr == 0) {
                release_sock(sk);
                ret = -ENOTCONN; /* XXX not a great errno */
                goto out;
        }

> 
> >Also, why is there no IPv6 support in rds_connect?
> 
> 
> Oops, I missed this when I ported the internal version to the
> net-next version.  Will add it back.

Ok

--Sowmini

^ permalink raw reply

* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-25 17:54 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Cornelia Huck, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel, virtualization,
	konrad.wilk, boris.ostrovsky, Joao Martins, Venu Busireddy,
	vijay.balakrishna
In-Reply-To: <CADGSJ22DX8=rNPY+gNS_q=LCYLR5ieoE7oD8p4+8AnpzQsWSCg@mail.gmail.com>

On 6/22/2018 5:17 PM, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>>>>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>>>>> On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>>>>>>>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>>>>>> On Wed, 20 Jun 2018 22:48:58 +0300
>>>>>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>>>>>>>>>>> In any case, I'm not sure anymore why we'd want the extra uuid.
>>>>>>>>>>> It's mostly so we can have e.g. multiple devices with same MAC
>>>>>>>>>>> (which some people seem to want in order to then use
>>>>>>>>>>> then with different containers).
>>>>>>>>>>>
>>>>>>>>>>> But it is also handy for when you assign a PF, since then you
>>>>>>>>>>> can't set the MAC.
>>>>>>>>>>>
>>>>>>>>>> OK, so what about the following:
>>>>>>>>>>
>>>>>>>>>> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>>>>>>>>>    that we have a new uuid field in the virtio-net config space
>>>>>>>>>> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>>>>>>>>>    offer VIRTIO_NET_F_STANDBY_UUID if set
>>>>>>>>>> - when configuring, set the property to the group UUID of the vfio-pci
>>>>>>>>>>    device
>>>>>>>>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>>>>>>>> to still expose UUID in the config space on virtio-pci?
>>>>>>>>
>>>>>>>> Yes but guest is not supposed to read it.
>>>>>>>>
>>>>>>>>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>>>>>>>> where the corresponding vfio-pci device attached to for a guest which
>>>>>>>>> doesn't support the feature (legacy).
>>>>>>>>>
>>>>>>>>> -Siwei
>>>>>>>> Yes but you won't add the primary behind such a bridge.
>>>>>>> I assume the UUID feature is a new one besides the exiting
>>>>>>> VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>>>>>> if UUID feature is present and supported by guest, we'll do pairing
>>>>>>> based on UUID; if UUID feature present
>>>>>>                                   ^^^^^^^  is NOT present
>>>>> Let's go back for a bit.
>>>>>
>>>>> What is wrong with matching by mac?
>>>>>
>>>>> 1. Does not allow multiple NICs with same mac
>>>>> 2. Requires MAC to be programmed by host in the PT device
>>>>>     (which is often possible with VFs but isn't possible with all PT
>>>>>     devices)
>>>> Might not neccessarily be something wrong, but it's very limited to
>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>> I think Sridhar and Jiri might be better person to answer it. My
>> impression was that sync'ing the MAC address change between all 3
>> devices is challenging, as the failover driver uses MAC address to
>> match net_device internally.

Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
don't allow changing guest MAC unless it is a trusted VF.

^ permalink raw reply

* [PATCH 1/4] net: lan78xx: Allow for VLAN headers in timeout calcs
From: Dave Stevenson @ 2018-06-25 14:07 UTC (permalink / raw)
  To: woojung.huh, UNGLinuxDriver, davem, netdev; +Cc: Dave Stevenson
In-Reply-To: <cover.1529935234.git.dave.stevenson@raspberrypi.org>

The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.

Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
---
 drivers/net/usb/lan78xx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a89570f..2f793d4 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2298,7 +2298,7 @@ static int lan78xx_change_mtu(struct net_device *netdev, int new_mtu)
 	if ((ll_mtu % dev->maxpacket) == 0)
 		return -EDOM;
 
-	ret = lan78xx_set_rx_max_frame_length(dev, new_mtu + ETH_HLEN);
+	ret = lan78xx_set_rx_max_frame_length(dev, new_mtu + VLAN_ETH_HLEN);
 
 	netdev->mtu = new_mtu;
 
@@ -2587,7 +2587,8 @@ static int lan78xx_reset(struct lan78xx_net *dev)
 	buf |= FCT_TX_CTL_EN_;
 	ret = lan78xx_write_reg(dev, FCT_TX_CTL, buf);
 
-	ret = lan78xx_set_rx_max_frame_length(dev, dev->net->mtu + ETH_HLEN);
+	ret = lan78xx_set_rx_max_frame_length(dev,
+					      dev->net->mtu + VLAN_ETH_HLEN);
 
 	ret = lan78xx_read_reg(dev, MAC_RX, &buf);
 	buf |= MAC_RX_RXEN_;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/4] lan78xx minor fixes
From: Dave Stevenson @ 2018-06-25 14:07 UTC (permalink / raw)
  To: woojung.huh, UNGLinuxDriver, davem, netdev; +Cc: Dave Stevenson

This is a small set of patches for the Microchip LAN78xx chip,
as used in the Raspberry Pi 3B+.
The main debug/discussion was on
https://github.com/raspberrypi/linux/issues/2458

Initial symptoms were that VLANs were very unreliable.
A couple of things were found:
- firstly that the hardware timeout value set failed to 
  take into account the VLAN tag, so a full MTU packet
  would be timed out.
- second was that regular checksum failures were being
  reported. Disabling checksum offload confirmed that
  the checksums were valid, and further experimentation
  identified that it was only if the VLAN tags were being
  passed through to the kernel that there were issues.
  The hardware supports VLAN filtering and tag stripping,
  therefore those have been implemented (much of the work
  was already done), and the driver drops back to s/w
  checksums should the choice be made not to use the h/w
  VLAN stripping.

Dave Stevenson (4):
  net: lan78xx: Allow for VLAN headers in timeout calcs
  net: lan78xx: Add support for VLAN filtering.
  net: lan78xx: Add support for VLAN tag stripping.
  net: lan78xx: Use s/w csum check on VLANs without tag stripping

 drivers/net/usb/lan78xx.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 00/14] ARM: davinci: step towards removing at24_platform_data
From: Andrew Lunn @ 2018-06-25 18:02 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Sekhar Nori, Kevin Hilman, Russell King, Grygorii Strashko,
	David S . Miller, Srinivas Kandagatla, Lukas Wunner, Rob Herring,
	Florian Fainelli, Dan Carpenter, Ivan Khoronzhuk, David Lechner,
	Greg Kroah-Hartman, Linux ARM, Linux Kernel Mailing List,
	linux-omap, netdev, Bartosz Golaszewski
In-Reply-To: <CAMRc=MdEn74G5SJAaGLAfNSS7Zy7wqH04aN5CMZ45UXqhu_58g@mail.gmail.com>

> With my patch 1/14 you'll get -EPROBE_DEFER from nvmem_cell_get() if
> the nvmem provider is not yet registered. Will that help in your case?

I don't think so. My driver instantiates the AT24 device. So if i get
-EPROBE_DEFER, i need to cleanup the probe, and return -EPROBDE_DEFER
to the code. Which means i need to remove the AT24 device...

       Andrew

^ permalink raw reply

* Re: [PATCH] net/nfp: cast sizeof() to int when comparing with error code
From: Jakub Kicinski @ 2018-06-25 18:11 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: davem, oss-drivers, netdev
In-Reply-To: <20180625073318.6828-1-cgxu519@gmx.com>

On Mon, 25 Jun 2018 15:33:18 +0800, Chengguang Xu wrote:
> Negative error code will be larger than sizeof().
> 
> Signed-off-by: Chengguang Xu <cgxu519@gmx.com>

You should include the name of the tree in the tag, e.g. [PATCH net] or
[PATCH net-next] and if it's a fix, i.e. directed at the net tree, you
should add a fixes tag, in this case it would most likely be:

Fixes: a0d8e02c35ff ("nfp: add support for reading nffw info")

You can find this and more information about the development process of
the networking subsystem in:

https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thank you!

^ permalink raw reply

* [PATCH net-next] r8169: reject unsupported WoL options
From: Heiner Kallweit @ 2018-06-25 18:34 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller; +Cc: netdev@vger.kernel.org

So far unsupported WoL options are silently ignored. Change this and
reject attempts to set unsupported options. This prevents situations
where a user tries to set an unsupported WoL option and is under the
impression it was successful because ethtool doesn't complain.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1d33672c..70c13cc2 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1712,11 +1712,14 @@ static int rtl8169_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
 
+	if (wol->wolopts & ~WAKE_ANY)
+		return -EINVAL;
+
 	pm_runtime_get_noresume(d);
 
 	rtl_lock_work(tp);
 
-	tp->saved_wolopts = wol->wolopts & WAKE_ANY;
+	tp->saved_wolopts = wol->wolopts;
 
 	if (pm_runtime_active(d))
 		__rtl8169_set_wol(tp, tp->saved_wolopts);
-- 
2.18.0

^ permalink raw reply related

* Re: [net regression] "fib_rules: move common handling of newrule delrule msgs into fib_nl2rule" breaks suppress_prefixlength
From: Roopa Prabhu @ 2018-06-25 18:36 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Netdev
In-Reply-To: <CAJieiUiDOQF0NTUGAYmQEtwnrsXoqTAt-cW+dueeRXVP8rhaAw@mail.gmail.com>

On Mon, Jun 25, 2018 at 8:23 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On Sat, Jun 23, 2018 at 8:46 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Hey Roopa,
>>
>> On a kernel with a minimal networking config,
>> CONFIG_IP_MULTIPLE_TABLES appears to be broken for certain rules after
>> f9d4b0c1e9695e3de7af3768205bacc27312320c.
>>
>> Try, for example, running:
>>
>> $ ip -4 rule add table main suppress_prefixlength 0
>>
>> It returns with EEXIST.
>>
>> Perhaps the reason is that the new rule_find function does not match
>> on suppress_prefixlength? However, rule_exist from before didn't do
>> that either. I'll keep playing and see if I can track it down myself,
>> but thought I should let you know first.
>
> I am surprised at that also. I cannot find prior rule_exist looking at
> suppress_prefixlength.
> I will dig deeper also today. But your patch LGTM with a small change
> I commented on it.
>

So the previous rule_exists code did not check for attribute matches correctly.
It would ignore a rule at the first non-existent attribute mis-match.
eg in your case, it would
return at pref mismatch.


$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0

$ip rule show
0:      from all lookup local

32763:  from all lookup main suppress_prefixlength 0

32764:  from all lookup main suppress_prefixlength 0

32765:  from all lookup main suppress_prefixlength 0

32766:  from all lookup main

32767:  from all lookup default


With your patch (with my proposed fixes), you should get proper EXISTS check

$ git diff HEAD

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c

index 126ffc5..de4c171 100644

--- a/net/core/fib_rules.c

+++ b/net/core/fib_rules.c

@@ -428,6 +428,14 @@ static struct fib_rule *rule_find(struct
fib_rules_ops *ops,

                if (rule->l3mdev && r->l3mdev != rule->l3mdev)

                        continue;



+               if (rule->suppress_ifgroup != -1 &&

+                   (r->suppress_ifgroup != rule->suppress_ifgroup))

+                       continue;

+

+               if (rule->suppress_prefixlen != -1 &&

+                   (r->suppress_prefixlen != rule->suppress_prefixlen))

+                       continue;

+

                if (uid_range_set(&rule->uid_range) &&

                    (!uid_eq(r->uid_range.start, rule->uid_range.start) ||

                    !uid_eq(r->uid_range.end, rule->uid_range.end)))



$ ip -4 rule add table main suppress_prefixlength 0

$ ip -4 rule add table main suppress_prefixlength 0

RTNETLINK answers: File exists

^ permalink raw reply

* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Santosh Shilimkar @ 2018-06-25 18:44 UTC (permalink / raw)
  To: Sowmini Varadhan, Ka-Cheong Poon; +Cc: netdev, davem, rds-devel
In-Reply-To: <20180625175006.GI14823@oracle.com>

On 6/25/2018 10:50 AM, Sowmini Varadhan wrote:
> On (06/26/18 01:43), Ka-Cheong Poon wrote:
>>
>> Yes, I think if the socket is bound, it should check the scope_id
>> in msg_name (if not NULL) to make sure that they match.  A bound
>> RDS socket can send to multiple peers.  But if the bound local
>> address is link local, it should only be allowed to send to peers
>> on the same link.
> 
> agree.
Yep. Its inline with RDS bind behavior.

> 
> 
>> If a socket is bound, I guess the scope_id should be used.  So
>> if a socket is not bound to a link local address and the socket
>> is used to sent to a link local peer, it should fail.
> 
> PF_RDS sockets *MUST* alwasy be bound.  See
> Documentation/networking/rds.txt:
> "   Sockets must be bound before you can send or receive data.
>      This is needed because binding also selects a transport and
>      attaches it to the socket. Once bound, the transport assignment
>      does not change."
> 
In any case link local or not, the socket needs to be bound before
any data can be sent as documented. Send path already enforces
it.

>>> Also, why is there no IPv6 support in rds_connect?
>>
>>
>> Oops, I missed this when I ported the internal version to the
>> net-next version.  Will add it back.
> 
So the net-next wasn't tested? IPv6 connections
itself wouldn't be formed with this missing. As mentioned
already, please test v2 before posting on list.

Regards,
Santosh

^ permalink raw reply

* Re: Route fallback issue
From: Julian Anastasov @ 2018-06-25 18:50 UTC (permalink / raw)
  To: Grant Taylor; +Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <544875a1-68e6-a8f3-4eb7-44f053605c3e@spamtrap.tnetconsulting.net>


	Hello,

On Thu, 21 Jun 2018, Grant Taylor wrote:

> On 06/21/2018 01:57 PM, Julian Anastasov wrote:
> > Hello,
> 
> > http://ja.ssi.bg/dgd-usage.txt
> 
> "DGD" or "Dead Gateway Detection" sounds very familiar.  I referenced it in an
> earlier reply.
> 
> I distinctly remember DGD not behaving satisfactorily years ago.  Where
> unsatisfactorily was something like 90 seconds (or more) to recover. Which
> actually matches what I was getting without the ignore_routes_with_linkdown=1
> setting that David A. mentioned.

	Yes, ARP state for unreachable GWs may be updated slowly,
there is in-time feedback only for reachable state.

> With ignore_routes_with_linkdown=1 things behaved much better.
> 
> > Not true. net/ipv4/fib_semantics.c:fib_select_path() calls
> > fib_select_default() only when prefixlen = 0 (default route).
> 
> Okay....  My testing last night disagrees with you.  Specifically, I was able
> to add a alternate routes to the same prefix, 192.0.2.128/26. There was not
> any default gateway configured on any of the NetNSs.  So everything was using
> routes for locally attacked or the two added via "ip route append".
> 
> What am I misinterpreting?  Or where are we otherwise talking past each other?

	You can create the two routes, of course. But only the
default routes are alternative.

> 
> > Otherwise, only the first route will be considered.
> 
> "only the first route" almost sounds like something akin to Equal Cost Multi
> Path.
> 
> I was not expecting "alternative routes" to use more than one route at a time,
> equally or otherwise.  I was wanting for the kernel to fall back to an
> alternate route / gateway / path in the event that the one that was being used
> became unusable / unreachable.
> 
> So what should "Alternative Routes" do?  How does this compare / contract to
> E.C.M.P. or D.G.D.

	The alternative routes work in this way:

- on lookup, routes are walked in order - as listed in table

- as long as route contains reachable gateway (ARP state), only this route 
is used

- if some gateway becomes unreachable (ARP state), next alternative routes 
are tried

- if ARP entry is expired (missing), this gateway can be probed if the 
route is before the currently used route. This is what happens initially
when no ARP state is present for the GWs. It is bad luck if the probed
GW is actually unreachable.

- active probing by user space (ping GWs) can only help to keep the
ARP state present for the used gateways. By this way, if ARP entry 
for GW is missing, the kernel will not risk to select unavailable route 
with the goal to probe the GW.

> > fib_select_default() is the function that decides which nexthop is reachable
> > and whether to contact it. It uses the ARP state via fib_detect_death().
> > That is all code that is behind this feature called "alternative routes":
> > the kernel selects one based on nexthop's ARP state.
> 
> Please confirm that you aren't entering / referring to E.C.M.P. territory when
> you say "nexthop".  I think that you are not, but I want to ask and be sure,
> particularly seeing as how things are very closely related.

	nexthop is the GW in the route

> It sounds like you're referring to literally the router that is the next hop
> in the path.  I.e. the device on the other end of the wire.

	Yes, the kernel avoids alternative routes with unreachable GWs

> I want to do some testing to see if fib_multipath_use_neigh alters this
> behavior at all.  I'm hoping that it will invalidate an alternate route if the
> MAC is not resolvable even if the physical link stays up.

	The multipath route uses all its alive nexthops at the same 
time... But you may need in the same way active probing by user space,
otherwise unavailable GW can be selected.

> Sure, the ARP cache may have a 30 ~ 120 second timeout before triggering this
> behavior.  But having that timeout and starting to use an alternative route is
> considerably better than not using an alternative route.

	Yes, if you prefer, you may run PING every second to avoid such 
delays...

Regards

^ permalink raw reply

* Re: Route fallback issue
From: Julian Anastasov @ 2018-06-25 19:02 UTC (permalink / raw)
  To: Erik Auerswald; +Cc: Grant Taylor, Akshat Kakkar, netdev, cronolog+lartc, lartc
In-Reply-To: <20180624134524.GA9925@unix-ag.uni-kl.de>


	Hello,

On Sun, 24 Jun 2018, Erik Auerswald wrote:

> Hello Julien,
> 
> > http://ja.ssi.bg/dgd-usage.txt
> 
> Thanks for that info!
> 
> Can you tell us what parts from the above text is actually implemented
> in the upstream Linux kernel, and starting with which version(s)
> (approximately)? The text describes ideas and patches from nearly two
> decades ago, is more recent documentation available somewhere?

	Nothing is included in kernel. The idea is that user space
has more control. It is best done with CONNMARK: stick NATed
connection to some path (via alive ISP), use route lookup just
to select alive path for the first packet in connection. So, what
we balance are connections, not packets (which does not work with
different ISPs). Probe GWs to keep only alive routes in the table.

Regards

^ permalink raw reply

* Re: Route fallback issue
From: Grant Taylor @ 2018-06-25 20:07 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <alpine.LFD.2.20.1806252058210.2593@ja.home.ssi.bg>

[-- Attachment #1: Type: text/plain, Size: 3308 bytes --]

On 06/25/2018 12:50 PM, Julian Anastasov wrote:
> Hello,

Hi Julian,

> Yes, ARP state for unreachable GWs may be updated slowly, there is 
> in-time feedback only for reachable state.

Fair.

Most of the installations where I needed D.G.D. to work would be okay 
with a < 5 minute timeout.  Obviously they would like faster, but 
automation is a LOT better than waiting on manual intervention.

IMHO < 30 seconds is great.  < 90 seconds is acceptable.  < 300 seconds 
leaves some room for improvement.

> You can create the two routes, of course. But only the default routes 
> are alternative.

Are you saying that the functionality I'm describing only works for 
default gateways or that the term "alternative route" only applies to 
default gateways?

The testing that I did indicated that alternative routes worked for 
specific prefixes too.

I tested multiple NetNSs with only directly attached routes and appended 
routes to a destination prefix, no default gateway / route of last resort.

The behavior seemed to be different when ignore_routes_with_linkdown was 
set verses unset.  Specifically, ignore_routes_with_linkdown seemed to 
help considerably.

Hence why I question the requirement for the "default" route verses a 
route to a specific prefix.

Can you explain why I saw the behavior difference with 
ignore_routes_with_linkdown if it only applies to the default route?

> The alternative routes work in this way:
> 
> - on lookup, routes are walked in order - as listed in table
> 
> - as long as route contains reachable gateway (ARP state), only this 
> route is used
> 
> - if some gateway becomes unreachable (ARP state), next alternative 
> routes are tried
> 
> - if ARP entry is expired (missing), this gateway can be probed if the 
> route is before the currently used route. This is what happens initially 
> when no ARP state is present for the GWs. It is bad luck if the probed 
> GW is actually unreachable.
> 
> - active probing by user space (ping GWs) can only help to keep the ARP 
> state present for the used gateways. By this way, if ARP entry for GW 
> is missing, the kernel will not risk to select unavailable route with 
> the goal to probe the GW.

This all makes sense.

Please confirm if "gateway" in this context is the "/default/ gateway" 
or not.  I ask because arguably "gateway" can be used as a term to 
describe the next hop for a route, or gateway, to a prefix.  Further, 
the "/default/ (gateway,router)" is the gateway or route of last resort. 
  Which to me means that "gateway" can be any route in this context.

> nexthop is the GW in the route

Thank you for confirming.

> Yes, the kernel avoids alternative routes with unreachable GWs

Fair enough.

> The multipath route uses all its alive nexthops at the same time... But 
> you may need in the same way active probing by user space, otherwise 
> unavailable GW can be selected.

I assume that the dead ECMP NEXTHOP is also subject to similar timeouts 
as alternative routes.  Correct?

> Yes, if you prefer, you may run PING every second to avoid such delays...

Agreed.

I'm trying to make sure I understand basic functionality before I do 
things to modify it.

-- 
Grant. . . .
unix || die

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3982 bytes --]

^ permalink raw reply

* [PATCH net 0/2] nfp: MPLS and shared blocks TC offload fixes
From: Jakub Kicinski @ 2018-06-25 20:22 UTC (permalink / raw)
  To: davem; +Cc: oss-drivers, netdev, Jakub Kicinski

Hi!

This series brings two fixes to TC filter/action offload code.
Pieter fixes matching MPLS packets when the match is purely on
the MPLS ethertype and none of the MPLS fields are used.
John provides a fix for offload of shared blocks.  Unfortunately,
with shared blocks there is currently no guarantee that filters
which were added by the core will be removed before block unbind.
Our simple fix is to not support offload of rules on shared blocks
at all, a revert of this fix will be send for -next once the
reoffload infrastructure lands.  The shared blocks became important
as we are trying to use them for bonding offload (managed from user
space) and lack of remove calls leads to resource leaks.

John Hurley (1):
  nfp: reject binding to shared blocks

Pieter Jansen van Vuuren (1):
  nfp: flower: fix mpls ether type detection

 drivers/net/ethernet/netronome/nfp/bpf/main.c      |  3 +++
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++++++++++++++
 .../net/ethernet/netronome/nfp/flower/offload.c    | 12 ++++++++++++
 3 files changed, 29 insertions(+)

-- 
2.17.1

^ permalink raw reply

* [PATCH net 1/2] nfp: flower: fix mpls ether type detection
From: Jakub Kicinski @ 2018-06-25 20:22 UTC (permalink / raw)
  To: davem; +Cc: oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180625202246.28871-1-jakub.kicinski@netronome.com>

From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

Previously it was not possible to distinguish between mpls ether types and
other ether types. This leads to incorrect classification of offloaded
filters that match on mpls ether type. For example the following two
filters overlap:

 # tc filter add dev eth0 parent ffff: \
    protocol 0x8847 flower \
    action mirred egress redirect dev eth1

 # tc filter add dev eth0 parent ffff: \
    protocol 0x0800 flower \
    action mirred egress redirect dev eth2

The driver now correctly includes the mac_mpls layer where HW stores mpls
fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to
indicate that the filter should match mpls packets.

Fixes: bb055c198d9b ("nfp: add mpls match offloading support")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++++++++++++++
 .../net/ethernet/netronome/nfp/flower/offload.c    |  8 ++++++++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c b/drivers/net/ethernet/netronome/nfp/flower/match.c
index 91935405f586..84f7a5dbea9d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/match.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/match.c
@@ -123,6 +123,20 @@ nfp_flower_compile_mac(struct nfp_flower_mac_mpls *frame,
 			 NFP_FLOWER_MASK_MPLS_Q;
 
 		frame->mpls_lse = cpu_to_be32(t_mpls);
+	} else if (dissector_uses_key(flow->dissector,
+				      FLOW_DISSECTOR_KEY_BASIC)) {
+		/* Check for mpls ether type and set NFP_FLOWER_MASK_MPLS_Q
+		 * bit, which indicates an mpls ether type but without any
+		 * mpls fields.
+		 */
+		struct flow_dissector_key_basic *key_basic;
+
+		key_basic = skb_flow_dissector_target(flow->dissector,
+						      FLOW_DISSECTOR_KEY_BASIC,
+						      flow->key);
+		if (key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_UC) ||
+		    key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_MC))
+			frame->mpls_lse = cpu_to_be32(NFP_FLOWER_MASK_MPLS_Q);
 	}
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index c42e64f32333..477f584f6d28 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -264,6 +264,14 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 		case cpu_to_be16(ETH_P_ARP):
 			return -EOPNOTSUPP;
 
+		case cpu_to_be16(ETH_P_MPLS_UC):
+		case cpu_to_be16(ETH_P_MPLS_MC):
+			if (!(key_layer & NFP_FLOWER_LAYER_MAC)) {
+				key_layer |= NFP_FLOWER_LAYER_MAC;
+				key_size += sizeof(struct nfp_flower_mac_mpls);
+			}
+			break;
+
 		/* Will be included in layer 2. */
 		case cpu_to_be16(ETH_P_8021Q):
 			break;
-- 
2.17.1

^ permalink raw reply related

* [PATCH net 2/2] nfp: reject binding to shared blocks
From: Jakub Kicinski @ 2018-06-25 20:22 UTC (permalink / raw)
  To: davem; +Cc: oss-drivers, netdev, John Hurley, Jakub Kicinski
In-Reply-To: <20180625202246.28871-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.

Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.

Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks infrastructure")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c       | 3 +++
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index fcdfb8e7fdea..6b15e3b11956 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev,
 	if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
 		return -EOPNOTSUPP;
 
+	if (tcf_block_shared(f->block))
+		return -EOPNOTSUPP;
+
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 477f584f6d28..525057bee0ed 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device *netdev,
 	if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
 		return -EOPNOTSUPP;
 
+	if (tcf_block_shared(f->block))
+		return -EOPNOTSUPP;
+
 	switch (f->command) {
 	case TC_BLOCK_BIND:
 		return tcf_block_cb_register(f->block,
-- 
2.17.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox