Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: ipsec doesn't route TCP with 4.11 kernel
From: Don Bowman @ 2017-04-27 22:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Cong Wang, linux-kernel@vger.kernel.org, Herbert Xu,
	Linux Kernel Network Developers
In-Reply-To: <20170427084238.GX2649@secunet.com>

On 27 April 2017 at 04:42, Steffen Klassert
<steffen.klassert@secunet.com> wrote:
> On Wed, Apr 26, 2017 at 10:01:34PM -0700, Cong Wang wrote:
>> (Cc'ing netdev and IPSec maintainers)
>>
>> On Tue, Apr 25, 2017 at 6:08 PM, Don Bowman <db@donbowman.ca> wrote:

for 'esp' question, i have ' esp = aes256-sha256-modp1536!' is that
what you mean?
its nat-aware tunnel [from my desktop pc to my office]

root@office:~# ip -s x s
src 172.16.0.8 dst 64.7.137.180
        proto esp spi 0x0d588366(223904614) reqid 1(0x00000001) mode tunnel
        replay-window 0 seq 0x00000000 flag af-unspec (0x00100000)
        auth-trunc hmac(sha256)
0x046cafdf19c5d78d1c29165d96a0b9fce1c500029d77be0fe956dce1bf80a86a
(256 bits) 128
        enc cbc(aes)
0x79ff2fbc2178eb468de6ff16612f0603b514a1d1d5f375c67222294463ec7c62
(256 bits)
        encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
        anti-replay context: seq 0x0, oseq 0x28, bitmap 0x00000000
        lifetime config:
          limit: soft (INF)(bytes), hard (INF)(bytes)
          limit: soft (INF)(packets), hard (INF)(packets)
          expire add: soft 42843(sec), hard 43200(sec)
          expire use: soft 0(sec), hard 0(sec)
        lifetime current:
          2986(bytes), 40(packets)
          add 2017-04-27 18:08:12 use 2017-04-27 18:08:12
        stats:
          replay-window 0 replay 0 failed 0
src 64.7.137.180 dst 172.16.0.8
        proto esp spi 0xcd366c03(3442895875) reqid 1(0x00000001) mode tunnel
        replay-window 32 seq 0x00000000 flag af-unspec (0x00100000)
        auth-trunc hmac(sha256)
0x4158741cc971c49417d60165f19ed02249385c7bba808927d4a9e7b45fb30c5b
(256 bits) 128
        enc cbc(aes)
0x77592c79c964787bca5012214b85172b06deb7b3f06aac02e3934dd9ead67c15
(256 bits)
        encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
        anti-replay context: seq 0x27, oseq 0x0, bitmap 0xffffffff
        lifetime config:
          limit: soft (INF)(bytes), hard (INF)(bytes)
          limit: soft (INF)(packets), hard (INF)(packets)
          expire add: soft 42873(sec), hard 43200(sec)
          expire use: soft 0(sec), hard 0(sec)
        lifetime current:
          4501(bytes), 38(packets)
          add 2017-04-27 18:08:12 use 2017-04-27 18:08:12
        stats:
          replay-window 0 replay 0 failed 0


>> >
>> > My ipsec tunnel comes up ok.
>
> When talking about IPsec, I guess you use ESP, right?
 ...

>
> If it is a GRO issue, then it is on the receive side, could you do
> tcpdump on the receiving interface to see what you get there?

I'm not sure what you mean the receiving interface, you mean the
outer, the native interface?
listening on eno1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:11:32.061501 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:32.788091 IP 64.7.137.180.4500 > 172.16.0.8.4500: NONESP-encap:
isakmp: child_sa  inf2
18:11:32.788354 IP 172.16.0.8.4500 > 64.7.137.180.4500: NONESP-encap:
isakmp: child_sa  inf2[IR]
18:11:33.066830 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:35.082839 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:37.807945 IP 64.7.137.180.4500 > 172.16.0.8.4500: NONESP-encap:
isakmp: child_sa  inf2
18:11:37.808300 IP 172.16.0.8.4500 > 64.7.137.180.4500: NONESP-encap:
isakmp: child_sa  inf2[IR]

is what i see there for the 'curl' command that doesn't complete.

>
> What shows /proc/net/xfrm_stat?

root@office:~# cat /proc/net/xfrm_stat
XfrmInError                     0
XfrmInBufferError               0
XfrmInHdrError                  0
XfrmInNoStates                  0
XfrmInStateProtoError           0
XfrmInStateModeError            0
XfrmInStateSeqError             0
XfrmInStateExpired              0
XfrmInStateMismatch             0
XfrmInStateInvalid              0
XfrmInTmplMismatch              0
XfrmInNoPols                    0
XfrmInPolBlock                  0
XfrmInPolError                  0
XfrmOutError                    0
XfrmOutBundleGenError           0
XfrmOutBundleCheckError         0
XfrmOutNoStates                 0
XfrmOutStateProtoError          0
XfrmOutStateModeError           0
XfrmOutStateSeqError            0
XfrmOutStateExpired             0
XfrmOutPolBlock                 0
XfrmOutPolDead                  0
XfrmOutPolError                 0
XfrmFwdHdrError                 0
XfrmOutStateInvalid             0
XfrmAcquireError                0

>
> Can you do 'ip -s x s' to see if the SAs are used?
>
> Do you have INET_ESP_OFFLOAD enabled?
>

CONFIG_INET_ESP=m
CONFIG_INET_ESP_OFFLOAD=m
CONFIG_INET6_ESP=m
CONFIG_INET6_ESP_OFFLOAD=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y


# lsmod |grep esp
esp4                   20480  2
xfrm_algo              16384  5 xfrm_user,esp4,ah4,af_key,xfrm_ipcomp

^ permalink raw reply

* Re: rhashtable - Cap total number of entries to 2^31
From: Florian Fainelli @ 2017-04-27 22:21 UTC (permalink / raw)
  To: Herbert Xu, David Miller; +Cc: fw, netdev, Thomas Graf
In-Reply-To: <0cd0286d-b81d-7bf4-d345-7ef098b9a998@broadcom.com>

On 04/27/2017 02:16 PM, Florian Fainelli wrote:
> Hi Herbert,
> 
> On 04/26/2017 10:44 PM, Herbert Xu wrote:
>> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote:
>>> From: Florian Westphal <fw@strlen.de>
>>> Date: Tue, 25 Apr 2017 16:17:49 +0200
>>>
>>>> I'd have less of an issue with this if we'd be talking about
>>>> something computationally expensive, but this is about storing
>>>> an extra value inside a struct just to avoid one "shr" in insert path...
>>>
>>> Agreed, this shift is probably filling an available cpu cycle :-)
>>
>> OK, but we need to have an extra field for another reason anyway.
>> The problem is that we're not capping the total number of elements
>> in the hashtable when max_size is not set, this means that nelems
>> can overflow which will cause havoc with the automatic shrinking
>> when it tries to fit 2^32 entries into a minimum-sized table.
>>
>> So I'm taking that hole back for now :)
>>
>> ---8<---
>> When max_size is not set or if it set to a sufficiently large
>> value, the nelems counter can overflow.  This would cause havoc
>> with the automatic shrinking as it would then attempt to fit a
>> huge number of entries into a tiny hash table.
>>
>> This patch fixes this by adding max_elems to struct rhashtable
>> to cap the number of elements.  This is set to 2^31 as nelems is
>> not a precise count.  This is sufficiently smaller than UINT_MAX
>> that it should be safe.
>>
>> When max_size is set max_elems will be lowered to at most twice
>> max_size as is the status quo.
>>
>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> This commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b
> 
> makes my ARMv7 (32-bit) system panic on boot with the log below. I can
> test net-next (or net) and report back if you want me to test anything.
> Thanks!

And another on with a QEMU guest:

[    0.389212] NET: Registered protocol family 16
[    0.388807] Kernel panic - not syncing: rtnetlink_init: cannot
initialize rtnetlink
[    0.388807]
[    0.389445] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.11.0-rc8-02077-ge221c1f0fe25 #1
[    0.389745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[    0.390219] Call Trace:
[    0.391406]  dump_stack+0x51/0x78
[    0.391585]  panic+0xc7/0x20e
[    0.391740]  ? register_pernet_operations+0xa1/0xd0
[    0.392031]  rtnetlink_init+0x22/0x1a0
[    0.392190]  netlink_proto_init+0x168/0x184
[    0.392359]  ? ptp_classifier_init+0x26/0x30
[    0.392528]  ? netlink_net_init+0x2e/0x2e
[    0.392692]  do_one_initcall+0x54/0x190
[    0.392852]  ? parse_args+0x248/0x400
[    0.393033]  kernel_init_freeable+0x127/0x1b6
[    0.393208]  ? kernel_init_freeable+0x1b6/0x1b6
[    0.393389]  ? rest_init+0x70/0x70
[    0.393533]  kernel_init+0x9/0x100
[    0.393676]  ret_from_fork+0x29/0x40
[    0.394555] ---[ end Kernel panic - not syncing: rtnetlink_init:
cannot initialize rtnetlink
[    0.394555]

I traced this down to:

rtnetlink_net_init()
  netlink_kernel_create()
     netlink_insert()
	__netlink_insert()
	   rhashtable_lookup_insert_key()
	      __rhashtable_insert_fast()
                rht_grow_above_max()

And indeed we have:

ht->nelemts = 0
ht->max_elems = 0

such that rht_grow_above_max() returns true.

With your commit we actually take this branch:

if (ht->p.max_size < ht->max_elems / 2)
	ht->max_elems = ht->p.max_size * 2;

since max_size = 0 we have max_elems = 0 as well.

Candidate fix #1:

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 45f89369c4c8..ad9020e1609c 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -329,7 +329,7 @@ static inline bool rht_grow_above_100(const struct
rhashtable *ht,
 static inline bool rht_grow_above_max(const struct rhashtable *ht,
                                      const struct bucket_table *tbl)
 {
-       return atomic_read(&ht->nelems) >= ht->max_elems;
+       return ht->p.max_size && atomic_read(&ht->nelems) >= ht->max_elems;
 }

Candidate fix #2:

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,

        /* Cap total entries at 2^31 to avoid nelems overflow. */
        ht->max_elems = 1u << 31;
-       if (ht->p.max_size < ht->max_elems / 2)
+       if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
                ht->max_elems = ht->p.max_size * 2;

        ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);

Number #2 does not introduce an additional conditional on the fastpath,
so I suppose that would be what we would prefer?

> 
> [    0.158619] futex hash table entries: 1024 (order: 4, 65536 bytes)
> [    0.166386] NET: Registered protocol family 16
> [    0.179596] Kernel panic - not syncing: rtnetlink_init: cannot
> initialize rtnetlink
> [    0.179596]
> [    0.189350] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [    0.197908] Hardware name: Broadcom STB (Flattened Device Tree)
> [    0.204254] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [    0.212447] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [    0.220144] [<c04bc454>] (dump_stack) from [<c02ab684>]
> (panic+0xf0/0x270)
> [    0.227460] [<c02ab684>] (panic) from [<c0c2705c>]
> (rtnetlink_init+0x24/0x1d4)
> [    0.235145] [<c0c2705c>] (rtnetlink_init) from [<c0c27630>]
> (netlink_proto_init+0x124/0x148)
> [    0.244124] [<c0c27630>] (netlink_proto_init) from [<c02017f8>]
> (do_one_initcall+0x40/0x168)
> [    0.253072] [<c02017f8>] (do_one_initcall) from [<c0c00dfc>]
> (kernel_init_freeable+0x164/0x200)
> [    0.262304] [<c0c00dfc>] (kernel_init_freeable) from [<c087bfd8>]
> (kernel_init+0x8/0x110)
> [    0.270970] [<c087bfd8>] (kernel_init) from [<c0207fa8>]
> (ret_from_fork+0x14/0x2c)
> [    0.279014] CPU1: stopping
> [    0.281916] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [    0.290499] Hardware name: Broadcom STB (Flattened Device Tree)
> [    0.296796] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [    0.305018] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [    0.312684] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [    0.320531] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [    0.328586] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [    0.336543] Exception stack(0xee055f68 to 0xee055fb0)
> [    0.341938] 5f60:                   00000001 00000000 ee055fc0
> c0219b60 ee054000 c1603cc8
> [    0.350661] 5f80: c1603c6c 00000000 00000000 c1486188 ee055fc0
> c1603cd4 c1483408 ee055fb8
> [    0.359323] 5fa0: c0208a40 c0208a44 60000013 ffffffff
> [    0.364745] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [    0.372613] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [    0.380479] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [    0.388493] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [    0.395687] CPU3: stopping
> [    0.398606] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [    0.407242] Hardware name: Broadcom STB (Flattened Device Tree)
> [    0.413564] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [    0.421795] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [    0.429495] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [    0.437394] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [    0.445475] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [    0.453406] Exception stack(0xee059f68 to 0xee059fb0)
> [    0.458792] 9f60:                   00000001 00000000 ee059fc0
> c0219b60 ee058000 c1603cc8
> [    0.467489] 9f80: c1603c6c 00000000 00000000 c1486188 ee059fc0
> c1603cd4 c1483408 ee059fb8
> [    0.476177] 9fa0: c0208a40 c0208a44 60000013 ffffffff
> [    0.481581] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [    0.489474] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [    0.497331] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [    0.505369] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [    0.512562] CPU2: stopping
> [    0.515463] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [    0.524047] Hardware name: Broadcom STB (Flattened Device Tree)
> [    0.530368] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [    0.538573] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [    0.546195] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [    0.554050] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [    0.562096] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [    0.570044] Exception stack(0xee057f68 to 0xee057fb0)
> [    0.575465] 7f60:                   00000001 00000000 ee057fc0
> c0219b60 ee056000 c1603cc8
> [    0.584145] 7f80: c1603c6c 00000000 00000000 c1486188 ee057fc0
> c1603cd4 c1483408 ee057fb8
> [    0.592806] 7fa0: c0208a40 c0208a44 60000013 ffffffff
> [    0.598220] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [    0.606103] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [    0.613960] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [    0.621990] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [    0.629201] ---[ end Kernel panic - not syncing: rtnetlink_init:
> cannot initialize rtnetlink
> [    0.629201]
> 


-- 
Florian

^ permalink raw reply related

* [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, herbert, fw, tgraf, Florian Fainelli
In-Reply-To: <56843a86-9a09-16e8-acec-05a80396f282@gmail.com>

After commit 6d684e54690c ("rhashtable: Cap total number of
entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
during initialization. The call stack would look like this:

register_pernet_subsys()
    ...
    ops->init()
	rtnetlink_net_init()
	  netlink_kernel_create()
	     netlink_insert()
		__netlink_insert()
		   rhashtable_lookup_insert_key()
		      __rhashtable_insert_fast()
			rht_grow_above_max()

And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
&& ht->max_elems = 0 (looks bogus).

Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
and propagate this all the way back to the caller.

After commit 6d684e54690c what changed is that we would take the
following condition:

if (ht->p.max_size < ht->max_elems / 2)
	ht->max_elems = ht->p.max_size * 2;

and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.

Fix this by taking this branch only when ht->p.max_size is non-zero

Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
---
 lib/rhashtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
 
 	/* Cap total entries at 2^31 to avoid nelems overflow. */
 	ht->max_elems = 1u << 31;
-	if (ht->p.max_size < ht->max_elems / 2)
+	if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
 		ht->max_elems = ht->p.max_size * 2;
 
 	ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
-- 
2.12.2

^ permalink raw reply related

* [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, herbert, fw, tgraf, Florian Fainelli
In-Reply-To: <56843a86-9a09-16e8-acec-05a80396f282@gmail.com>

After commit 6d684e54690c ("rhashtable: Cap total number of
entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
during initialization. The call stack would look like this:

register_pernet_subsys()
    ...
    ops->init()
	rtnetlink_net_init()
	  netlink_kernel_create()
	     netlink_insert()
		__netlink_insert()
		   rhashtable_lookup_insert_key()
		      __rhashtable_insert_fast()
			rht_grow_above_max()

And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
&& ht->max_elems = 0 (looks bogus).

Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
and propagate this all the way back to the caller.

After commit 6d684e54690c what changed is that we would take the
following condition:

if (ht->p.max_size < ht->max_elems / 2)
	ht->max_elems = ht->p.max_size * 2;

and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.

Fix this by taking this branch only when ht->p.max_size is non-zero

Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 lib/rhashtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
 
 	/* Cap total entries at 2^31 to avoid nelems overflow. */
 	ht->max_elems = 1u << 31;
-	if (ht->p.max_size < ht->max_elems / 2)
+	if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
 		ht->max_elems = ht->p.max_size * 2;
 
 	ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, herbert, fw, tgraf
In-Reply-To: <20170427222824.31936-1-florian.fainelli@broadcom.com>

On 04/27/2017 03:28 PM, Florian Fainelli wrote:
> After commit 6d684e54690c ("rhashtable: Cap total number of
> entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
> during initialization. The call stack would look like this:
> 
> register_pernet_subsys()
>     ...
>     ops->init()
> 	rtnetlink_net_init()
> 	  netlink_kernel_create()
> 	     netlink_insert()
> 		__netlink_insert()
> 		   rhashtable_lookup_insert_key()
> 		      __rhashtable_insert_fast()
> 			rht_grow_above_max()
> 
> And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
> && ht->max_elems = 0 (looks bogus).
> 
> Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
> and propagate this all the way back to the caller.
> 
> After commit 6d684e54690c what changed is that we would take the
> following condition:
> 
> if (ht->p.max_size < ht->max_elems / 2)
> 	ht->max_elems = ht->p.max_size * 2;
> 
> and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.
> 
> Fix this by taking this branch only when ht->p.max_size is non-zero
> 
> Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>

Sent another version with the correct email address and marked this one
as superseded in patchwork, not that this email is not valid, but it's
all about consistency.

David pleas apply this one instead:
http://patchwork.ozlabs.org/patch/756172/

/me remembers to stop switching between machines.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
From: Marcelo Ricardo Leitner @ 2017-04-27 22:54 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Jay Vosburgh, David S. Miller,
	Honggang LI, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <733d454d3c36e99b55de5374c7664364975b171d.1493313626.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, Apr 27, 2017 at 07:29:34PM +0200, Paolo Abeni wrote:
> On slave list updates, the bonding driver computes its hard_header_len
> as the maximum of all enslaved devices's hard_header_len.
> If the slave list is empty, e.g. on last enslaved device removal,
> ETH_HLEN is used.
> 
> Since the bonding header_ops are set only when the first enslaved
> device is attached, the above can lead to header_ops->create()
> being called with the wrong skb headroom in place.
> 
> If bond0 is configured on top of ipoib devices, with the
> following commands:
> 
> ifup bond0
> for slave in $BOND_SLAVES_LIST; do
> 	ip link set dev $slave nomaster
> done
> ping -c 1 <ip on bond0 subnet>
> 
> we will obtain a skb_under_panic() with a similar call trace:
> 	skb_push+0x3d/0x40
> 	push_pseudo_header+0x17/0x30 [ib_ipoib]
> 	ipoib_hard_header+0x4e/0x80 [ib_ipoib]
> 	arp_create+0x12f/0x220
> 	arp_send_dst.part.19+0x28/0x50
> 	arp_solicit+0x115/0x290
> 	neigh_probe+0x4d/0x70
> 	__neigh_event_send+0xa7/0x230
> 	neigh_resolve_output+0x12e/0x1c0
> 	ip_finish_output2+0x14b/0x390
> 	ip_finish_output+0x136/0x1e0
> 	ip_output+0x76/0xe0
> 	ip_local_out+0x35/0x40
> 	ip_send_skb+0x19/0x40
> 	ip_push_pending_frames+0x33/0x40
> 	raw_sendmsg+0x7d3/0xb50
> 	inet_sendmsg+0x31/0xb0
> 	sock_sendmsg+0x38/0x50
> 	SYSC_sendto+0x102/0x190
> 	SyS_sendto+0xe/0x10
> 	do_syscall_64+0x67/0x180
> 	entry_SYSCALL64_slow_path+0x25/0x25
> 
> This change addresses the issue avoiding updating the bonding device
> hard_header_len when the slaves list become empty, forbidding to
> shrink it below the value used by header_ops->create().
> 
> The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
> hard_header_len") but the panic can be triggered only since
> commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
> header").
> 
> Reported-by: Norbert P <noe-PRwTpj6vllL463JZfw7VRw@public.gmane.org>
> Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len")
> Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---

Thanks Paolo.

>  drivers/net/bonding/bond_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 8a4ba8b..34481c9 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond)
>  		gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
>  		gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
>  	}
> +	bond_dev->hard_header_len = max_hard_header_len;
>  
>  done:
>  	bond_dev->vlan_features = vlan_features;
>  	bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
> -	bond_dev->hard_header_len = max_hard_header_len;
>  	bond_dev->gso_max_segs = gso_max_segs;
>  	netif_set_gso_max_size(bond_dev, gso_max_size);
>  
> -- 
> 2.9.3
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 23:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
	Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170427221132.GA30036-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>



On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> Well, that is in the current form, with more users it would make sense
> to optimize for the single page case, eg by providing the existing
> call, providing a faster single-page-only variant of the copy, perhaps
> even one that is inlined.

Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
such... I'm having trouble thinking of a sane name that isn't too long).
That just does k(un)map_atomic and memcpy? I could try that if it makes
sense to people.

>> Switching the for_each_sg to sg_miter is probably the nicer solution as
>> it takes care of the mapping and the offset/length accounting for you
>> and will have similar performance.
> 
> sg_miter will still fail when the sg contains __iomem, however I would
> expect that the sg_copy will work with iomem, by using the __iomem
> memcpy variant.

Yes, that's true. Any sg_miters that ever see iomem will need to be
converted to support it. This isn't much different than the other
kmap(sg_page()) users I was converting that will also fail if they see
iomem. Though, I suspect an sg_miter user would be easier to convert to
iomem than a random kmap user.

Logan

^ permalink raw reply

* Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
From: Jay Vosburgh @ 2017-04-27 23:08 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, Honggang LI,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <733d454d3c36e99b55de5374c7664364975b171d.1493313626.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

>On slave list updates, the bonding driver computes its hard_header_len
>as the maximum of all enslaved devices's hard_header_len.
>If the slave list is empty, e.g. on last enslaved device removal,
>ETH_HLEN is used.
>
>Since the bonding header_ops are set only when the first enslaved
>device is attached, the above can lead to header_ops->create()
>being called with the wrong skb headroom in place.
>
>If bond0 is configured on top of ipoib devices, with the
>following commands:
>
>ifup bond0
>for slave in $BOND_SLAVES_LIST; do
>	ip link set dev $slave nomaster
>done
>ping -c 1 <ip on bond0 subnet>
>
>we will obtain a skb_under_panic() with a similar call trace:
>	skb_push+0x3d/0x40
>	push_pseudo_header+0x17/0x30 [ib_ipoib]
>	ipoib_hard_header+0x4e/0x80 [ib_ipoib]
>	arp_create+0x12f/0x220
>	arp_send_dst.part.19+0x28/0x50
>	arp_solicit+0x115/0x290
>	neigh_probe+0x4d/0x70
>	__neigh_event_send+0xa7/0x230
>	neigh_resolve_output+0x12e/0x1c0
>	ip_finish_output2+0x14b/0x390
>	ip_finish_output+0x136/0x1e0
>	ip_output+0x76/0xe0
>	ip_local_out+0x35/0x40
>	ip_send_skb+0x19/0x40
>	ip_push_pending_frames+0x33/0x40
>	raw_sendmsg+0x7d3/0xb50
>	inet_sendmsg+0x31/0xb0
>	sock_sendmsg+0x38/0x50
>	SYSC_sendto+0x102/0x190
>	SyS_sendto+0xe/0x10
>	do_syscall_64+0x67/0x180
>	entry_SYSCALL64_slow_path+0x25/0x25
>
>This change addresses the issue avoiding updating the bonding device
>hard_header_len when the slaves list become empty, forbidding to
>shrink it below the value used by header_ops->create().
>
>The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
>hard_header_len") but the panic can be triggered only since
>commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
>header").
>
>Reported-by: Norbert P <noe-PRwTpj6vllL463JZfw7VRw@public.gmane.org>
>Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len")
>Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header")
>Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Jay Vosburgh <jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>


> drivers/net/bonding/bond_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 8a4ba8b..34481c9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond)
> 		gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
> 		gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
> 	}
>+	bond_dev->hard_header_len = max_hard_header_len;
> 
> done:
> 	bond_dev->vlan_features = vlan_features;
> 	bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
>-	bond_dev->hard_header_len = max_hard_header_len;
> 	bond_dev->gso_max_segs = gso_max_segs;
> 	netif_set_gso_max_size(bond_dev, gso_max_size);
> 
>-- 
>2.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Jason Gunthorpe @ 2017-04-27 23:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
	Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <3a7c0d27-0744-4e91-b37f-3885c50455e8-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

On Thu, Apr 27, 2017 at 05:03:45PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> > Well, that is in the current form, with more users it would make sense
> > to optimize for the single page case, eg by providing the existing
> > call, providing a faster single-page-only variant of the copy, perhaps
> > even one that is inlined.
> 
> Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
> such... I'm having trouble thinking of a sane name that isn't too long).
> That just does k(un)map_atomic and memcpy? I could try that if it makes
> sense to people.

It seems the most robust: test for iomem, and jump to a slow path
copy, otherwise inline the kmap and memcpy

Every place doing memcpy from sgl will need that pattern to be
correct.

> > sg_miter will still fail when the sg contains __iomem, however I would
> > expect that the sg_copy will work with iomem, by using the __iomem
> > memcpy variant.
> 
> Yes, that's true. Any sg_miters that ever see iomem will need to be
> converted to support it. This isn't much different than the other
> kmap(sg_page()) users I was converting that will also fail if they see
> iomem. Though, I suspect an sg_miter user would be easier to convert to
> iomem than a random kmap user.

How? sg_miter seems like the next nightmare down this path, what is
sg_miter_next supposed to do when something hits an iomem sgl?

miter.addr is supposed to be a kernel pointer that must not be
__iomem..

Jason

^ permalink raw reply

* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 23:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Boris Ostrovsky, linux-nvdimm, dri-devel, Stephen Bates, dm-devel,
	target-devel, Christoph Hellwig, devel, James E.J. Bottomley,
	linux-scsi, Matthew Wilcox, linux-rdma, Sumit Semwal,
	Ross Zwisler, open-iscsi, linux-media, Juergen Gross,
	Julien Grall, Konrad Rzeszutek Wilk, intel-gfx, sparmaintainer,
	linux-raid, Dan Williams, megaraidlinux.pdl, Jens Axboe,
	"Martin K. Petersen" <martin.p
In-Reply-To: <20170427232022.GA30398@obsidianresearch.com>



On 27/04/17 05:20 PM, Jason Gunthorpe wrote:
> It seems the most robust: test for iomem, and jump to a slow path
> copy, otherwise inline the kmap and memcpy
> 
> Every place doing memcpy from sgl will need that pattern to be
> correct.

Ok, sounds like a good place to start to me. I'll see what I can do for
a v3 of this set. Though, I probably won't send anything until after the
merge window.

>>> sg_miter will still fail when the sg contains __iomem, however I would
>>> expect that the sg_copy will work with iomem, by using the __iomem
>>> memcpy variant.
>>
>> Yes, that's true. Any sg_miters that ever see iomem will need to be
>> converted to support it. This isn't much different than the other
>> kmap(sg_page()) users I was converting that will also fail if they see
>> iomem. Though, I suspect an sg_miter user would be easier to convert to
>> iomem than a random kmap user.
> 
> How? sg_miter seems like the next nightmare down this path, what is
> sg_miter_next supposed to do when something hits an iomem sgl?

My proposal is roughly included in the draft I sent upthread. We add an
sg_miter flag indicating the iteratee supports iomem and if miter finds
iomem (with the support flag set) it sets ioaddr which is __iomem. The
iteratee then just needs to null check addr and ioaddr and perform the
appropriate action.

Logan

^ permalink raw reply

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Alexei Starovoitov @ 2017-04-27 23:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Andy Gospodarek
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Daniel Borkmann, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org
In-Reply-To: <20170427104121.32df2178@redhat.com>

On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> When registering/attaching a XDP/bpf program, we would just send the
> file-descriptor for this port-map along (like we do with the bpf_prog
> FD). Plus, it own ingress-port number this program is in the port-map.
>
> It is not clear to me, in-which-data-structure on the kernel-side we
> store this reference to the port-map and ingress-port. As today we only
> have the "raw" struct bpf_prog pointer. I see several options:
>
> 1. Create a new xdp_prog struct that contains existing bpf_prog,
> a port-map pointer and ingress-port. (IMHO easiest solution)
>
> 2. Just create a new pointer to port-map and store it in driver rx-ring
> struct (like existing bpf_prog), but this create a race-challenge
> replacing (cmpxchg) the program (or perhaps it's not a problem as it
> runs under rcu and RTNL-lock).
>
> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
> fast-way to access it.  I assume it will be accessible via
> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.

I'm not sure I completely follow the 3 proposals.
Are you suggesting to have only one netdev_array per program?
Why not to allow any number like we do for tailcall+prog_array, etc?
We can teach verifier to allow new helper
bpf_tx_port(netdev_array, port_num);
to only be used with netdev_array map type.
It will fetch netdevice pointer from netdev_array[port_num]
and will tx the packet into it.
We can make it similar to bpf_tail_call(), so that program will
finish on successful bpf_tx_port() or
make it into 'delayed' tx which will be executed when program finishes.
Not sure which approach is better.

We can also extend this netdev_array into broadcast/multicast. Like
bpf_tx_allports(&netdev_array);
call from the program will xmit the packet to all netdevices
in that 'netdev_array' map type.

The map-in-map support can be trivially extended to allow netdev_array,
then the program can create N multicast groups of netdevices.
Each multicast group == one netdev_array map.
The user space will populate a hashmap with these netdev_arrays and
bpf kernel side can select dynamically which multicast group to use
to send the packets to.
bpf kernel side may look like:
struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
if (!netdev_array)
   ...
if (my_condition)
    bpf_tx_allports(netdev_array);  /* broadcast to all netdevices */
else
    bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */

that's an artificial example. Just trying to point out
that we shouldn't restrict the feature too soon.

^ permalink raw reply

* Re: [Patch net-next] ipv4: get rid of ip_ra_lock
From: Cong Wang @ 2017-04-27 23:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <1493297179.6453.105.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Apr 27, 2017 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-04-26 at 13:55 -0700, Cong Wang wrote:
>> After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
>> we always take RTNL lock for ip_ra_control() which is the only place
>> we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
>> we just need to disable BH there.
>>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>
> Looks great, but reading again this code, I believe we do not need to
> disable BH at all ?
>

Hmm, if we don't disable BH here, a reader in BH could jump in and
break this critical section? Or that is fine for RCU?

^ permalink raw reply

* Re: [PATCH iproute2] routel: fix infinite loop in line parser
From: Stephen Hemminger @ 2017-04-27 23:46 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: netdev
In-Reply-To: <20170427094347.3EB63A0F1C@unicorn.suse.cz>

On Thu, 27 Apr 2017 11:43:47 +0200 (CEST)
Michal Kubecek <mkubecek@suse.cz> wrote:

> As noticed by one of the few users of routel script, it ends up in an
> infinite loop when they pull out the cable from the NIC used for some
> route. This is caused by its parser expecting the line of "ip route show"
> output consists of "key value" pairs (except for the initial target range),
> together with an old trap of Bourne style shells that "shift 2" does
> nothing if there is only one argument left. Some keywords, e.g. "linkdown",
> are not followed by a value.
> 
> Improve the parser to
> 
>   (1) only set variables for keywords we care about
>   (2) recognize (currently) known keywords without value
> 
> This is still far from perfect (and certainly not future proof) but to
> fully fix the script, one would probably have to rewrite the logic
> completely (and I'm not sure it's worth the effort).
> 
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

Appled, but this really needs to be done better.
Either as a simplified output of route command. See ip -br link
Or ip route should have a json output option and use python/perl/xss
to reformat.

^ permalink raw reply

* Re: [Patch net-next] ipv4: get rid of ip_ra_lock
From: Eric Dumazet @ 2017-04-27 23:54 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpXexP9OzzqyPJ-yHmyq-ZF=cNboaG8566yaxZKbn0+TTg@mail.gmail.com>

On Thu, 2017-04-27 at 16:46 -0700, Cong Wang wrote:
> On Thu, Apr 27, 2017 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2017-04-26 at 13:55 -0700, Cong Wang wrote:
> >> After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
> >> we always take RTNL lock for ip_ra_control() which is the only place
> >> we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
> >> we just need to disable BH there.
> >>
> >> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >> ---
> >
> > Looks great, but reading again this code, I believe we do not need to
> > disable BH at all ?
> >
> 
> Hmm, if we don't disable BH here, a reader in BH could jump in and
> break this critical section? Or that is fine for RCU?

It should be fine for RCU.

The spinlock (or mutex if this is RTNL) is protecting writers among
themselves. Here it should run in process context, with no specific
rules to disable preemption, hard or soft irqs.

The reader(s) do not care of how writer(s) enforce their mutual
protection, and if writer(s) disable hard or soft irqs.

^ permalink raw reply

* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: Roopa Prabhu @ 2017-04-28  0:11 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: David Ahern, Vladislav Yasevich, netdev@vger.kernel.org,
	Jiri Pirko
In-Reply-To: <8986b8c8-9bf1-21fc-49e5-e196630cd318@redhat.com>

On Thu, Apr 27, 2017 at 12:51 PM, Vlad Yasevich <vyasevic@redhat.com> wrote:
> On 04/24/2017 11:14 AM, Roopa Prabhu wrote:
>> On Sun, Apr 23, 2017 at 6:07 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>>
>>> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>>>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>>>       return err;
>>>>  }
>>>>
>>>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>>>> +{
>>>> +     u32 rtnl_event;
>>>> +
>>>> +     switch (event) {
>>>> +     case NETDEV_REBOOT:
>>>> +             rtnl_event = IFLA_EVENT_REBOOT;
>>>> +             break;
>>>> +     case NETDEV_FEAT_CHANGE:
>>>> +             rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>>>> +             break;
>>>> +     case NETDEV_BONDING_FAILOVER:
>>>> +             rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>>>> +             break;
>>>> +     case NETDEV_NOTIFY_PEERS:
>>>> +             rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>>>> +             break;
>>>> +     case NETDEV_RESEND_IGMP:
>>>> +             rtnl_event = IFLA_EVENT_RESEND_IGMP;
>>>> +             break;
>>>> +     case NETDEV_CHANGEINFODATA:
>>>> +             rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>>>> +             break;
>>>> +     default:
>>>> +             return 0;
>>>> +     }
>>>> +
>>>> +     return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>>>> +}
>>>> +
>>>
>>> I still have doubts about encoding kernel events into a uapi.
>>
>> agree. I don't see why user-space will need NETDEV_CHANGEINFODATA and
>> others david listed.
>>
>
> Well, I am not sure about CHANGEINFODATA as well, but I can see use
> cases for others.
>
>> My other concerns are, once we have this exposed to user-space and
>> user-space starts relying on it, it will need accurate information and
>> will expect to have this event information all the time.
>> IIUC, we cannot cover multiple events in a single notification and not
>> all link notifications will contain an IFLA_EVENT attribute.
>
> Uhm...  If the rtnetlink message was a result of an event, it will have
> an IFLA_EVENT.  If a message is something else, then it will not have
> an event.  That's the point.  Not all netlink attributes are in every
> netlink message.
>
>> In other
>> words, we will be telling user-space to not expect that the kernel
>> will send IFLA_EVENT every time.
>>
>
> No, we are telling the user that if it is interested in a specific event
> (let's say NOTIFY_PEERS or RESEND_IGMP), then it now can monitor netlink
> traffic for those events.
> As things stand right now, that's not possible.
>
> I've done this specifically for all events for which we currently generate
> a netlink message.
>
> The only concern I have is that if in the future we remove a certain netdev
> event, it may impact applications.  But we may be doing it right now as well,
> only silently, and the apps may have to find some ways to work around it.
>

ok, fair enough. it might be ok then....except for the specific
attributes that user-space may not be interested like CHANGEINFODATA.

^ permalink raw reply

* RE: [PATCH net] driver/net: Fix possible memleaks when fail to register_netdevice
From: Gao Feng @ 2017-04-28  0:27 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: jiri, davem, kuznet, jmorris, yoshfuji, kaber, steffen.klassert,
	netdev, 'Gao Feng'
In-Reply-To: <20170427081559.GA1058@gondor.apana.org.au>

> From: Gao Feng [mailto:gfree.wind@foxmail.com]
> Sent: Thursday, April 27, 2017 4:33 PM
> > From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
> > Sent: Thursday, April 27, 2017 4:16 PM On Tue, Apr 25, 2017 at
> > 08:01:50PM +0800, gfree.wind@foxmail.com wrote:
> > > From: Gao Feng <fgao@ikuai8.com>
> > >
> [...]
> >
[...]

I think I get one method which could be safe to free the mem in ndo_uninit.
Thanks, I would send the v2 patch later.

Best Regards
Feng

^ permalink raw reply

* [PATCH net-next] virtio-net: use netif_tx_napi_add for tx napi
From: Willem de Bruijn @ 2017-04-28  0:37 UTC (permalink / raw)
  To: netdev; +Cc: mst, davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82f1c3a73345..7877551fe4e0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2228,8 +2228,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		vi->rq[i].pages = NULL;
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
-		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
-			       napi_tx ? napi_weight : 0);
+		netif_tx_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+				  napi_tx ? napi_weight : 0);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* prog ID and next steps. Was: [RFC net-next 0/2] Introduce bpf_prog ID and iteration
From: Alexei Starovoitov @ 2017-04-28  1:11 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Martin KaFai Lau, netdev
  Cc: Daniel Borkmann, kernel-team, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Thomas Graf
In-Reply-To: <e81805c8-0499-a0a5-b788-0168947d9b8c@stressinduktion.org>

On 4/27/17 6:36 AM, Hannes Frederic Sowa wrote:
> On 27.04.2017 08:24, Martin KaFai Lau wrote:
>> This patchset introduces the bpf_prog ID and a new bpf cmd to
>> iterate all bpf_prog in the system.
>>
>> It is still incomplete.  The idea can be extended to bpf_map.
>>
>> Martin KaFai Lau (2):
>>   bpf: Introduce bpf_prog ID
>>   bpf: Test for bpf_prog ID and BPF_PROG_GET_NEXT_ID
>
> Thanks Martin, I like the approach.
>
> I think the progid is also much more suitable to be used in kallsyms
> because it handles collisions correctly and let's correctly walk the
> chain (for example imaging loading two identical programs but install
> them at different hooks, kallsysms doesn't allow to find out which
> program is installed where).

i disagree re: kallsyms. The goal of prog_tag is to let program writers
understand which program is running in a stable way.
id is assigned dynamically and not suitable for that purpose.

> It would help a lot if you could pass the prog_id back during program
> creation, otherwise it will be kind of difficult to get a hold on which
> program is where. ;)

yes, but not a creation time. bpf_prog_load command will keep returning
an FD and all operations on programs will be allowed with FD only.

Think of this 'ID' as program handle or program pointer.
In other words it's obfuscated kernel 'struct bpf_prog *' given to
user space, so that user space can later convert this ID into FD.
The other patch (not shown) will take ID from user space and will
convert it to FD if prog->aux->user is the same or root.

We tried really hard to keep everything FD based. Unfortunately
netlink is not suitable to pass FDs, so to query TC and XDP
we either have to invent a way to install FD from netlink in recvmsg()
or pass something that can be converted to FD later.
That's what program ID is solving.

This set of patches look trivial with simple use of idr,
but it took us long time to get there.
We tried to use 64-bit ID to avoid wrap around issue, but association
between ID and bpf_prog needs to be kept somewhere. The obvious
answer is rhashtable, but it cannot be iterated easily.
Like we'd need to dump the whole thing through bpf syscall which
is not practical.
Then we tried to use 32-bit idr's id + 32-bit timestamp/random.
It works better, but then we hit the issue that bpf_prog_get_next_id
cannot be iterated in a stable way when programs are being deleted
while user space iterates over the whole list.
So at the end we scraped all the fancy things and went with
simple 32-bit ID allocated in _cyclic_ way via idr.
The reason for cyclic is to avoid prog delete/create races,
so ID seen by user space stays stable for 2B ids.
We were concerned that somebody might try to load/delete
a program 2B times to cause the counter to wrap around, but
it turned out not to be an issue. In that sense prog ID is similar
to PID.

So more complete picture of what we're trying to do:
- new bpf_get_fd_from_id syscall cmd will be used to convert
   prog ID into prog FD
- tc/xdp/sockets/tracing attachment points will return prog ID
- existing bpf_map_lookup() cmd from prog_array will be returning
   prog ID
- bpf_prog_next_id syscall cmd (this patch) is used to iterate
   over all prog IDs
- new bpf_prog_get_info syscall cmd (based on prog FD) will be used
   to get all or partial info about the program that kernel knows about

Example usage:
- if user space want to see instructions of all loaded programs
   it can use a loop like:
while (!bpf_prog_get_next_id(next_id, &next_id)) {
    int fd = bpf_prog_get_fd_from_id(next_id);
    struct bpf_prog_info info;
    bpf_prog_get_info(fd, &info, flags);
    // look into info.insns[]
    close(fd);
}

- if user space want to see prog_tag of xdp program attached to eth0
   // netlink sendmsg() into ifindex of eth0 that returns prog ID
    int fd = bpf_prog_get_fd_from_id(id_from_netlink);
    struct bpf_prog_info info;
    bpf_prog_get_info(fd, &info, flags);
    // look into info.prog_tag
    close(fd);

the 'flags' argument of bpf_prog_get_info() will be used
to tell kernel which info about the program needs to be dumped.
Otherwise if kernel always dumps everything about the program,
it will make the syscall too slow and too cumbersome.
Possible combinations:
- prog_type, prog_tag, license, prog ID
- array of prog instructions
- array of map IDs
Here we'll introduce similar IDs for maps and
bpf_map_get_info() syscall cmd that will return map_type, map_id, sizes.
If user wants to iterate over all elements of the map, they can
use map_fd = bpf_map_get_fd_from_id(map_id); command
and later use existing bpf_map_get_next_key+bpf_map_lookup_elem.

We believe this way the user space will be able to see _everything_
about bpf programs and maps and can pick and choose whether
it wants to see only programs or only maps or partial info
about progs (without instructions) and so on.

Once we have CTF (debug info) available for maps and progs,
we will extend bpf_prog_get_info() and bpf_map_get_info()
commands to optionally return that as well.

^ permalink raw reply

* Re: [PATCH net-next v8 2/3] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Jamal Hadi Salim @ 2017-04-28  1:22 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: davem, xiyou.wangcong, eric.dumazet, netdev, Simon Horman,
	Benjamin LaHaise
In-Reply-To: <20170427063039.GB1870@nanopsycho.orion>

On 17-04-27 02:30 AM, Jiri Pirko wrote:
> Wed, Apr 26, 2017 at 10:07:08PM CEST, jhs@mojatatu.com wrote:
>> On 17-04-26 09:56 AM, Jiri Pirko wrote:
>>> Wed, Apr 26, 2017 at 03:14:38PM CEST, jhs@mojatatu.com wrote:

>> I think to have flags at that level is useful but it
>> is a different hierarchy level. I am not sure the
>> "actions dump large messages" is a fit for that level.
>
> Jamal, the idea is to have exactly what you want to have. Only does not
> use NLA_U32 attr for that but a special attr NLA_FLAGS which would have
> well defined semantics and set of helpers to work with and enforce it.
>
> Then, this could be easily reused in other subsystem that uses netlink
>

Maybe I am misunderstanding:
Recall, this is what it looks like with this patchset:
<nlh><subsytem-header>[TCA_ROOT_XXXX]

TCA_ROOT_XXX is very subsystem specific. classifiers, qdiscs and many
subsystems defined their own semantics for that TLV level. This specific
"dump max" is very very specific to actions. They were crippled by the
fact you could only send 32 at a time - this allows more to be sent.

I thought initially you meant:
<nlh>[NLA_XXX]<subsytem-header>[TCA_ROOT_XXXX]

I think at the NLA_XXX you could fit netlink wide TLVs - but if i said
"do a large dump" it is of no use to any other subsystem.

cheers,
jamal

^ permalink raw reply

* Re: [patch net-next 10/10] net: sched: extend gact to allow jumping to another filter chain
From: Jamal Hadi Salim @ 2017-04-28  1:41 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, xiyou.wangcong, dsa, edumazet, stephen, daniel,
	alexander.h.duyck, mlxsw, simon.horman
In-Reply-To: <1493291540-2119-11-git-send-email-jiri@resnulli.us>


Jiri,

Good stuff!
Thanks for the effort.

I didnt review the details - will do. I wanted to raise one issue.
This should work for all actions, not just gact (refer to the
recent commit i made on the action jumping).

Example policy for policer:

#if packets destined for mac address 52:54:00:3d:c7:6d
#exceed 90kbps with burst of 90K then jump to chain 11
#for further classification, otherwise set their skb mark to 11
# and proceed.

tc filter add dev eth0 parent ffff: protocol ip pref 33 \
flower dst_mac 52:54:00:3d:c7:6d \
action police rate 1kbit burst 90k conform-exceed pipe/goto chain 11 \
action skbedit mark 11

But i should also be able to do this for any other action, etc.

For this to work, you have to be able to encode the action in the
opcode. Something like (for 2^16 chains):

#define TC_ACT_GOTO_CHAIN	0x20000000
#define TCA_ACT_MAX_CHAIN_MASK 0xFFFF

So 0x20000001 is encoding of chain 1 etc.

I will post the iproute2 code i used for jumping of actions.

cheers,
jamal

On 17-04-27 07:12 AM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Introduce new type of gact action called "goto_chain". This allows
> user to specify a chain to be processed. This action type is
> then processed as a return value in tcf_classify loop in similar
> way as "reclassify" is, only it does not reset to the first filter
> in chain but rather reset to the first filter of the desired chain.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/net/sch_generic.h           |  9 +++++--
>  include/net/tc_act/tc_gact.h        |  2 ++
>  include/uapi/linux/pkt_cls.h        |  1 +
>  include/uapi/linux/tc_act/tc_gact.h |  1 +
>  net/sched/act_gact.c                | 48 ++++++++++++++++++++++++++++++++++++-
>  net/sched/cls_api.c                 |  8 +++++--
>  6 files changed, 64 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 569b565..3688501 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -193,8 +193,13 @@ struct Qdisc_ops {
>
>
>  struct tcf_result {
> -	unsigned long	class;
> -	u32		classid;
> +	union {
> +		struct {
> +			unsigned long	class;
> +			u32		classid;
> +		};
> +		const struct tcf_proto *goto_tp;
> +	};
>  };
>
>  struct tcf_proto_ops {
> diff --git a/include/net/tc_act/tc_gact.h b/include/net/tc_act/tc_gact.h
> index b6f1739..58bee54 100644
> --- a/include/net/tc_act/tc_gact.h
> +++ b/include/net/tc_act/tc_gact.h
> @@ -12,6 +12,8 @@ struct tcf_gact {
>  	int			tcfg_paction;
>  	atomic_t		packets;
>  #endif
> +	struct tcf_chain	*goto_chain;
> +	struct rcu_head		rcu;
>  };
>  #define to_gact(a) ((struct tcf_gact *)a)
>
> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
> index f1129e3..e03ba27 100644
> --- a/include/uapi/linux/pkt_cls.h
> +++ b/include/uapi/linux/pkt_cls.h
> @@ -37,6 +37,7 @@ enum {
>  #define TC_ACT_QUEUED		5
>  #define TC_ACT_REPEAT		6
>  #define TC_ACT_REDIRECT		7
> +#define TC_ACT_GOTO_CHAIN	8
>  #define TC_ACT_JUMP		0x10000000
>
>  /* Action type identifiers*/
> diff --git a/include/uapi/linux/tc_act/tc_gact.h b/include/uapi/linux/tc_act/tc_gact.h
> index 70b536a..388733d 100644
> --- a/include/uapi/linux/tc_act/tc_gact.h
> +++ b/include/uapi/linux/tc_act/tc_gact.h
> @@ -26,6 +26,7 @@ enum {
>  	TCA_GACT_PARMS,
>  	TCA_GACT_PROB,
>  	TCA_GACT_PAD,
> +	TCA_GACT_CHAIN,
>  	__TCA_GACT_MAX
>  };
>  #define TCA_GACT_MAX (__TCA_GACT_MAX - 1)
> diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
> index c527c11..d63aebd 100644
> --- a/net/sched/act_gact.c
> +++ b/net/sched/act_gact.c
> @@ -20,6 +20,7 @@
>  #include <linux/init.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
> +#include <net/pkt_cls.h>
>  #include <linux/tc_act/tc_gact.h>
>  #include <net/tc_act/tc_gact.h>
>
> @@ -54,6 +55,7 @@ static g_rand gact_rand[MAX_RAND] = { NULL, gact_net_rand, gact_determ };
>  static const struct nla_policy gact_policy[TCA_GACT_MAX + 1] = {
>  	[TCA_GACT_PARMS]	= { .len = sizeof(struct tc_gact) },
>  	[TCA_GACT_PROB]		= { .len = sizeof(struct tc_gact_p) },
> +	[TCA_GACT_CHAIN]	= { .type = NLA_U32 },
>  };
>
>  static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
> @@ -92,6 +94,9 @@ static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
>  	}
>  #endif
>
> +	if (parm->action == TC_ACT_GOTO_CHAIN && !tb[TCA_GACT_CHAIN])
> +		return -EINVAL;
> +
>  	if (!tcf_hash_check(tn, parm->index, a, bind)) {
>  		ret = tcf_hash_create(tn, parm->index, est, a,
>  				      &act_gact_ops, bind, true);
> @@ -121,11 +126,43 @@ static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
>  		gact->tcfg_ptype   = p_parm->ptype;
>  	}
>  #endif
> +
> +	if (gact->tcf_action == TC_ACT_GOTO_CHAIN) {
> +		u32 chain_index = nla_get_u32(tb[TCA_GACT_CHAIN]);
> +
> +		if (!tp) {
> +			if (ret == ACT_P_CREATED)
> +				tcf_hash_release(*a, bind);
> +			return -EINVAL;
> +		}
> +		gact->goto_chain = tcf_chain_get(tp->chain->block, chain_index);
> +		if (!gact->goto_chain) {
> +			if (ret == ACT_P_CREATED)
> +				tcf_hash_release(*a, bind);
> +			return -ENOMEM;
> +		}
> +	}
> +
>  	if (ret == ACT_P_CREATED)
>  		tcf_hash_insert(tn, *a);
>  	return ret;
>  }
>
> +static void tcf_gact_cleanup_rcu(struct rcu_head *rcu)
> +{
> +	struct tcf_gact *gact = container_of(rcu, struct tcf_gact, rcu);
> +
> +	if (gact->tcf_action == TC_ACT_GOTO_CHAIN)
> +		tcf_chain_put(gact->goto_chain);
> +}
> +
> +static void tcf_gact_cleanup(struct tc_action *a, int bind)
> +{
> +	struct tcf_gact *gact = to_gact(a);
> +
> +	call_rcu(&gact->rcu, tcf_gact_cleanup_rcu);
> +}
> +
>  static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  		    struct tcf_result *res)
>  {
> @@ -141,8 +178,13 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  	}
>  #endif
>  	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
> -	if (action == TC_ACT_SHOT)
> +	if (action == TC_ACT_SHOT) {
>  		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
> +	} else if (action == TC_ACT_GOTO_CHAIN) {
> +		struct tcf_chain *chain = gact->goto_chain;
> +
> +		res->goto_tp = rcu_dereference_bh(chain->filter_chain);
> +	}
>
>  	tcf_lastuse_update(&gact->tcf_tm);
>
> @@ -194,6 +236,9 @@ static int tcf_gact_dump(struct sk_buff *skb, struct tc_action *a,
>  	tcf_tm_dump(&t, &gact->tcf_tm);
>  	if (nla_put_64bit(skb, TCA_GACT_TM, sizeof(t), &t, TCA_GACT_PAD))
>  		goto nla_put_failure;
> +	if (gact->tcf_action == TC_ACT_GOTO_CHAIN &&
> +	    nla_put_u32(skb, TCA_GACT_CHAIN, gact->goto_chain->index))
> +		goto nla_put_failure;
>  	return skb->len;
>
>  nla_put_failure:
> @@ -225,6 +270,7 @@ static struct tc_action_ops act_gact_ops = {
>  	.stats_update	=	tcf_gact_stats_update,
>  	.dump		=	tcf_gact_dump,
>  	.init		=	tcf_gact_init,
> +	.cleanup	=	tcf_gact_cleanup,
>  	.walk		=	tcf_gact_walker,
>  	.lookup		=	tcf_gact_search,
>  	.size		=	sizeof(struct tcf_gact),
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index dbc1348..a2d6bc7 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -304,10 +304,14 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>  			continue;
>
>  		err = tp->classify(skb, tp, res);
> -		if (unlikely(err == TC_ACT_RECLASSIFY && !compat_mode))
> +		if (err == TC_ACT_RECLASSIFY && !compat_mode) {
>  			goto reset;
> -		if (err >= 0)
> +		} else if (err == TC_ACT_GOTO_CHAIN) {
> +			old_tp = res->goto_tp;
> +			goto reset;
> +		} else if (err >= 0) {
>  			return err;
> +		}
>  	}
>
>  	return TC_ACT_UNSPEC; /* signal: continue lookup */
>

^ permalink raw reply

* [PATCH] drivers:net:ethernet:emulex:benet: Use time_before_eq for time comparison
From: Karim Eshapa @ 2017-04-28  1:48 UTC (permalink / raw)
  To: sathya.perla
  Cc: ajit.khaparde, sriharsha.basavapatna, somnath.kotur, netdev,
	linux-kernel, Karim Eshapa, Karim Eshapa

Use time_before_eq for time comparison more safe and dealing
with timer wrapping to be future-proof.

Signed-off-by: Karim Eshapa <kaim.eshapa@gmail.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 6be3b9a..56d2368 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -5217,15 +5217,15 @@ static bool be_err_is_recoverable(struct be_adapter *adapter)
 	dev_err(&adapter->pdev->dev, "Recoverable HW error code: 0x%x\n",
 		ue_err_code);
 
-	if (jiffies - err_rec->probe_time <= initial_idle_time) {
+	if (time_before_eq(jiffies - err_rec->probe_time, initial_idle_time)) {
 		dev_err(&adapter->pdev->dev,
 			"Cannot recover within %lu sec from driver load\n",
 			jiffies_to_msecs(initial_idle_time) / MSEC_PER_SEC);
 		return false;
 	}
 
-	if (err_rec->last_recovery_time &&
-	    (jiffies - err_rec->last_recovery_time <= recovery_interval)) {
+	if (err_rec->last_recovery_time && time_before_eq(
+		jiffies - err_rec->last_recovery_time, recovery_interval)) {
 		dev_err(&adapter->pdev->dev,
 			"Cannot recover within %lu sec from last recovery\n",
 			jiffies_to_msecs(recovery_interval) / MSEC_PER_SEC);
-- 
2.7.4

^ permalink raw reply related

* iproute2: tc action jump
From: Jamal Hadi Salim @ 2017-04-28  1:54 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 161 bytes --]


Attached.
Needs some cleanup and more testing (I only tested a few of these
actions - should work, just need to vet before official submission).

cheers,
jamal

[-- Attachment #2: tc-jump-actions --]
[-- Type: text/plain, Size: 9603 bytes --]

diff --git a/tc/m_connmark.c b/tc/m_connmark.c
index 295f90d..bdb6195 100644
--- a/tc/m_connmark.c
+++ b/tc/m_connmark.c
@@ -81,7 +81,7 @@ parse_connmark(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	}
 
 	sel.action = TC_ACT_PIPE;
-	if (argc && !action_a2n(*argv, &sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_csum.c b/tc/m_csum.c
index 0ee8cad..e781520 100644
--- a/tc/m_csum.c
+++ b/tc/m_csum.c
@@ -123,7 +123,7 @@ parse_csum(struct action_util *a, int *argc_p,
 		return -1;
 	}
 
-	if (argc && !action_a2n(*argv, &sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_gact.c b/tc/m_gact.c
index 755a3be..7a93d27 100644
--- a/tc/m_gact.c
+++ b/tc/m_gact.c
@@ -68,15 +68,22 @@ usage(void)
 	exit(-1);
 }
 
+/* XXX: Obsolete this function */
 static int
-get_act(char ***argv_p)
+get_act(int *argc_p, char ***argv_p)
 {
+	char **argv = *argv_p;
+	int argc = *argc_p;
 	int n;
 
-	if (action_a2n(**argv_p, &n, false)) {
+	if (action_a2n_jmp(&argc, &argv, &n, false)) {
 		fprintf(stderr, "bad action type %s\n", **argv_p);
 		return -10;
 	}
+
+	*argc_p = argc;
+	*argv_p = argv;
+
 	return n;
 }
 
@@ -102,7 +109,7 @@ parse_gact(struct action_util *a, int *argc_p, char ***argv_p,
 	if (matches(*argv, "gact") == 0) {
 		ok++;
 	} else {
-		action = get_act(&argv);
+		action = get_act(&argc, &argv);
 		if (action != -10) {
 			p.action = action;
 			ok++;
@@ -133,13 +140,15 @@ parse_gact(struct action_util *a, int *argc_p, char ***argv_p,
 				return -1;
 			}
 
-			action = get_act(&argv);
+			action = get_act(&argc, &argv);
 			if (action != -10) { /* FIXME */
 				pp.paction = action;
 			} else {
 				explain();
 				return -1;
 			}
+			//XXX: remove the argc below after testing with random
+			//..
 			argc--;
 			argv++;
 			if (get_u16(&pp.pval, *argv, 10)) {
diff --git a/tc/m_ife.c b/tc/m_ife.c
index f6131b1..23eb637 100644
--- a/tc/m_ife.c
+++ b/tc/m_ife.c
@@ -156,7 +156,7 @@ static int parse_ife(struct action_util *a, int *argc_p, char ***argv_p,
 		argv++;
 	}
 
-	if (argc && !action_a2n(*argv, &p.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &p.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_mirred.c b/tc/m_mirred.c
index e943890..91c6f83 100644
--- a/tc/m_mirred.c
+++ b/tc/m_mirred.c
@@ -172,7 +172,7 @@ parse_direction(struct action_util *a, int *argc_p, char ***argv_p,
 
 	if (argc &&
 	    (p.eaction == TCA_EGRESS_MIRROR || p.eaction == TCA_INGRESS_MIRROR)
-	    && !action_a2n(*argv, &p.action, false))
+	    && !action_a2n_jmp(&argc, &argv, &p.action, false))
 		NEXT_ARG();
 
 	if (argc) {
diff --git a/tc/m_nat.c b/tc/m_nat.c
index 525f185..a985165 100644
--- a/tc/m_nat.c
+++ b/tc/m_nat.c
@@ -115,7 +115,7 @@ parse_nat(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, struct
 		return -1;
 	}
 
-	if (argc && !action_a2n(*argv, &sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_pedit.c b/tc/m_pedit.c
index 8e9bf07..4b8ff1a 100644
--- a/tc/m_pedit.c
+++ b/tc/m_pedit.c
@@ -481,7 +481,7 @@ int parse_pedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 		return -1;
 	}
 
-	if (argc && !action_a2n(*argv, &sel.sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.sel.action, false))
 		NEXT_ARG();
 
 	if (argc) {
diff --git a/tc/m_police.c b/tc/m_police.c
index 226e20e..71e64fe 100644
--- a/tc/m_police.c
+++ b/tc/m_police.c
@@ -50,24 +50,91 @@ static void explain1(char *arg)
 	fprintf(stderr, "Illegal \"%s\"\n", arg);
 }
 
-static int get_police_result(int *action, int *result, char *arg)
+static int get_police_result(int *action, int *result, int *argc_p,
+			     char ***argv_p)
 {
-	char *p = strchr(arg, '/');
+	char **argv = *argv_p;
+	int argc = *argc_p;
+	char *p = strchr(*argv, '/');
+	int skip_args = 0;
 
 	if (p)
 		*p = 0;
 
-	if (action_a2n(arg, action, true)) {
+	if (action_a2n(*argv, action, true)) {
 		if (p)
 			*p = '/';
 		return -1;
 	}
 
+	if (*action == TC_ACT_JUMP) {
+		__u32 act_goto_cnt = 0;
+
+		argv++;
+		argc--;
+		p = strchr(*argv, '/'); /*was at "jump" */
+		if (p)
+			*p = 0;
+
+		if (get_u32(&act_goto_cnt, *argv, 10)) {
+			if (p)
+				*p = '/';
+			fprintf(stderr, "bad action jmp count %s\n",
+				*argv);
+			return -1;
+		}
+
+		if (act_goto_cnt > 32/*MAX_PRIO*/ || !act_goto_cnt) {
+			if (p)
+				*p = '/';
+			fprintf(stderr, "Bad jmp range %s: Need [1,32]\n",
+				*argv);
+			return -1;
+		}
+
+		*action |= act_goto_cnt;
+		argv++;
+		argc--;
+		skip_args += 1;
+	}
+
 	if (p) {
 		*p = '/';
 		if (action_a2n(p+1, result, true))
 			return -1;
 	}
+
+	if (*result == TC_ACT_JUMP) {
+		__u32 goto_cnt = 0;
+
+		if (!skip_args) {
+			argv++;
+			argc--;
+		}
+
+		if (get_u32(&goto_cnt, *argv, 10)) {
+			if (p)
+				*p = '/';
+			fprintf(stderr, "bad result jmp count %s\n",
+				*argv);
+			return -1;
+		}
+
+		if (goto_cnt > 32/*MAX_PRIO*/ || !goto_cnt) {
+			if (p)
+				*p = '/';
+			fprintf(stderr, "Bad result jmp range %s: Need[1,32]\n",
+				*argv);
+			return -1;
+		}
+
+		*result |= goto_cnt;
+		skip_args += 1;
+	}
+
+	*argc_p -= skip_args;
+	*argv_p += skip_args;
+
 	return 0;
 }
 
@@ -179,7 +246,8 @@ int act_parse_police(struct action_util *a, int *argc_p, char ***argv_p,
 			p.action = TC_POLICE_PIPE;
 		} else if (strcmp(*argv, "conform-exceed") == 0) {
 			NEXT_ARG();
-			if (get_police_result(&p.action, &presult, *argv)) {
+			if (get_police_result(&p.action, &presult, &argc,
+					      &argv)) {
 				fprintf(stderr, "Illegal \"action\"\n");
 				return -1;
 			}
diff --git a/tc/m_simple.c b/tc/m_simple.c
index 3a8bd91..d376f51 100644
--- a/tc/m_simple.c
+++ b/tc/m_simple.c
@@ -120,7 +120,7 @@ parse_simple(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 		}
 	}
 
-	if (argc && !action_a2n(*argv, &sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_skbedit.c b/tc/m_skbedit.c
index 638715f..09bab15 100644
--- a/tc/m_skbedit.c
+++ b/tc/m_skbedit.c
@@ -121,7 +121,7 @@ parse_skbedit(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
 	}
 
 	sel.action = TC_ACT_PIPE;
-	if (argc && !action_a2n(*argv, &sel.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &sel.action, false))
 		NEXT_ARG();
 
 	if (argc) {
diff --git a/tc/m_tunnel_key.c b/tc/m_tunnel_key.c
index 3ceec1c..f69b518 100644
--- a/tc/m_tunnel_key.c
+++ b/tc/m_tunnel_key.c
@@ -158,7 +158,7 @@ static int parse_tunnel_key(struct action_util *a, int *argc_p, char ***argv_p,
 		NEXT_ARG_FWD();
 	}
 
-	if (argc && !action_a2n(*argv, &parm.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &parm.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/m_vlan.c b/tc/m_vlan.c
index 44b9375..6837706 100644
--- a/tc/m_vlan.c
+++ b/tc/m_vlan.c
@@ -134,7 +134,7 @@ static int parse_vlan(struct action_util *a, int *argc_p, char ***argv_p,
 	}
 
 	parm.action = TC_ACT_PIPE;
-	if (argc && !action_a2n(*argv, &parm.action, false))
+	if (argc && !action_a2n_jmp(&argc, &argv, &parm.action, false))
 		NEXT_ARG_FWD();
 
 	if (argc) {
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 24ca1f1..71ce910 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -429,7 +429,11 @@ const char *action_n2a(int action)
 	case TC_ACT_STOLEN:
 		return "stolen";
 	default:
-		snprintf(buf, 64, "%d", action);
+		if (action & TC_ACT_JUMP) {
+			snprintf(buf, 64, "jump %d", action&0x1ff);
+		} else {
+			snprintf(buf, 64, "%d", action);
+		}
 		buf[63] = '\0';
 		return buf;
 	}
@@ -459,15 +463,18 @@ int action_a2n(char *arg, int *result, bool allow_num)
 		{"ok", TC_ACT_OK},
 		{"reclassify", TC_ACT_RECLASSIFY},
 		{"pipe", TC_ACT_PIPE},
+		{"jump", TC_ACT_JUMP},
 		{ NULL },
 	}, *iter;
 
 	for (iter = a2n; iter->a; iter++) {
 		if (matches(arg, iter->a) != 0)
 			continue;
+
 		*result = iter->n;
 		return 0;
 	}
+
 	if (!allow_num || sscanf(arg, "%d%c", &n, &dummy) != 1)
 		return -1;
 
@@ -475,6 +482,44 @@ int action_a2n(char *arg, int *result, bool allow_num)
 	return 0;
 }
 
+int action_a2n_jmp(int *argc_p, char ***argv_p, int *result, bool allow_num)
+{
+	int n;
+	char dummy;
+	char **argv = *argv_p;
+	int argc = *argc_p;
+
+	if (action_a2n(*argv, result, false)) {
+		fprintf(stderr, "bad jmp action type %s\n", **argv_p);
+		return -10;
+	}
+
+	if (*result == TC_ACT_JUMP) {
+		__u32 goto_cnt = 0;
+
+		argv++;
+		argc--;
+		if (get_u32(&goto_cnt, *argv, 10)) {
+			fprintf(stderr, "bad action jmp count %s\n",
+				*argv);
+			return -1;
+		}
+
+		if (goto_cnt > 32/*MAX_PRIO*/ || !goto_cnt) {
+			fprintf(stderr, "Bad jmp range %s: Need [1,32]\n",
+				*argv);
+			return -1;
+		}
+
+		*result |= 1;
+	}
+
+	*argc_p = argc;
+	*argv_p = argv;
+
+	return 0;
+}
+
 int get_linklayer(unsigned int *val, const char *arg)
 {
 	int res;
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 4db26c6..228ac98 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -102,6 +102,7 @@ int parse_police(int *argc_p, char ***argv_p, int tca_id, struct nlmsghdr *n);
 
 const char *action_n2a(int action);
 int action_a2n(char *arg, int *result, bool allow_num);
+int action_a2n_jmp(int *argc_p, char ***arg, int *result, bool allow_num);
 int act_parse_police(struct action_util *a, int *argc_p,
 		     char ***argv_p, int tca_id, struct nlmsghdr *n);
 int print_police(struct action_util *a, FILE *f, struct rtattr *tb);

^ permalink raw reply related

* Re: assembler mnenomics for call/tailcall plus maps...
From: Alexei Starovoitov @ 2017-04-28  2:06 UTC (permalink / raw)
  To: David Miller, daniel; +Cc: netdev, xdp-newbies
In-Reply-To: <20170427.164245.1980485298195850482.davem@davemloft.net>

On 4/27/17 1:42 PM, David Miller wrote:
>
> Can you guys give me some kind of idea of how it might be nice to
> represent calls and tailcalls in assembler files?

llvm prints them as 'call 1' for bpf_map_lookup
and 'call 12' for bpf_tail_call.
In the instruction stream bpf_tail_call is no different
than regular call. Only after verifier it becomes special instruction.
This 'call 1234' is obviously not readable.
We probably should allow both 'call 123' to make sure that
we don't need to change assembler/compiler with every new
helper added and allow more sane
'call bpf_map_lookup_elem'
for helpers with known func_id.

> And also the emission of maps.
>
> Right now I just have the assembler looking for 32-bit immediate
> values for call and tailcall instructions.
>
> Looking at samples/bpf/sockex3_kern.c we have:
>
> struct bpf_map_def SEC("maps") jmp_table = {
> 	.type = BPF_MAP_TYPE_PROG_ARRAY,
> 	.key_size = sizeof(u32),
> 	.value_size = sizeof(u32),
> 	.max_entries = 8,
> };

yeah. this one is tricky.
Currently there is a protocol between C file section name
and user space bpf_loader.
The above 'struct bpf_map_def' is recognized by
samples/bpf/bpf_load.c
and by tools/lib/bpf/libbpf.c which is used by perf.
iproute2 allows extended format with two extra fields
for pinning.

So if we write the same stuff in assembler, the existing
loaders will understand it.
The tricky part is in relocations.
To do bpf_map_lookup_elem() the R1 needs to point to map,
so in assembler we need to be able to say something like:

ldimm64 r1, jmp_table

where jmp_table will be in data section named 'maps'.

So in asm the map lookup will look like:
	.section        maps,"aw",@progbits
         .globl  hashmap_def
hashmap_def:
         .long   1  # type
         .long   24 # key_size
         .long   40 # value_size
         .long   256 # max_entries

         .text
         .section        xdp_tx_iptunnel,"ax",@progbits
         .globl  _xdp_prog
_xdp_prog:
          ldimm64 r1, hashmap_def
          mov r2, r10
          add r2, -8
          call bpf_map_lookup_elem

this is 64-bit relo for ldimm64 insn

This is how it's defined in llvm:
ELF_RELOC(R_BPF_NONE,        0)
ELF_RELOC(R_BPF_64_64,       1)
ELF_RELOC(R_BPF_64_32,      10)

The R_BPF_64_64 is the only relocation against .text
The other one is used for relo into dwarf sections.

^ permalink raw reply

* Re: [PATCH net-next 12/18] net: dsa: mv88e6xxx: load STU entry with VTU entry
From: Andrew Lunn @ 2017-04-28  2:07 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20170426155336.5937-13-vivien.didelot@savoirfairelinux.com>

On Wed, Apr 26, 2017 at 11:53:30AM -0400, Vivien Didelot wrote:
> Now that the code writes both VTU and STU data when loading a VTU entry,
> load the corresponding STU entry at the same time.
> 
> This allows us to get rid of the STU management in the
> _mv88e6xxx_vtu_new helper and thus remove the separate implementations
> of STU Load/Purge and STU GetNext, as well as the unused family checks.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 13/18] net: dsa: mv88e6xxx: add VTU GetNext operation
From: Andrew Lunn @ 2017-04-28  2:10 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20170426155336.5937-14-vivien.didelot@savoirfairelinux.com>

On Wed, Apr 26, 2017 at 11:53:31AM -0400, Vivien Didelot wrote:
> Add a new vtu_getnext operation to the chip info structure to differ the
> various implementations of the VTU accesses.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox