Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf 2/5] tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
From: Song Liu @ 2018-08-16 21:26 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, John Fastabend, Networking
In-Reply-To: <20180816194910.9040-3-daniel@iogearbox.net>

On Thu, Aug 16, 2018 at 12:49 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> I found that in BPF sockmap programs once we either delete a socket
> from the map or we updated a map slot and the old socket was purged
> from the map that these socket can never get reattached into a map
> even though their related psock has been dropped entirely at that
> point.
>
> Reason is that tcp_cleanup_ulp() leaves the old icsk->icsk_ulp_ops
> intact, so that on the next tcp_set_ulp_id() the kernel returns an
> -EEXIST thinking there is still some active ULP attached.
>
> BPF sockmap is the only one that has this issue as the other user,
> kTLS, only calls tcp_cleanup_ulp() from tcp_v4_destroy_sock() whereas
> sockmap semantics allow dropping the socket from the map with all
> related psock state being cleaned up.
>
> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  net/ipv4/tcp_ulp.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
> index 7dd44b6..a5995bb 100644
> --- a/net/ipv4/tcp_ulp.c
> +++ b/net/ipv4/tcp_ulp.c
> @@ -129,6 +129,8 @@ void tcp_cleanup_ulp(struct sock *sk)
>         if (icsk->icsk_ulp_ops->release)
>                 icsk->icsk_ulp_ops->release(sk);
>         module_put(icsk->icsk_ulp_ops->owner);
> +
> +       icsk->icsk_ulp_ops = NULL;
>  }
>
>  /* Change upper layer protocol for socket */
> --
> 2.9.5
>

^ permalink raw reply

* Re: [PATCH bpf 3/5] bpf, sockmap: fix leakage of smap_psock_map_entry
From: Song Liu @ 2018-08-16 21:27 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, John Fastabend, Networking
In-Reply-To: <20180816194910.9040-4-daniel@iogearbox.net>

On Thu, Aug 16, 2018 at 12:49 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> While working on sockmap I noticed that we do not always kfree the
> struct smap_psock_map_entry list elements which track psocks attached
> to maps. In the case of sock_hash_ctx_update_elem(), these map entries
> are allocated outside of __sock_map_ctx_update_elem() with their
> linkage to the socket hash table filled. In the case of sock array,
> the map entries are allocated inside of __sock_map_ctx_update_elem()
> and added with their linkage to the psock->maps. Both additions are
> under psock->maps_lock each.
>
> Now, we drop these elements from their psock->maps list in a few
> occasions: i) in sock array via smap_list_map_remove() when an entry
> is either deleted from the map from user space, or updated via
> user space or BPF program where we drop the old socket at that map
> slot, or the sock array is freed via sock_map_free() and drops all
> its elements; ii) for sock hash via smap_list_hash_remove() in exactly
> the same occasions as just described for sock array; iii) in the
> bpf_tcp_close() where we remove the elements from the list via
> psock_map_pop() and iterate over them dropping themselves from either
> sock array or sock hash; and last but not least iv) once again in
> smap_gc_work() which is a callback for deferring the work once the
> psock refcount hit zero and thus the socket is being destroyed.
>
> Problem is that the only case where we kfree() the list entry is
> in case iv), which at that point should have an empty list in
> normal cases. So in cases from i) to iii) we unlink the elements
> without freeing where they go out of reach from us. Hence fix is
> to properly kfree() them as well to stop the leakage. Given these
> are all handled under psock->maps_lock there is no need for deferred
> RCU freeing.
>
> I later also ran with kmemleak detector and it confirmed the finding
> as well where in the state before the fix the object goes unreferenced
> while after the patch no kmemleak report related to BPF showed up.
>
>   [...]
>   unreferenced object 0xffff880378eadae0 (size 64):
>     comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
>     hex dump (first 32 bytes):
>       00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
>       50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  PMu]............
>     backtrace:
>       [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
>       [<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
>       [<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
>       [<000000002ef89e83>] 0xffffffffffffffff
>   unreferenced object 0xffff880378ead240 (size 64):
>     comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
>     hex dump (first 32 bytes):
>       00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
>       00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  .Du]............
>     backtrace:
>       [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
>       [<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
>       [<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
>       [<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
>       [<0000000000763660>] do_syscall_64+0x9a/0x300
>       [<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>       [<000000002ef89e83>] 0xffffffffffffffff
>   [...]
>
> Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
> Fixes: 54fedb42c653 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
> Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/sockmap.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 0c1a696..94a324b 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -370,6 +370,7 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>                         }
>                         raw_spin_unlock_bh(&b->lock);
>                 }
> +               kfree(e);
>                 e = psock_map_pop(sk, psock);
>         }
>         rcu_read_unlock();
> @@ -1675,8 +1676,10 @@ static void smap_list_map_remove(struct smap_psock *psock,
>
>         spin_lock_bh(&psock->maps_lock);
>         list_for_each_entry_safe(e, tmp, &psock->maps, list) {
> -               if (e->entry == entry)
> +               if (e->entry == entry) {
>                         list_del(&e->list);
> +                       kfree(e);
> +               }
>         }
>         spin_unlock_bh(&psock->maps_lock);
>  }
> @@ -1690,8 +1693,10 @@ static void smap_list_hash_remove(struct smap_psock *psock,
>         list_for_each_entry_safe(e, tmp, &psock->maps, list) {
>                 struct htab_elem *c = rcu_dereference(e->hash_link);
>
> -               if (c == hash_link)
> +               if (c == hash_link) {
>                         list_del(&e->list);
> +                       kfree(e);
> +               }
>         }
>         spin_unlock_bh(&psock->maps_lock);
>  }
> --
> 2.9.5
>

^ permalink raw reply

* Re: [PATCH bpf 4/5] bpf, sockmap: fix map elem deletion race with smap_stop_sock
From: Song Liu @ 2018-08-16 21:30 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, John Fastabend, Networking
In-Reply-To: <20180816194910.9040-5-daniel@iogearbox.net>

On Thu, Aug 16, 2018 at 12:49 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> The smap_start_sock() and smap_stop_sock() are each protected under
> the sock->sk_callback_lock from their call-sites except in the case
> of sock_map_delete_elem() where we drop the old socket from the map
> slot. This is racy because the same sock could be part of multiple
> sock maps, so we run smap_stop_sock() in parallel, and given at that
> point psock->strp_enabled might be true on both CPUs, we might for
> example wrongly restore the sk->sk_data_ready / sk->sk_write_space.
> Therefore, hold the sock->sk_callback_lock as well on delete. Looks
> like 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add
> multi-map support") had this right, but later on e9db4ef6bf4c ("bpf:
> sockhash fix omitted bucket lock in sock_close") removed it again
> from delete leaving this smap_stop_sock() instance unprotected.
>
> Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/sockmap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 94a324b..921cb6b 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1786,8 +1786,11 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
>         if (!psock)
>                 goto out;
>
> -       if (psock->bpf_parse)
> +       if (psock->bpf_parse) {
> +               write_lock_bh(&sock->sk_callback_lock);
>                 smap_stop_sock(psock, sock);
> +               write_unlock_bh(&sock->sk_callback_lock);
> +       }
>         smap_list_map_remove(psock, &stab->sock_map[k]);
>         smap_release_sock(psock, sock);
>  out:
> --
> 2.9.5
>

^ permalink raw reply

* RE: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
From: Tristram.Ha @ 2018-08-16 21:34 UTC (permalink / raw)
  To: f.fainelli
  Cc: arkadis, UNGLinuxDriver, netdev, andrew, pavel, ruediger.schmitt
In-Reply-To: <1227c661-14c0-eb93-b487-f2e478f77fb8@gmail.com>

> -----Original Message-----
> From: Florian Fainelli <f.fainelli@gmail.com>
> Sent: Wednesday, August 15, 2018 5:29 PM
> To: Tristram Ha - C24268 <Tristram.Ha@microchip.com>; Andrew Lunn
> <andrew@lunn.ch>; Pavel Machek <pavel@ucw.cz>; Ruediger Schmitt
> <ruediger.schmitt@philips.com>
> Cc: Arkadi Sharshevsky <arkadis@mellanox.com>; UNGLinuxDriver
> <UNGLinuxDriver@microchip.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477
> DSA driver in preparation to add other KSZ switch drivers
> 
> On 12/05/2017 05:46 PM, Tristram.Ha@microchip.com wrote:
> > From: Tristram Ha <Tristram.Ha@microchip.com>
> >
> > This series of patches is to modify the original KSZ9477 DSA driver so
> > that other KSZ switch drivers can be added and use the common code.
> >
> > There are several steps to accomplish this achievement.  First is to
> > rename some function names with a prefix to indicate chip specific
> > function.  Second is to move common code into header that can be shared.
> > Last is to modify tag_ksz.c so that it can handle many tail tag formats
> > used by different KSZ switch drivers.
> >
> > ksz_common.c will contain the common code used by all KSZ switch drivers.
> > ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> > ksz9477_spi.c is renamed from ksz_spi.c.
> > ksz9477_reg.h is renamed from ksz_9477_reg.h.
> > ksz_common.h is added to provide common code access to KSZ switch
> > drivers.
> > ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> > drivers.
> 
> Is something gating this series from getting included? It's been nearly
> 8 months now and this has not been include nor resubmitted, any plans to
> rebase that patch series and work towards inclusion in net-next when it
> opens back again?
> 
> Thank you!

Sorry for the long delay.  I will restart my kernel submission effort next month
after finishing the work on current development project.


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs
From: Shannon Nelson @ 2018-08-16 21:36 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: intel-wired-lan, Jeff Kirsher, Steffen Klassert, Netdev
In-Reply-To: <CAKgT0UfjB35yWMALtLUh-dU78VUW95=8MMn8rLcnKiTQxOB-Eg@mail.gmail.com>

On 8/16/2018 2:15 PM, Alexander Duyck wrote:
> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
> <shannon.nelson@oracle.com> wrote:
>>
>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
>>> <shannon.nelson@oracle.com> wrote:
>>>>
>>>> This set of patches implements IPsec hardware offload for VF devices in
>>>> Intel's 10Gbe x540 family of Ethernet devices.
>>
>> [...]
>>
>>>
>>> So the one question I would have about this patch set is what happens
>>> if you are setting up a ipsec connection between the PF and one of the
>>> VFs on the same port/function? Do the ipsec offloads get translated
>>> across the Tx loopback or do they end up causing issues? Specifically
>>> I would be interested in seeing the results of a test either between
>>> two VFs, or the PF and one of the VFs on the same port.
>>>
>>> - Alex
>>>
>>
>> There is definitely something funky in the internal switch connection,
>> as messages going from PF to VF with an offloaded encryption don't seem
>> to get received by the VF, at least when in a VEB setup.  If I only set
>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
>> things work.
>>
>> I don't have a setup to test this, but I suspect that in a VEPA
>> configuration, with packets going out to the switch and turned around
>> back in, the Tx encryption offload would happen as expected.
>>
>> sln
> 
> We should probably look at adding at least one patch to the set then
> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
> we don't end up breaking connections should a VF be migrated from a
> remote system to a local one that it is connected to.
> 
> - Alex
> 

The problem with this is that someone could set up an IPsec connection 
on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we 
still can end up in the same place.  I don't think we want to disallow 
all Tx IPsec offload.

Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that 
the dest mac is on the internal switch, perhaps it can NAK the Tx 
offload?  That would force the XFRM xmit code to do a regular SW encrypt 
before sending the packet.  I'll look into this.

sln

^ permalink raw reply

* Re: [PATCH bpf 5/5] bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
From: Song Liu @ 2018-08-16 21:51 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, John Fastabend, Networking
In-Reply-To: <20180816194910.9040-6-daniel@iogearbox.net>

On Thu, Aug 16, 2018 at 12:49 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> The current code in sock_map_ctx_update_elem() allows for BPF_EXIST
> and BPF_NOEXIST map update flags. While on array-like maps this approach
> is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others
> enforce map update flags to be BPF_ANY such that xchg() can be used
> directly, the current implementation in sock map does not guarantee
> that such operation with BPF_EXIST / BPF_NOEXIST is atomic.
>
> The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
> socket from the slot which is then tested for NULL / non-NULL. However
> later after __sock_map_ctx_update_elem(), the actual update is done
> through osock = xchg(&stab->sock_map[i], sock). Problem is that in
> the meantime a different CPU could have updated / deleted a socket
> on that specific slot and thus flag contraints won't hold anymore.
>
> I've been thinking whether best would be to just break UAPI and do
> an enforcement of BPF_ANY to check if someone actually complains,
> however trouble is that already in BPF kselftest we use BPF_NOEXIST
> for the map update, and therefore it might have been copied into
> applications already. The fix to keep the current behavior intact
> would be to add a map lock similar to the sock hash bucket lock only
> for covering the whole map.
>
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  kernel/bpf/sockmap.c | 106 +++++++++++++++++++++++++++------------------------
>  1 file changed, 57 insertions(+), 49 deletions(-)
>
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 921cb6b..98e621a 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -58,6 +58,7 @@ struct bpf_stab {
>         struct bpf_map map;
>         struct sock **sock_map;
>         struct bpf_sock_progs progs;
> +       raw_spinlock_t lock;
>  };
>
>  struct bucket {
> @@ -89,9 +90,9 @@ enum smap_psock_state {
>
>  struct smap_psock_map_entry {
>         struct list_head list;
> +       struct bpf_map *map;
>         struct sock **entry;
>         struct htab_elem __rcu *hash_link;
> -       struct bpf_htab __rcu *htab;
>  };
>
>  struct smap_psock {
> @@ -343,13 +344,18 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>         e = psock_map_pop(sk, psock);
>         while (e) {
>                 if (e->entry) {
> -                       osk = cmpxchg(e->entry, sk, NULL);
> +                       struct bpf_stab *stab = container_of(e->map, struct bpf_stab, map);
> +
> +                       raw_spin_lock_bh(&stab->lock);
> +                       osk = *e->entry;
>                         if (osk == sk) {
> +                               *e->entry = NULL;
>                                 smap_release_sock(psock, sk);
>                         }
> +                       raw_spin_unlock_bh(&stab->lock);
>                 } else {
>                         struct htab_elem *link = rcu_dereference(e->hash_link);
> -                       struct bpf_htab *htab = rcu_dereference(e->htab);
> +                       struct bpf_htab *htab = container_of(e->map, struct bpf_htab, map);
>                         struct hlist_head *head;
>                         struct htab_elem *l;
>                         struct bucket *b;
> @@ -1642,6 +1648,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
>                 return ERR_PTR(-ENOMEM);
>
>         bpf_map_init_from_attr(&stab->map, attr);
> +       raw_spin_lock_init(&stab->lock);
>
>         /* make sure page count doesn't overflow */
>         cost = (u64) stab->map.max_entries * sizeof(struct sock *);
> @@ -1716,14 +1723,15 @@ static void sock_map_free(struct bpf_map *map)
>          * and a grace period expire to ensure psock is really safe to remove.
>          */
>         rcu_read_lock();
> +       raw_spin_lock_bh(&stab->lock);
>         for (i = 0; i < stab->map.max_entries; i++) {
>                 struct smap_psock *psock;
>                 struct sock *sock;
>
> -               sock = xchg(&stab->sock_map[i], NULL);
> +               sock = stab->sock_map[i];
>                 if (!sock)
>                         continue;
> -
> +               stab->sock_map[i] = NULL;
>                 psock = smap_psock_sk(sock);
>                 /* This check handles a racing sock event that can get the
>                  * sk_callback_lock before this case but after xchg happens
> @@ -1735,6 +1743,7 @@ static void sock_map_free(struct bpf_map *map)
>                         smap_release_sock(psock, sock);
>                 }
>         }
> +       raw_spin_unlock_bh(&stab->lock);
>         rcu_read_unlock();
>
>         sock_map_remove_complete(stab);
> @@ -1778,14 +1787,16 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
>         if (k >= map->max_entries)
>                 return -EINVAL;
>
> -       sock = xchg(&stab->sock_map[k], NULL);
> +       raw_spin_lock_bh(&stab->lock);
> +       sock = stab->sock_map[k];
> +       stab->sock_map[k] = NULL;
> +       raw_spin_unlock_bh(&stab->lock);
>         if (!sock)
>                 return -EINVAL;
>
>         psock = smap_psock_sk(sock);
>         if (!psock)
> -               goto out;
> -
> +               return 0;
>         if (psock->bpf_parse) {
>                 write_lock_bh(&sock->sk_callback_lock);
>                 smap_stop_sock(psock, sock);
> @@ -1793,7 +1804,6 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
>         }
>         smap_list_map_remove(psock, &stab->sock_map[k]);
>         smap_release_sock(psock, sock);
> -out:
>         return 0;
>  }
>
> @@ -1829,11 +1839,9 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
>  static int __sock_map_ctx_update_elem(struct bpf_map *map,
>                                       struct bpf_sock_progs *progs,
>                                       struct sock *sock,
> -                                     struct sock **map_link,
>                                       void *key)
>  {
>         struct bpf_prog *verdict, *parse, *tx_msg;
> -       struct smap_psock_map_entry *e = NULL;
>         struct smap_psock *psock;
>         bool new = false;
>         int err = 0;
> @@ -1906,14 +1914,6 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
>                 new = true;
>         }
>
> -       if (map_link) {
> -               e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
> -               if (!e) {
> -                       err = -ENOMEM;
> -                       goto out_free;
> -               }
> -       }
> -
>         /* 3. At this point we have a reference to a valid psock that is
>          * running. Attach any BPF programs needed.
>          */
> @@ -1935,17 +1935,6 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
>                 write_unlock_bh(&sock->sk_callback_lock);
>         }
>
> -       /* 4. Place psock in sockmap for use and stop any programs on
> -        * the old sock assuming its not the same sock we are replacing
> -        * it with. Because we can only have a single set of programs if
> -        * old_sock has a strp we can stop it.
> -        */
> -       if (map_link) {
> -               e->entry = map_link;
> -               spin_lock_bh(&psock->maps_lock);
> -               list_add_tail(&e->list, &psock->maps);
> -               spin_unlock_bh(&psock->maps_lock);
> -       }
>         return err;
>  out_free:
>         smap_release_sock(psock, sock);
> @@ -1956,7 +1945,6 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
>         }
>         if (tx_msg)
>                 bpf_prog_put(tx_msg);
> -       kfree(e);
>         return err;
>  }
>
> @@ -1966,36 +1954,57 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  {
>         struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>         struct bpf_sock_progs *progs = &stab->progs;
> -       struct sock *osock, *sock;
> +       struct sock *osock, *sock = skops->sk;
> +       struct smap_psock_map_entry *e;
> +       struct smap_psock *psock;
>         u32 i = *(u32 *)key;
>         int err;
>
>         if (unlikely(flags > BPF_EXIST))
>                 return -EINVAL;
> -
>         if (unlikely(i >= stab->map.max_entries))
>                 return -E2BIG;
>
> -       sock = READ_ONCE(stab->sock_map[i]);
> -       if (flags == BPF_EXIST && !sock)
> -               return -ENOENT;
> -       else if (flags == BPF_NOEXIST && sock)
> -               return -EEXIST;
> +       e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
> +       if (!e)
> +               return -ENOMEM;
>
> -       sock = skops->sk;
> -       err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i],
> -                                        key);
> +       err = __sock_map_ctx_update_elem(map, progs, sock, key);
>         if (err)
>                 goto out;
>
> -       osock = xchg(&stab->sock_map[i], sock);
> -       if (osock) {
> -               struct smap_psock *opsock = smap_psock_sk(osock);
> +       /* psock guaranteed to be present. */
> +       psock = smap_psock_sk(sock);
> +       raw_spin_lock_bh(&stab->lock);
> +       osock = stab->sock_map[i];
> +       if (osock && flags == BPF_NOEXIST) {
> +               err = -EEXIST;
> +               goto out_unlock;
> +       }
> +       if (!osock && flags == BPF_EXIST) {
> +               err = -ENOENT;
> +               goto out_unlock;
> +       }
> +
> +       e->entry = &stab->sock_map[i];
> +       e->map = map;
> +       spin_lock_bh(&psock->maps_lock);
> +       list_add_tail(&e->list, &psock->maps);
> +       spin_unlock_bh(&psock->maps_lock);
>
> -               smap_list_map_remove(opsock, &stab->sock_map[i]);
> -               smap_release_sock(opsock, osock);
> +       stab->sock_map[i] = sock;
> +       if (osock) {
> +               psock = smap_psock_sk(osock);
> +               smap_list_map_remove(psock, &stab->sock_map[i]);
> +               smap_release_sock(psock, osock);
>         }
> +       raw_spin_unlock_bh(&stab->lock);
> +       return 0;
> +out_unlock:
> +       smap_release_sock(psock, sock);
> +       raw_spin_unlock_bh(&stab->lock);
>  out:
> +       kfree(e);
>         return err;
>  }
>
> @@ -2358,7 +2367,7 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>         b = __select_bucket(htab, hash);
>         head = &b->head;
>
> -       err = __sock_map_ctx_update_elem(map, progs, sock, NULL, key);
> +       err = __sock_map_ctx_update_elem(map, progs, sock, key);
>         if (err)
>                 goto err;
>
> @@ -2384,8 +2393,7 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>         }
>
>         rcu_assign_pointer(e->hash_link, l_new);
> -       rcu_assign_pointer(e->htab,
> -                          container_of(map, struct bpf_htab, map));
> +       e->map = map;
>         spin_lock_bh(&psock->maps_lock);
>         list_add_tail(&e->list, &psock->maps);
>         spin_unlock_bh(&psock->maps_lock);
> --
> 2.9.5
>

^ permalink raw reply

* KINDLY REPLY diplmosesd@gmail.com URGENTLY
From: MR MOSES @ 2018-08-16 21:50 UTC (permalink / raw)
  To: Recipients

KINDLY REPLY diplmosesd@gmail.com URGENTLY

^ permalink raw reply

* Re: [PATCH bpf 0/5] BPF sockmap and ulp fixes
From: Alexei Starovoitov @ 2018-08-16 22:06 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: john.fastabend, netdev
In-Reply-To: <20180816194910.9040-1-daniel@iogearbox.net>

On Thu, Aug 16, 2018 at 09:49:05PM +0200, Daniel Borkmann wrote:
> Batch of various fixes related to BPF sockmap and ULP, including
> adding module alias to restrict module requests, races and memory
> leaks in sockmap code. For details please refer to the individual
> patches. Thanks!

Applied to bpf tree, Thanks everyone

^ permalink raw reply

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-08-16 22:37 UTC (permalink / raw)
  To: ecree
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, Petar Penkov,
	Willem de Bruijn
In-Reply-To: <cb06e7c6-2042-1243-fe6e-3e1e33e4b14d@solarflare.com>

On Thu, Aug 16, 2018 at 2:51 PM Edward Cree <ecree@solarflare.com> wrote:
>
> On 16/08/18 17:44, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is kept as a global variable so it is
> > accessible to all flow dissectors.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
>
> This looks really great.
>
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > +{
> > +     struct bpf_prog *attached;
> > +
> > +     mutex_lock(&flow_dissector_mutex);
> > +     attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                          lockdep_is_held(&flow_dissector_mutex));
> > +     if (!flow_dissector_prog) {
> > +             mutex_unlock(&flow_dissector_mutex);
> > +             return -EINVAL;
> Wouldn't -ENOENT be more usual here (as the counterpart to -EEXIST in
>  the skb_flow_dissector_bpf_prog_attach() version just above)?

Absolutely. That better matches bpf_detach behavior, too.

^ permalink raw reply

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Song Liu @ 2018-08-16 22:40 UTC (permalink / raw)
  To: Petar Penkov
  Cc: Networking, David S . Miller, Alexei Starovoitov, Daniel Borkmann,
	simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-2-peterpenkov96@gmail.com>

On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is kept as a global variable so it is
> accessible to all flow dissectors.
>
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/linux/bpf_types.h                 |   1 +
>  include/linux/skbuff.h                    |   7 +
>  include/net/flow_dissector.h              |  16 +++
>  include/uapi/linux/bpf.h                  |  14 +-
>  kernel/bpf/syscall.c                      |   8 ++
>  kernel/bpf/verifier.c                     |   2 +
>  net/core/filter.c                         | 157 ++++++++++++++++++++++
>  net/core/flow_dissector.c                 |  76 +++++++++++
>  tools/bpf/bpftool/prog.c                  |   1 +
>  tools/include/uapi/linux/bpf.h            |   5 +-
>  tools/lib/bpf/libbpf.c                    |   2 +
>  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
>  12 files changed, 290 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index cd26c090e7c0..22083712dd18 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
>  #ifdef CONFIG_INET
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
>  #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
>
>  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 17a13e4785fc..ce0e863f02a2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -243,6 +243,8 @@ struct scatterlist;
>  struct pipe_inode_info;
>  struct iov_iter;
>  struct napi_struct;
> +struct bpf_prog;
> +union bpf_attr;
>
>  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
>  struct nf_conntrack {
> @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>                              const struct flow_dissector_key *key,
>                              unsigned int key_count);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> +                                      struct bpf_prog *prog);
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> +
>  bool __skb_flow_dissect(const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
>                         void *target_container,
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index 6a4586dcdede..edb919d320c1 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
>  extern struct flow_dissector flow_keys_dissector;
>  extern struct flow_dissector flow_keys_basic_dissector;
>
> +/* struct bpf_flow_dissect_cb:
> + *
> + * This struct is used to pass parameters to BPF programs of type
> + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> + * the control block of the skb to be a struct of this type. The first field is
> + * used to communicate the next header offset between the BPF programs and the
> + * first value of it is passed from the kernel. The last two fields are used for
> + * writing out flow keys.
> + */
> +struct bpf_flow_dissect_cb {
> +       u16 nhoff;
> +       u16 unused;
> +       void *target_container;
> +       struct flow_dissector *flow_dissector;
> +};
> +
>  /* struct flow_keys_digest:
>   *
>   * This structure is used to hold a digest of the full flow keys. This is a
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4eba27..8bc0fdab685d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>         BPF_PROG_TYPE_LIRC_MODE2,
>         BPF_PROG_TYPE_SK_REUSEPORT,
> +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>  };
>
>  enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_UDP4_SENDMSG,
>         BPF_CGROUP_UDP6_SENDMSG,
>         BPF_LIRC_MODE2,
> +       BPF_FLOW_DISSECTOR,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -2141,6 +2143,15 @@ union bpf_attr {
>   *             request in the skb.
>   *     Return
>   *             0 on success, or a negative error in case of failure.
> + *
> + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> + *     Description
> + *             Try to write *len* bytes from the source pointer into the offset
> + *             of the key with id *key_id*. If *len* is different from the
> + *             size of the key, an error is returned. If the key is not used,
> + *             this function exits with no effect and code 0.
> + *     Return
> + *             0 on success, negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2226,7 +2237,8 @@ union bpf_attr {
>         FN(get_current_cgroup_id),      \
>         FN(get_local_storage),          \
>         FN(sk_select_reuseport),        \
> -       FN(skb_ancestor_cgroup_id),
> +       FN(skb_ancestor_cgroup_id),     \
> +       FN(flow_dissector_write_keys),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 43727ed0d94a..a06568841a92 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         case BPF_LIRC_MODE2:
>                 ptype = BPF_PROG_TYPE_LIRC_MODE2;
>                 break;
> +       case BPF_FLOW_DISSECTOR:
> +               ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> +               break;
>         default:
>                 return -EINVAL;
>         }
> @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         case BPF_PROG_TYPE_LIRC_MODE2:
>                 ret = lirc_prog_attach(attr, prog);
>                 break;
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> +               ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> +               break;
>         default:
>                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>         }
> @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>                 return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
>         case BPF_LIRC_MODE2:
>                 return lirc_prog_detach(attr);
> +       case BPF_FLOW_DISSECTOR:
> +               return skb_flow_dissector_bpf_prog_detach(attr);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ca90679a7fe5..6d3f268fa8e0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>         case BPF_PROG_TYPE_LWT_XMIT:
>         case BPF_PROG_TYPE_SK_SKB:
>         case BPF_PROG_TYPE_SK_MSG:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 if (meta)
>                         return meta->pkt_access;
>
> @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_SOCKET_FILTER:
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 return true;
>         default:
>                 return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fd423ce3da34..03d3037e6508 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
>         return false;
>  }
>
> +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> +          const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> +{
> +       struct bpf_flow_dissect_cb *cb;
> +       void *dest;
> +
> +       cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> +
> +       /* Make sure the dissector actually uses the key. It is not an error if
> +        * it does not, but we should not continue past this point in that case
> +        */
> +       if (!dissector_uses_key(cb->flow_dissector, key_id))
> +               return 0;
> +
> +       /* Make sure the length is correct */
> +       switch (key_id) {
> +       case FLOW_DISSECTOR_KEY_CONTROL:
> +       case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> +               if (len != sizeof(struct flow_dissector_key_control))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_BASIC:
> +               if (len != sizeof(struct flow_dissector_key_basic))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> +       case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> +       case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ICMP:
> +               if (len != sizeof(struct flow_dissector_key_icmp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_PORTS:
> +       case FLOW_DISSECTOR_KEY_ENC_PORTS:
> +               if (len != sizeof(struct flow_dissector_key_ports))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_eth_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_TIPC:
> +               if (len != sizeof(struct flow_dissector_key_tipc))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ARP:
> +               if (len != sizeof(struct flow_dissector_key_arp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_VLAN:
> +       case FLOW_DISSECTOR_KEY_CVLAN:
> +               if (len != sizeof(struct flow_dissector_key_vlan))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> +               if (len != sizeof(struct flow_dissector_key_tags))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_GRE_KEYID:
> +       case FLOW_DISSECTOR_KEY_ENC_KEYID:
> +       case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> +               if (len != sizeof(struct flow_dissector_key_keyid))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_MPLS:
> +               if (len != sizeof(struct flow_dissector_key_mpls))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_TCP:
> +               if (len != sizeof(struct flow_dissector_key_tcp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IP:
> +       case FLOW_DISSECTOR_KEY_ENC_IP:
> +               if (len != sizeof(struct flow_dissector_key_ip))
> +                       return -EINVAL;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> +                                        cb->target_container);
> +
> +       memcpy(dest, from, len);
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> +       .func           = bpf_flow_dissector_write_keys,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_PTR_TO_MEM,
> +       .arg3_type      = ARG_CONST_SIZE,
> +       .arg4_type      = ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  bpf_base_func_proto(enum bpf_func_id func_id)
>  {
> @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_skb_load_bytes:
> +               return &bpf_skb_load_bytes_proto;
> +       case BPF_FUNC_flow_dissector_write_keys:
> +               return &bpf_flow_dissector_write_keys_proto;
> +       default:
> +               return bpf_base_func_proto(func_id);
> +       }
> +}
> +
>  static const struct bpf_func_proto *
>  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  {
> @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
>         return true;
>  }
>
> +static bool flow_dissector_is_valid_access(int off, int size,
> +                                          enum bpf_access_type type,
> +                                          const struct bpf_prog *prog,
> +                                          struct bpf_insn_access_aux *info)
> +{
> +       if (type == BPF_WRITE) {
> +               switch (off) {
> +               case bpf_ctx_range(struct __sk_buff, cb[0]):
> +                       break;
> +               default:
> +                       return false;
> +               }
> +       }
> +
> +       switch (off) {
> +       case bpf_ctx_range(struct __sk_buff, data):
> +               info->reg_type = PTR_TO_PACKET;
> +               break;
> +       case bpf_ctx_range(struct __sk_buff, data_end):
> +               info->reg_type = PTR_TO_PACKET_END;
> +               break;
> +       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> +       case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> +               return false;
> +       }
> +
> +       return bpf_skb_is_valid_access(off, size, type, prog, info);
> +}
> +
>  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>                                   const struct bpf_insn *si,
>                                   struct bpf_insn *insn_buf,
> @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
>  const struct bpf_prog_ops sk_msg_prog_ops = {
>  };
>
> +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> +       .get_func_proto         = flow_dissector_func_proto,
> +       .is_valid_access        = flow_dissector_is_valid_access,
> +       .convert_ctx_access     = bpf_convert_ctx_access,
> +       .gen_ld_abs             = bpf_gen_ld_abs,
> +};
> +
> +const struct bpf_prog_ops flow_dissector_prog_ops = {
> +};
> +
>  int sk_detach_filter(struct sock *sk)
>  {
>         int ret = -ENOENT;
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index ce9eeeb7c024..767daa231f04 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -25,6 +25,11 @@
>  #include <net/flow_dissector.h>
>  #include <scsi/fc/fc_fcoe.h>
>  #include <uapi/linux/batadv_packet.h>
> +#include <linux/bpf.h>
> +
> +/* BPF program accessible by all flow dissectors */
> +static struct bpf_prog __rcu *flow_dissector_prog;
> +static DEFINE_MUTEX(flow_dissector_mutex);
>
>  static void dissector_set_key(struct flow_dissector *flow_dissector,
>                               enum flow_dissector_key_id key_id)
> @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>  }
>  EXPORT_SYMBOL(skb_flow_dissector_init);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> +                                      struct bpf_prog *prog)
> +{
> +       struct bpf_prog *attached;
> +
> +       mutex_lock(&flow_dissector_mutex);
> +       attached = rcu_dereference_protected(flow_dissector_prog,
> +                                            lockdep_is_held(&flow_dissector_mutex));
> +       if (attached) {
> +               /* Only one BPF program can be attached at a time */
> +               mutex_unlock(&flow_dissector_mutex);
> +               return -EEXIST;
> +       }
> +       rcu_assign_pointer(flow_dissector_prog, prog);
> +       mutex_unlock(&flow_dissector_mutex);
> +       return 0;
> +}
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> +{
> +       struct bpf_prog *attached;
> +
> +       mutex_lock(&flow_dissector_mutex);
> +       attached = rcu_dereference_protected(flow_dissector_prog,
> +                                            lockdep_is_held(&flow_dissector_mutex));
> +       if (!flow_dissector_prog) {
> +               mutex_unlock(&flow_dissector_mutex);
> +               return -EINVAL;
> +       }
> +       bpf_prog_put(attached);
> +       RCU_INIT_POINTER(flow_dissector_prog, NULL);
> +       mutex_unlock(&flow_dissector_mutex);
> +       return 0;
> +}
>  /**
>   * skb_flow_get_be16 - extract be16 entity
>   * @skb: sk_buff to extract from
> @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>         struct flow_dissector_key_vlan *key_vlan;
>         enum flow_dissect_ret fdret;
>         enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> +       struct bpf_prog *attached;
>         int num_hdrs = 0;
>         u8 ip_proto = 0;
>         bool ret;
> @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>                                               FLOW_DISSECTOR_KEY_BASIC,
>                                               target_container);
>
> +       rcu_read_lock();
> +       attached = rcu_dereference(flow_dissector_prog);
> +       if (attached) {
> +               /* Note that even though the const qualifier is discarded
> +                * throughout the execution of the BPF program, all changes(the
> +                * control block) are reverted after the BPF program returns.
> +                * Therefore, __skb_flow_dissect does not alter the skb.
> +                */
> +               struct bpf_flow_dissect_cb *cb;
> +               u8 cb_saved[BPF_SKB_CB_LEN];
> +               u32 result;
> +
> +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> +
> +               /* Save Control Block */
> +               memcpy(cb_saved, cb, sizeof(cb_saved));
> +               memset(cb, 0, sizeof(cb_saved));
> +
> +               /* Pass parameters to the BPF program */
> +               cb->nhoff = nhoff;
> +               cb->target_container = target_container;
> +               cb->flow_dissector = flow_dissector;
> +
> +               bpf_compute_data_pointers((struct sk_buff *)skb);
> +               result = BPF_PROG_RUN(attached, skb);
> +
> +               /* Restore state */
> +               memcpy(cb, cb_saved, sizeof(cb_saved));
> +
> +               key_control->thoff = min_t(u16, key_control->thoff,
> +                                          skb ? skb->len : hlen);
> +               rcu_read_unlock();
> +               return result == BPF_OK;
> +       }

If the BPF program cannot handle certain protocol, shall we fall back
to the built-in logic? Otherwise, all BPF programs need to have some
code for all protocols.

Song

> +       rcu_read_unlock();
> +
>         if (dissector_uses_key(flow_dissector,
>                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>                 struct ethhdr *eth = eth_hdr(skb);
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index dce960d22106..b1cd3bc8db70 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
>         [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
>         [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
>         [BPF_PROG_TYPE_LIRC_MODE2]      = "lirc_mode2",
> +       [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
>  };
>
>  static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 66917a4eba27..acd74a0dd063 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>         BPF_PROG_TYPE_LIRC_MODE2,
>         BPF_PROG_TYPE_SK_REUSEPORT,
> +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>  };
>
>  enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_UDP4_SENDMSG,
>         BPF_CGROUP_UDP6_SENDMSG,
>         BPF_LIRC_MODE2,
> +       BPF_FLOW_DISSECTOR,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -2226,7 +2228,8 @@ union bpf_attr {
>         FN(get_current_cgroup_id),      \
>         FN(get_local_storage),          \
>         FN(sk_select_reuseport),        \
> -       FN(skb_ancestor_cgroup_id),
> +       FN(skb_ancestor_cgroup_id),     \
> +       FN(flow_dissector_write_keys),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 2abd0f112627..0c749ce1b717 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>         case BPF_PROG_TYPE_LIRC_MODE2:
>         case BPF_PROG_TYPE_SK_REUSEPORT:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 return false;
>         case BPF_PROG_TYPE_UNSPEC:
>         case BPF_PROG_TYPE_KPROBE:
> @@ -2121,6 +2122,7 @@ static const struct {
>         BPF_PROG_SEC("sk_skb",          BPF_PROG_TYPE_SK_SKB),
>         BPF_PROG_SEC("sk_msg",          BPF_PROG_TYPE_SK_MSG),
>         BPF_PROG_SEC("lirc_mode2",      BPF_PROG_TYPE_LIRC_MODE2),
> +       BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
>         BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
>         BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
>         BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index e4be7730222d..4204c496a04f 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
>         (void *) BPF_FUNC_skb_cgroup_id;
>  static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
>         (void *) BPF_FUNC_skb_ancestor_cgroup_id;
> +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
> +                                           int key) =
> +       (void *) BPF_FUNC_flow_dissector_write_keys;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> --
> 2.18.0.865.gffc8e1a3cd6-goog
>

^ permalink raw reply

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Russell King - ARM Linux @ 2018-08-16 22:58 UTC (permalink / raw)
  To: Marc Haber
  Cc: Peter Robinson, linux-arm-kernel, netdev, labbott, Eric Dumazet,
	Daniel Borkmann
In-Reply-To: <20180816203515.GA7688@torres.zugschlus.de>

On Thu, Aug 16, 2018 at 10:35:16PM +0200, Marc Haber wrote:
> On Mon, Jun 25, 2018 at 05:41:27PM +0100, Peter Robinson wrote:
> > So with that and the other fix there was no improvement, with those
> > and the BPF JIT disabled it works, I'm not sure if the two patches
> > have any effect with the JIT disabled though.
> 
> I can confirm the crash with the released 4.18.1 on Banana Pi, and I can
> also confirm that disabling BPF JIT makes the Banana Pi work again.,

Hi,

I'm afraid that the information in the crash dumps is insufficient
to be able to work very much out about these crashes.

We need a recipe (kernel configuration and what userspace is doing)
so that it's possible to recreate the crash, or we need responses
to requests for information - I requested the disassembly of
sk_filter_trim_cap and the BPF code dump via setting a sysctl back
in early July.  Without this, as I say, I don't see how this problem
can be progressed.

If the problem is at boot, one way to set the sysctl would be to
hack the kernel and explicitly initialise the sysctl to '2', or
boot with init=/bin/sh, then manually mount /proc, set the sysctl,
and then "exec /sbin/init" from that shell.  (Remember there's no
job control in that shell, so ^z, ^c, etc do not work.)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up

^ permalink raw reply

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-08-16 23:14 UTC (permalink / raw)
  To: Song Liu
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <CAPhsuW5pOYdTx+w06=xW0XMPkjx62RsJ2EP5iJZCx_3QQZb=xw@mail.gmail.com>

On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is kept as a global variable so it is
> > accessible to all flow dissectors.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/linux/bpf_types.h                 |   1 +
> >  include/linux/skbuff.h                    |   7 +
> >  include/net/flow_dissector.h              |  16 +++
> >  include/uapi/linux/bpf.h                  |  14 +-
> >  kernel/bpf/syscall.c                      |   8 ++
> >  kernel/bpf/verifier.c                     |   2 +
> >  net/core/filter.c                         | 157 ++++++++++++++++++++++
> >  net/core/flow_dissector.c                 |  76 +++++++++++
> >  tools/bpf/bpftool/prog.c                  |   1 +
> >  tools/include/uapi/linux/bpf.h            |   5 +-
> >  tools/lib/bpf/libbpf.c                    |   2 +
> >  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
> >  12 files changed, 290 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index cd26c090e7c0..22083712dd18 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> >  #ifdef CONFIG_INET
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
> >  #endif
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
> >
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 17a13e4785fc..ce0e863f02a2 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -243,6 +243,8 @@ struct scatterlist;
> >  struct pipe_inode_info;
> >  struct iov_iter;
> >  struct napi_struct;
> > +struct bpf_prog;
> > +union bpf_attr;
> >
> >  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> >  struct nf_conntrack {
> > @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> >                              const struct flow_dissector_key *key,
> >                              unsigned int key_count);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > +                                      struct bpf_prog *prog);
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> > +
> >  bool __skb_flow_dissect(const struct sk_buff *skb,
> >                         struct flow_dissector *flow_dissector,
> >                         void *target_container,
> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > index 6a4586dcdede..edb919d320c1 100644
> > --- a/include/net/flow_dissector.h
> > +++ b/include/net/flow_dissector.h
> > @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
> >  extern struct flow_dissector flow_keys_dissector;
> >  extern struct flow_dissector flow_keys_basic_dissector;
> >
> > +/* struct bpf_flow_dissect_cb:
> > + *
> > + * This struct is used to pass parameters to BPF programs of type
> > + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> > + * the control block of the skb to be a struct of this type. The first field is
> > + * used to communicate the next header offset between the BPF programs and the
> > + * first value of it is passed from the kernel. The last two fields are used for
> > + * writing out flow keys.
> > + */
> > +struct bpf_flow_dissect_cb {
> > +       u16 nhoff;
> > +       u16 unused;
> > +       void *target_container;
> > +       struct flow_dissector *flow_dissector;
> > +};
> > +
> >  /* struct flow_keys_digest:
> >   *
> >   * This structure is used to hold a digest of the full flow keys. This is a
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 66917a4eba27..8bc0fdab685d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
> >         BPF_PROG_TYPE_LIRC_MODE2,
> >         BPF_PROG_TYPE_SK_REUSEPORT,
> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
> >         BPF_CGROUP_UDP4_SENDMSG,
> >         BPF_CGROUP_UDP6_SENDMSG,
> >         BPF_LIRC_MODE2,
> > +       BPF_FLOW_DISSECTOR,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -2141,6 +2143,15 @@ union bpf_attr {
> >   *             request in the skb.
> >   *     Return
> >   *             0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> > + *     Description
> > + *             Try to write *len* bytes from the source pointer into the offset
> > + *             of the key with id *key_id*. If *len* is different from the
> > + *             size of the key, an error is returned. If the key is not used,
> > + *             this function exits with no effect and code 0.
> > + *     Return
> > + *             0 on success, negative error in case of failure.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -2226,7 +2237,8 @@ union bpf_attr {
> >         FN(get_current_cgroup_id),      \
> >         FN(get_local_storage),          \
> >         FN(sk_select_reuseport),        \
> > -       FN(skb_ancestor_cgroup_id),
> > +       FN(skb_ancestor_cgroup_id),     \
> > +       FN(flow_dissector_write_keys),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 43727ed0d94a..a06568841a92 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >         case BPF_LIRC_MODE2:
> >                 ptype = BPF_PROG_TYPE_LIRC_MODE2;
> >                 break;
> > +       case BPF_FLOW_DISSECTOR:
> > +               ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> > +               break;
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >         case BPF_PROG_TYPE_LIRC_MODE2:
> >                 ret = lirc_prog_attach(attr, prog);
> >                 break;
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> > +               ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> > +               break;
> >         default:
> >                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> >         }
> > @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> >                 return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
> >         case BPF_LIRC_MODE2:
> >                 return lirc_prog_detach(attr);
> > +       case BPF_FLOW_DISSECTOR:
> > +               return skb_flow_dissector_bpf_prog_detach(attr);
> >         default:
> >                 return -EINVAL;
> >         }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index ca90679a7fe5..6d3f268fa8e0 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> >         case BPF_PROG_TYPE_LWT_XMIT:
> >         case BPF_PROG_TYPE_SK_SKB:
> >         case BPF_PROG_TYPE_SK_MSG:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 if (meta)
> >                         return meta->pkt_access;
> >
> > @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> >         case BPF_PROG_TYPE_SOCKET_FILTER:
> >         case BPF_PROG_TYPE_SCHED_CLS:
> >         case BPF_PROG_TYPE_SCHED_ACT:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 return true;
> >         default:
> >                 return false;
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fd423ce3da34..03d3037e6508 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
> >         return false;
> >  }
> >
> > +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> > +          const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> > +{
> > +       struct bpf_flow_dissect_cb *cb;
> > +       void *dest;
> > +
> > +       cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> > +
> > +       /* Make sure the dissector actually uses the key. It is not an error if
> > +        * it does not, but we should not continue past this point in that case
> > +        */
> > +       if (!dissector_uses_key(cb->flow_dissector, key_id))
> > +               return 0;
> > +
> > +       /* Make sure the length is correct */
> > +       switch (key_id) {
> > +       case FLOW_DISSECTOR_KEY_CONTROL:
> > +       case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> > +               if (len != sizeof(struct flow_dissector_key_control))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_BASIC:
> > +               if (len != sizeof(struct flow_dissector_key_basic))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> > +       case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> > +       case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ICMP:
> > +               if (len != sizeof(struct flow_dissector_key_icmp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_PORTS:
> > +       case FLOW_DISSECTOR_KEY_ENC_PORTS:
> > +               if (len != sizeof(struct flow_dissector_key_ports))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_eth_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_TIPC:
> > +               if (len != sizeof(struct flow_dissector_key_tipc))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ARP:
> > +               if (len != sizeof(struct flow_dissector_key_arp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_VLAN:
> > +       case FLOW_DISSECTOR_KEY_CVLAN:
> > +               if (len != sizeof(struct flow_dissector_key_vlan))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> > +               if (len != sizeof(struct flow_dissector_key_tags))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_GRE_KEYID:
> > +       case FLOW_DISSECTOR_KEY_ENC_KEYID:
> > +       case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> > +               if (len != sizeof(struct flow_dissector_key_keyid))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_MPLS:
> > +               if (len != sizeof(struct flow_dissector_key_mpls))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_TCP:
> > +               if (len != sizeof(struct flow_dissector_key_tcp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IP:
> > +       case FLOW_DISSECTOR_KEY_ENC_IP:
> > +               if (len != sizeof(struct flow_dissector_key_ip))
> > +                       return -EINVAL;
> > +               break;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +
> > +       dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> > +                                        cb->target_container);
> > +
> > +       memcpy(dest, from, len);
> > +       return 0;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> > +       .func           = bpf_flow_dissector_write_keys,
> > +       .gpl_only       = false,
> > +       .ret_type       = RET_INTEGER,
> > +       .arg1_type      = ARG_PTR_TO_CTX,
> > +       .arg2_type      = ARG_PTR_TO_MEM,
> > +       .arg3_type      = ARG_CONST_SIZE,
> > +       .arg4_type      = ARG_ANYTHING,
> > +};
> > +
> >  static const struct bpf_func_proto *
> >  bpf_base_func_proto(enum bpf_func_id func_id)
> >  {
> > @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >         }
> >  }
> >
> > +static const struct bpf_func_proto *
> > +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +       switch (func_id) {
> > +       case BPF_FUNC_skb_load_bytes:
> > +               return &bpf_skb_load_bytes_proto;
> > +       case BPF_FUNC_flow_dissector_write_keys:
> > +               return &bpf_flow_dissector_write_keys_proto;
> > +       default:
> > +               return bpf_base_func_proto(func_id);
> > +       }
> > +}
> > +
> >  static const struct bpf_func_proto *
> >  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  {
> > @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
> >         return true;
> >  }
> >
> > +static bool flow_dissector_is_valid_access(int off, int size,
> > +                                          enum bpf_access_type type,
> > +                                          const struct bpf_prog *prog,
> > +                                          struct bpf_insn_access_aux *info)
> > +{
> > +       if (type == BPF_WRITE) {
> > +               switch (off) {
> > +               case bpf_ctx_range(struct __sk_buff, cb[0]):
> > +                       break;
> > +               default:
> > +                       return false;
> > +               }
> > +       }
> > +
> > +       switch (off) {
> > +       case bpf_ctx_range(struct __sk_buff, data):
> > +               info->reg_type = PTR_TO_PACKET;
> > +               break;
> > +       case bpf_ctx_range(struct __sk_buff, data_end):
> > +               info->reg_type = PTR_TO_PACKET_END;
> > +               break;
> > +       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> > +       case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> > +               return false;
> > +       }
> > +
> > +       return bpf_skb_is_valid_access(off, size, type, prog, info);
> > +}
> > +
> >  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> >                                   const struct bpf_insn *si,
> >                                   struct bpf_insn *insn_buf,
> > @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
> >  const struct bpf_prog_ops sk_msg_prog_ops = {
> >  };
> >
> > +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> > +       .get_func_proto         = flow_dissector_func_proto,
> > +       .is_valid_access        = flow_dissector_is_valid_access,
> > +       .convert_ctx_access     = bpf_convert_ctx_access,
> > +       .gen_ld_abs             = bpf_gen_ld_abs,
> > +};
> > +
> > +const struct bpf_prog_ops flow_dissector_prog_ops = {
> > +};
> > +
> >  int sk_detach_filter(struct sock *sk)
> >  {
> >         int ret = -ENOENT;
> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > index ce9eeeb7c024..767daa231f04 100644
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c
> > @@ -25,6 +25,11 @@
> >  #include <net/flow_dissector.h>
> >  #include <scsi/fc/fc_fcoe.h>
> >  #include <uapi/linux/batadv_packet.h>
> > +#include <linux/bpf.h>
> > +
> > +/* BPF program accessible by all flow dissectors */
> > +static struct bpf_prog __rcu *flow_dissector_prog;
> > +static DEFINE_MUTEX(flow_dissector_mutex);
> >
> >  static void dissector_set_key(struct flow_dissector *flow_dissector,
> >                               enum flow_dissector_key_id key_id)
> > @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> >  }
> >  EXPORT_SYMBOL(skb_flow_dissector_init);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > +                                      struct bpf_prog *prog)
> > +{
> > +       struct bpf_prog *attached;
> > +
> > +       mutex_lock(&flow_dissector_mutex);
> > +       attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                            lockdep_is_held(&flow_dissector_mutex));
> > +       if (attached) {
> > +               /* Only one BPF program can be attached at a time */
> > +               mutex_unlock(&flow_dissector_mutex);
> > +               return -EEXIST;
> > +       }
> > +       rcu_assign_pointer(flow_dissector_prog, prog);
> > +       mutex_unlock(&flow_dissector_mutex);
> > +       return 0;
> > +}
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > +{
> > +       struct bpf_prog *attached;
> > +
> > +       mutex_lock(&flow_dissector_mutex);
> > +       attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                            lockdep_is_held(&flow_dissector_mutex));
> > +       if (!flow_dissector_prog) {
> > +               mutex_unlock(&flow_dissector_mutex);
> > +               return -EINVAL;
> > +       }
> > +       bpf_prog_put(attached);
> > +       RCU_INIT_POINTER(flow_dissector_prog, NULL);
> > +       mutex_unlock(&flow_dissector_mutex);
> > +       return 0;
> > +}
> >  /**
> >   * skb_flow_get_be16 - extract be16 entity
> >   * @skb: sk_buff to extract from
> > @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >         struct flow_dissector_key_vlan *key_vlan;
> >         enum flow_dissect_ret fdret;
> >         enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> > +       struct bpf_prog *attached;
> >         int num_hdrs = 0;
> >         u8 ip_proto = 0;
> >         bool ret;
> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >                                               FLOW_DISSECTOR_KEY_BASIC,
> >                                               target_container);
> >
> > +       rcu_read_lock();
> > +       attached = rcu_dereference(flow_dissector_prog);
> > +       if (attached) {
> > +               /* Note that even though the const qualifier is discarded
> > +                * throughout the execution of the BPF program, all changes(the
> > +                * control block) are reverted after the BPF program returns.
> > +                * Therefore, __skb_flow_dissect does not alter the skb.
> > +                */
> > +               struct bpf_flow_dissect_cb *cb;
> > +               u8 cb_saved[BPF_SKB_CB_LEN];
> > +               u32 result;
> > +
> > +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> > +
> > +               /* Save Control Block */
> > +               memcpy(cb_saved, cb, sizeof(cb_saved));
> > +               memset(cb, 0, sizeof(cb_saved));
> > +
> > +               /* Pass parameters to the BPF program */
> > +               cb->nhoff = nhoff;
> > +               cb->target_container = target_container;
> > +               cb->flow_dissector = flow_dissector;
> > +
> > +               bpf_compute_data_pointers((struct sk_buff *)skb);
> > +               result = BPF_PROG_RUN(attached, skb);
> > +
> > +               /* Restore state */
> > +               memcpy(cb, cb_saved, sizeof(cb_saved));
> > +
> > +               key_control->thoff = min_t(u16, key_control->thoff,
> > +                                          skb ? skb->len : hlen);
> > +               rcu_read_unlock();
> > +               return result == BPF_OK;
> > +       }
>
> If the BPF program cannot handle certain protocol, shall we fall back
> to the built-in logic? Otherwise, all BPF programs need to have some
> code for all protocols.
>
> Song

I believe that if we fall back to the built-in logic we lose all security
guarantees from BPF and this is why the code does not support
fall back.

Petar
>
>
> > +       rcu_read_unlock();
> > +
> >         if (dissector_uses_key(flow_dissector,
> >                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> >                 struct ethhdr *eth = eth_hdr(skb);
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index dce960d22106..b1cd3bc8db70 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
> >         [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
> >         [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
> >         [BPF_PROG_TYPE_LIRC_MODE2]      = "lirc_mode2",
> > +       [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
> >  };
> >
> >  static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 66917a4eba27..acd74a0dd063 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
> >         BPF_PROG_TYPE_LIRC_MODE2,
> >         BPF_PROG_TYPE_SK_REUSEPORT,
> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
> >         BPF_CGROUP_UDP4_SENDMSG,
> >         BPF_CGROUP_UDP6_SENDMSG,
> >         BPF_LIRC_MODE2,
> > +       BPF_FLOW_DISSECTOR,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -2226,7 +2228,8 @@ union bpf_attr {
> >         FN(get_current_cgroup_id),      \
> >         FN(get_local_storage),          \
> >         FN(sk_select_reuseport),        \
> > -       FN(skb_ancestor_cgroup_id),
> > +       FN(skb_ancestor_cgroup_id),     \
> > +       FN(flow_dissector_write_keys),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 2abd0f112627..0c749ce1b717 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
> >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> >         case BPF_PROG_TYPE_LIRC_MODE2:
> >         case BPF_PROG_TYPE_SK_REUSEPORT:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 return false;
> >         case BPF_PROG_TYPE_UNSPEC:
> >         case BPF_PROG_TYPE_KPROBE:
> > @@ -2121,6 +2122,7 @@ static const struct {
> >         BPF_PROG_SEC("sk_skb",          BPF_PROG_TYPE_SK_SKB),
> >         BPF_PROG_SEC("sk_msg",          BPF_PROG_TYPE_SK_MSG),
> >         BPF_PROG_SEC("lirc_mode2",      BPF_PROG_TYPE_LIRC_MODE2),
> > +       BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
> >         BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
> >         BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
> >         BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> > index e4be7730222d..4204c496a04f 100644
> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> > @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
> >         (void *) BPF_FUNC_skb_cgroup_id;
> >  static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
> >         (void *) BPF_FUNC_skb_ancestor_cgroup_id;
> > +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
> > +                                           int key) =
> > +       (void *) BPF_FUNC_flow_dissector_write_keys;
> >
> >  /* llvm builtin functions that eBPF C program may use to
> >   * emit BPF_LD_ABS and BPF_LD_IND instructions
> > --
> > 2.18.0.865.gffc8e1a3cd6-goog
> >

^ permalink raw reply

* [GIT PULL] 9p updates for 4.19
From: Dominique Martinet @ 2018-08-17  2:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: v9fs-developer, linux-kernel, netdev
In-Reply-To: <20180813012347.GA32555@nautica>

Hi Linus,

9p has seen some recent surge of activity recently and I've stepped up
to help maintaining 9p; please let me know if there are things to
improve in how I do things.
In particular, the gpg key I used to sign the tag is getting rather old
(1024 bit dsa is considered weak nowadays) and I will have a newer key
ready for 4.20, but I didn't think I would be able to gather signatures
for 4.19 so went for this one for now. Sorry for that.

I've rebased the branch at the start of the week to add some stable Cc
but all the patches have been in linux-next for two weeks, and all
transports have been tested since the rebase (thanks to Stefano
Stabellini for xen!)


The following changes since commit 94710cac0ef4ee177a63b5227664b38c95bbf703:

  Linux 4.18 (2018-08-12 13:41:04 -0700)

are available in the Git repository at:

  git://github.com/martinetd/linux tags/9p-for-4.19-2

for you to fetch changes up to edcd9d977354304cb85aee61c2b96809edce41ed:

  net/9p/trans_virtio.c: add null terminal for mount tag (2018-08-13 09:34:58 +0900)

----------------------------------------------------------------
Pull request for inclusion in 4.19 for 9p

Contains mostly fixes (6 to be backported to stable) and a few changes,
here is the breakdown:
 * Rework how fids are attributed by replacing some custom tracking in a
list by an idr (f28cdf0430fc)
 * For packet-based transports (virtio/rdma) validate that the packet
length matches what the header says (f984579a01d8)
 * A few race condition fixes found by syzkaller (9f476d7c540c,
430ac66eb4c5)
 * Missing argument check when NULL device is passed in sys_mount
(10aa14527f45)
 * A few virtio fixes (23cba9cbde0b, 31934da81036, d28c756caee6)
 * Some spelling and style fixes

----------------------------------------------------------------
Chirantan Ekbote (1):
      9p/net: Fix zero-copy path in the 9p virtio transport

Colin Ian King (1):
      fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"

Jean-Philippe Brucker (1):
      net/9p: fix error path of p9_virtio_probe

Matthew Wilcox (4):
      9p: Fix comment on smp_wmb
      9p: Change p9_fid_create calling convention
      9p: Replace the fidlist with an IDR
      9p: Embed wait_queue_head into p9_req_t

Souptick Joarder (1):
      fs/9p/vfs_file.c: use new return type vm_fault_t

Stephen Hemminger (1):
      9p: fix whitespace issues

Tomas Bortoli (5):
      net/9p/client.c: version pointer uninitialized
      net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
      net/9p/trans_fd.c: fix race by holding the lock
      9p: validate PDU length
      9p: fix multiple NULL-pointer-dereferences

jiangyiwen (2):
      net/9p/virtio: Fix hard lockup in req_done
      9p/virtio: fix off-by-one error in sg list bounds check

piaojun (5):
      net/9p/client.c: add missing '\n' at the end of p9_debug()
      9p/net/protocol.c: return -ENOMEM when kmalloc() failed
      net/9p/trans_virtio.c: fix some spell mistakes in comments
      fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
      net/9p/trans_virtio.c: add null terminal for mount tag

 fs/9p/v9fs.c            |   2 +-
 fs/9p/vfs_file.c        |   2 +-
 fs/9p/xattr.c           |   6 ++-
 include/net/9p/client.h |  11 ++---
 net/9p/client.c         | 119 ++++++++++++++++++++----------------------------
 net/9p/protocol.c       |   2 +-
 net/9p/trans_fd.c       |  22 ++++++---
 net/9p/trans_rdma.c     |   4 ++
 net/9p/trans_virtio.c   |  66 ++++++++++++++++-----------
 net/9p/trans_xen.c      |   3 ++
 net/9p/util.c           |   1 -
 11 files changed, 122 insertions(+), 116 deletions(-)
-- 
Dominique Martinet

^ permalink raw reply

* Re: Under what conditions is phy_device "adjust_link()" called?
From: Andrew Lunn @ 2018-08-16 23:33 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: rpjday, netdev
In-Reply-To: <53c990a5-9b27-df58-43c3-517708e97511@gmail.com>

On Thu, Aug 16, 2018 at 11:59:19AM -0700, Florian Fainelli wrote:
> On 08/16/2018 10:26 AM, rpjday@crashcourse.ca wrote:
> > 
> > I can see from the documentation that the callback adjust_link() is invoked
> > "for the enet controller to respond to changes in the link state." Is there
> > a specific list of the events that would generate such a change? Are we
> > talking initially opening the device, ifup/ifdown, physically unplugging
> > from the port, some or all of the above?
> 
> adjust_link() is typically called on transitions from link UP to DOWN
> and DOWN to UP. This may include the initial configuration of the PHY
> during e.g: phy_connect() and then typically when an event occurs than
> requires a re-configuration of the MAC: link parameters (speed, status,
> duplex, pause) changed.

Just adding to that, if you have a 10G PHY, the MAC-PHY interface mode
can also change, e.g. SGMII for 10/100/1000, and 10GBase-T for 10G,
etc.

	Andrew

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: maowenan @ 2018-08-17  2:48 UTC (permalink / raw)
  To: Michal Kubecek, Greg KH
  Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw, stable,
	Takashi Iwai
In-Reply-To: <20180816160616.u3refk4mqpyqagzi@unicorn.suse.cz>



On 2018/8/17 0:06, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>
>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>
>>> What I can see, though, is that with current stable 4.4 code, modified
>>> testcase which sends something like
>>>
>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>
>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>> preserving nice performance when using the original segmentsmack
>>> testcase (with increased packet ratio).
>>
>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>> and test and let me know if it does, or does not, fix the reported
>> issues?
> 
> I did repeat the tests with Takashi's fix and the CPU utilization is
> similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still
> saturate one CPU somewhere around 50K pkt/s but that already requires
> 2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's
> changes in fact used lower speeds as the change from 128 to 1024 would
> need to be done in two places.)
> 
> Where Takashi's patch does help is that it does not prevent collapsing
> of ranges of adjacent segments with total length shorter than ~4KB. It
> took more time to verify: it cannot be checked by watching the socket
> memory consumption with ss as tcp_collapse_ofo_queue isn't called until
> we reach the limits. So I needed to trace when and how tcp_collpse() is
> called with both current stable 4.4 code and one with Takashi's fix.

The POC is default to attack Raspberry Pi system, whose cpu performance is lower,
so the default parameter is not aggressive, we would enlarge parameter to test
in our intel skylake system(with high performance), if don't do this, cpu usage isn't
obvious different with fixed patch and without fixed patch, you can't distinguish
whether the patch can really fix it or not.

I have made series testing here, including low rate attacking(128B,100ms interval)
and high rate attacking(1024B,10ms interval), with original 4.4 kernel, only Takashi's patch,
and only Mao Wenan's patches. I will check the cpu usage of ksoftirq.

	      original	Takashi	Mao Wenan
low rate 	3%	2%	2%
high rate 	50%	49%	~10%

so, I can't identify whether Takashi's patch can really fix radical issue, which I think
the root reason exist in simple queue, and Eric's patch
72cd43ba tcp: free batches of packets in tcp_prune_ofo_queue() can completely fix this,
which have already involved in my patch series. This patch need change simple queue to
RB tree, and it is high efficiency searching and dropping packets, and avoid large tcp retransmitting.
so cpu usage will be fall down.

>   
>> If not, we can go from there and evaluate this much larger patch
>> series.  But let's try the simple thing first.
> 
> At high packet rates (say 30K pkt/s and more), we can still saturate the
> CPU. This is also mentioned in the announcement with claim that switch
> to rbtree based queue would be necessary to fully address that. My tests
> seem to confirm that but I'm still not sure it is worth backporting
> something as intrusive into stable 4.4.
> 
> Michal Kubecek
> 
> .
> 

^ permalink raw reply

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: Jian-Hong Pan @ 2018-08-17  2:03 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <ff5ea624-0bd5-ef9e-9c02-deb4c1de601b@gmail.com>

2018-08-17 2:59 GMT+08:00 Heiner Kallweit <hkallweit1@gmail.com>:
>> From: Jian-Hong Pan <jian-hong@endlessm.com>
>>
>> Found the ethernet network on ASUS X441UAR doesn't come back on resume
>> from suspend when using MSI-X.  The chip is RTL8106e - version 39.
>>
> The patch itself looks good, just the commit message is wrong in one
> place and a little bit long.
>
>> asus@endless:~$ dmesg | grep r8169
>> [   21.848357] libphy: r8169: probed
>> [   21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID
>> 44900000, IRQ 127
>> [   22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [   29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [   63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>> [  124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>>
>> Here is the ethernet controller in detail:
>>
>> asus@endless:~$ sudo lspci -nnvs 02:00.0
>> [sudo] password for asus:
>> 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>> RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136]
>> (rev 07)
>>       Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast
>> Ethernet controller [1043:200f]
>>       Flags: bus master, fast devsel, latency 0, IRQ 16
>>       I/O ports at e000 [size=256]
>>       Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
>>       Memory at e0000000 (64-bit, prefetchable) [size=16K]
>>       Capabilities: [40] Power Management version 3
>>       Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>       Capabilities: [70] Express Endpoint, MSI 01
>>       Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>       Capabilities: [d0] Vital Product Data
>>       Capabilities: [100] Advanced Error Reporting
>>       Capabilities: [140] Virtual Channel
>>       Capabilities: [160] Device Serial Number 01-00-00-00-36-4c-e0-00
>>       Capabilities: [170] Latency Tolerance Reporting
>>       Kernel driver in use: r8169
>>       Kernel modules: r8169
>>
>> Here is the system interrupt table:
>>
>> asus@endless:~$ cat /proc/interrupts
>>             CPU0       CPU1       CPU2       CPU3
>>    0:         22          0          0          0   IO-APIC    2-edge
>> timer
>>    1:        157         42          0          0   IO-APIC    1-edge
>> i8042
>>    8:          0          0          1          0   IO-APIC    8-edge
>> rtc0
>>    9:         10         13          0          0   IO-APIC    9-fasteoi
>> acpi
>>   16:          0          0          0          0   IO-APIC   16-fasteoi
>> i2c_designware.0, i801_smbus
>>   17:       2445          0       3453          0   IO-APIC   17-fasteoi
>> i2c_designware.1, rtl_pci
>>  109:          2          0          0          1   IO-APIC  109-fasteoi
>> FTE1200:00
>>  120:          0          0          0          0   PCI-MSI 458752-edge
>> PCIe PME
>>  121:          0          0          0          0   PCI-MSI 466944-edge
>> PCIe PME
>>  122:          0          0          0          0   PCI-MSI 468992-edge
>> PCIe PME
>>  123:       1465          0          0      21263   PCI-MSI 376832-edge
>> ahci[0000:00:17.0]
>>  124:          0        530          0          0   PCI-MSI 327680-edge
>> xhci_hcd
>>  125:       5204          0          0          0   PCI-MSI 32768-edge
>> i915
>>  126:          0          0        149          0   PCI-MSI 514048-edge
>> snd_hda_intel:card0
>>  127:          0          0        337          0   PCI-MSI 1048576-edge
>> enp2s0
>>  NMI:          0          0          0          0   Non-maskable
>> interrupts
>>  LOC:      45049      39474      38978      46677   Local timer
>> interrupts
>>  SPU:          0          0          0          0   Spurious interrupts
>>  PMI:          0          0          0          0   Performance
>> monitoring interrupts
>>  IWI:        619          8          0          1   IRQ work interrupts
>>  RTR:          6          0          0          0   APIC ICR read
>> retries
>>  RES:       4918       4436       3835       2943   Rescheduling
>> interrupts
>>  CAL:       1399       1478       1598       1465   Function call
>> interrupts
>>  TLB:        608        513        723        559   TLB shootdowns
>>  TRM:          0          0          0          0   Thermal event
>> interrupts
>>  THR:          0          0          0          0   Threshold APIC
>> interrupts
>>  DFR:          0          0          0          0   Deferred Error APIC
>> interrupts
>>  MCE:          0          0          0          0   Machine check
>> exceptions
>>  MCP:          3          4          4          4   Machine check polls
>>  ERR:          0
>>  MIS:          0
>>  PIN:          0          0          0          0   Posted-interrupt
>> notification event
>>  NPI:          0          0          0          0   Nested
>> posted-interrupt event
>>  PIW:          0          0          0          0   Posted-interrupt
>> wakeup event
>>
>> It is the IRQ 127 - PCI-MSI used by enp2s0.  However, lspci lists MSI is
>> disabled and MSI-X is enabled which conflicts to the interrupt table.
>>
> Both types of interrupts, MSI and MSI-X, are listed with irq chip name
> "PCI-MSI", because MSI-X is treated as a sub-feature of MSI.
> Therefore the output of /proc/interrupts doesn't allow to tell whether
> a MSI or MSI-X interrupt is used, and as a consequence there is no such
> conflict.
> Indeed only lspci provides the information whether MSI or MSI-X is used.

Oh!  Thanks for your information.  I learned and noted.
I am preparing the version 2 patch with modified commit message.

>> Falling back to MSI fixes the issue.
>>
>> Here is the test result with this patch in dmesg:
>>
>> asus@endless:~$ dmesg | grep r8169
>> [   22.017477] libphy: r8169: probed
>> [   22.017735] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID
>> 44900000, IRQ 127
>> [   22.041489] r8169 0000:02:00.0 enp2s0: renamed from eth0
>> [   29.138312] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [   30.927359] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>> [  289.998077] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>> [  290.508084] Generic PHY r8169-200:00: attached PHY driver [Generic
>> PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
>> [  290.745690] r8169 0000:02:00.0 enp2s0: Link is Down
>> [  292.367717] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
>> flow control off
>>
>> lspci lists MSI is enabled and MSI-X is disabled with this patch:
>>
>> asus@endless:~/linux-net$ sudo lspci -nnvs 02:00.0
>> [sudo] password for asus:
>> 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>> RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136]
>> (rev 07)
>>       Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast
>> Ethernet controller [1043:200f]
>>       Flags: bus master, fast devsel, latency 0, IRQ 127
>>       I/O ports at e000 [size=256]
>>       Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
>>       Memory at e0000000 (64-bit, prefetchable) [size=16K]
>>       Capabilities: [40] Power Management version 3
>>       Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>       Capabilities: [70] Express Endpoint, MSI 01
>>       Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
>>       Capabilities: [d0] Vital Product Data
>>       Capabilities: [100] Advanced Error Reporting
>>       Capabilities: [140] Virtual Channel
>>       Capabilities: [160] Device Serial Number 01-00-00-00-36-4c-e0-00
>>       Capabilities: [170] Latency Tolerance Reporting
>>       Kernel driver in use: r8169
>>       Kernel modules: r8169
>>
>> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
>> ---
>>  drivers/net/ethernet/realtek/r8169.c | 9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 0d9c3831838f..0efa977c422d 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -7071,17 +7071,20 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
>>  {
>>       unsigned int flags;
>>
>> -     if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
>> +     switch (tp->mac_version) {
>> +     case RTL_GIGA_MAC_VER_01 ... RTL_GIGA_MAC_VER_06:
>>               RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
>>               RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
>>               RTL_W8(tp, Cfg9346, Cfg9346_Lock);
>>               flags = PCI_IRQ_LEGACY;
>> -     } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
>> +             break;
>> +     case RTL_GIGA_MAC_VER_39 ... RTL_GIGA_MAC_VER_40:
>>               /* This version was reported to have issues with resume
>>                * from suspend when using MSI-X
>>                */
>>               flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
>> -     } else {
>> +             break;
>> +     default:
>>               flags = PCI_IRQ_ALL_TYPES;
>>       }
>>
>>
>
>

^ permalink raw reply

* [PATCH v2 net] r8169: don't use MSI-X on RTL8106e
From: Jian-Hong Pan @ 2018-08-17  5:07 UTC (permalink / raw)
  To: Heiner Kallweit, David Miller, nic_swsd, netdev, linux-kernel,
	linux
  Cc: Jian-Hong Pan
In-Reply-To: <20180815062110.16155-1-jian-hong@endlessm.com>

Found the ethernet network on ASUS X441UAR doesn't come back on resume
from suspend when using MSI-X.  The chip is RTL8106e - version 39.

[   21.848357] libphy: r8169: probed
[   21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID
44900000, IRQ 127
[   22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
[   29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic
PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[   63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full -
flow control off
[  124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic
PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)

Here is the ethernet controller in detail:

02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136]
(rev 07)
	Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast
Ethernet controller [1043:200f]
	Flags: bus master, fast devsel, latency 0, IRQ 16
	I/O ports at e000 [size=256]
	Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
	Memory at e0000000 (64-bit, prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169

Falling back to MSI fixes the issue.

Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling")
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
---
Changes in v2:
  - Make the commit message shorter
  - Add "Fixes" tag in the commit message

 drivers/net/ethernet/realtek/r8169.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d9c3831838f..0efa977c422d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7071,17 +7071,20 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
 {
 	unsigned int flags;
 
-	if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_01 ... RTL_GIGA_MAC_VER_06:
 		RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
 		RTL_W8(tp, Cfg9346, Cfg9346_Lock);
 		flags = PCI_IRQ_LEGACY;
-	} else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
+		break;
+	case RTL_GIGA_MAC_VER_39 ... RTL_GIGA_MAC_VER_40:
 		/* This version was reported to have issues with resume
 		 * from suspend when using MSI-X
 		 */
 		flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
-	} else {
+		break;
+	default:
 		flags = PCI_IRQ_ALL_TYPES;
 	}
 
-- 
2.11.0

^ permalink raw reply related

* KINDLY REPLY diplmosesd@gmail.com URGENTLY
From: MR MOSES @ 2018-08-17  2:16 UTC (permalink / raw)
  To: Recipients

KINDLY REPLY diplmosesd@gmail.com URGENTLY

^ permalink raw reply

* Re: bnxt: card intermittently hanging and dropping link
From: Michael Chan @ 2018-08-17  2:37 UTC (permalink / raw)
  To: Daniel Axtens, Ashwin Thiagarajan, Carl Tung; +Cc: Netdev, jay.vosburgh
In-Reply-To: <87lg9665l4.fsf@linkitivity.dja.id.au>

On Thu, Aug 16, 2018 at 2:09 AM, Daniel Axtens <dja@axtens.net> wrote:
> Hi Michael,
>
>> The main issue is the TX timeout.
>> .....
>>
>>> [ 2682.911693] bnxt_en 0000:3b:00.0 eth4: TX timeout detected, starting reset task!
>>> [ 2683.782496] bnxt_en 0000:3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2683.783061] bnxt_en 0000:3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>> [ 2684.634557] bnxt_en 0000:3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2684.635120] bnxt_en 0000:3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>
>> and it is not recovering.
>>
>> Please provide ethtool -i eth4 which will show the firmware version on
>> the NIC.  Let's see if the firmware is too old.
>
> driver: bnxt_en
> version: 1.8.0
> firmware-version: 20.6.151.0/pkg 20.06.05.11

I believe the firmware should be updated.  My colleague will contact
you on how to proceed.

Thanks.

^ permalink raw reply

* Re: [PATCH 1/1] tap: comment fix
From: Jason Wang @ 2018-08-17  4:28 UTC (permalink / raw)
  To: David Miller, jianjian.wang1
  Cc: girish.moodalbail, mst, willemb, viro, wexu, netdev, linux-kernel
In-Reply-To: <20180816.123002.844201503080913720.davem@davemloft.net>



On 2018年08月17日 03:30, David Miller wrote:
> From: Wang Jian <jianjian.wang1@gmail.com>
> Date: Thu, 16 Aug 2018 21:01:27 +0800
>
>> The tap_queue and the "tap_dev" are loosely coupled, not "macvlan_dev".
>>
>> And I also change one rcu_read_lock's place, seems can reduce rcu
>> critical section a little.
>>
>> Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>
> This patch was corrupted by your email client, for example it turned
> TAB characters into sequences of spaces.
>
> Please fix this, email a test patch to yourself, and do not resend the
> patch to this mailing list until you can successfully extract and
> cleanly apply the test patch you email to yourself.
>
> Thank you.

Besides this, please split it into two patches. The RCU change does not 
belong to "comment fix" for sure.

Thanks

^ permalink raw reply

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
From: D. J. Bernstein @ 2018-08-17  7:31 UTC (permalink / raw)
  To: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan
In-Reply-To: <20180816194620.GA185651@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5175 bytes --]

Eric Biggers writes:
> If (more likely) you're talking about things like "use this NEON implementation
> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up the
> developers and community to test different CPUs and make appropriate decisions,
> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
> to, and I appreciate your work in that area.

You seem to be talking about a process that selects (e.g.) ChaCha20
implementations as follows: manually inspect benchmarks of various
implementations on various CPUs, manually write code to map CPUs to
implementations, manually update the code as necessary for new CPUs, and
of course manually do the same for every other primitive that can see
differences between microarchitectures (which isn't something weird---
it's the normal situation after enough optimization effort).

This is quite a bit of manual work, so the kernel often doesn't do it,
so we end up with unhappy people talking about performance regressions.

For comparison, imagine one simple central piece of code in the kernel
to automatically do the following:

   When a CPU core is booted:
     For each primitive:
       Benchmark all implementations of the primitive on the core.
       Select the fastest for subsequent use on the core.

If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
manual work per primitive, and no work per implementation. Each CPU, old
or new, automatically obtains the fastest available code for that CPU.

The only cost is a moment of benchmarking at boot time. _If_ this is a
noticeable cost then there are many ways to speed it up: for example,
automatically copy the results across identical cores, automatically
copy the results across boots if the cores are unchanged, automatically
copy results from a central database indexed by CPU identifiers, etc.
The SUPERCOP database is evolving towards enabling this type of sharing.

> A lot of code can be shared, but in practice different environments have
> different constraints, and kernel programming in particular has some distinct
> differences from userspace programming.  For example, you cannot just use the
> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
> most architectures they can't be used in some contexts such as hardirq context,
> and even when they *can* be used you have to run special code before and after
> which does things like saving all the FPU registers to the task_struct,
> disabling preemption, and/or enabling the FPU.

Is there some reason that each implementor is being pestered to handle
all this? Detecting FPU usage is a simple static-analysis exercise, and
the rest sounds like straightforward boilerplate that should be handled
centrally.

> But disabling preemption for
> long periods of time hurts responsiveness, so it's also desirable to yield the
> processor occasionally, which means that assembly implementations should be
> incremental rather than having a single entry point that does everything.

Doing this rewrite automatically is a bit more of a code-analysis
challenge, but the alternative approach of doing it by hand is insanely
error-prone. See, e.g., https://eprint.iacr.org/2017/891.

> Many people may have contributed to SUPERCOP already, but that doesn't mean
> there aren't things you could do to make it more appealing to contributors and
> more of a community project,

The logic in this sentence is impeccable, and is already illustrated by
many SUPERCOP improvements through the years from an increasing number
of contributors, as summarized in the 87 release announcements so far on
the relevant public mailing list, which you're welcome to study in
detail along with the 400 megabytes of current code and as many previous
versions as you're interested in. That's also the mailing list where
people are told to send patches, as you'll see if you RTFM.

> So Linux distributions may not want to take on the legal risk of
> distributing it

This is a puzzling comment. A moment ago we were talking about the
possibility of useful sharing of (e.g.) ChaCha20 implementations between
SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
community's development process for these implementations. This doesn't
mean that the kernel should be grabbing implementations willy-nilly from
SUPERCOP---surely the kernel should be doing security audits, and the
kernel already has various coding requirements, and the kernel requires
GPL compatibility, while putting any of these requirements into SUPERCOP
would be counterproductive.

If you mean having the entire SUPERCOP benchmarking package distributed
through Linux distributions, I have no idea what your motivation is or
how this is supposed to be connected to anything else we're discussing.
Obviously SUPERCOP's broad code-inclusion policies make this idea a
non-starter.

> nor may companies want to take on the risk of contributing.

RTFM. People who submit code are authorizing public redistribution for
benchmarking. It's up to them to decide if they want to allow more.

---Dan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* [PATCH net-next] esp: remove redundant define esph
From: Haishuang Yan @ 2018-08-17  7:51 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller
  Cc: netdev, linux-kernel, Haishuang Yan

The pointer 'esph' is defined but is never used hence it is redundant
and canbe removed.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
 net/ipv4/esp4.c | 7 +++----
 net/ipv6/esp6.c | 7 +++----
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 9768901..211caaf 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -683,12 +683,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err)
  */
 static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 {
-	struct ip_esp_hdr *esph;
 	struct crypto_aead *aead = x->data;
 	struct aead_request *req;
 	struct sk_buff *trailer;
 	int ivlen = crypto_aead_ivsize(aead);
-	int elen = skb->len - sizeof(*esph) - ivlen;
+	int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
 	int nfrags;
 	int assoclen;
 	int seqhilen;
@@ -698,13 +697,13 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 	struct scatterlist *sg;
 	int err = -EINVAL;
 
-	if (!pskb_may_pull(skb, sizeof(*esph) + ivlen))
+	if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen))
 		goto out;
 
 	if (elen <= 0)
 		goto out;
 
-	assoclen = sizeof(*esph);
+	assoclen = sizeof(struct ip_esp_hdr);
 	seqhilen = 0;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 88a7579..63b2b66 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -601,12 +601,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err)
 
 static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 {
-	struct ip_esp_hdr *esph;
 	struct crypto_aead *aead = x->data;
 	struct aead_request *req;
 	struct sk_buff *trailer;
 	int ivlen = crypto_aead_ivsize(aead);
-	int elen = skb->len - sizeof(*esph) - ivlen;
+	int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
 	int nfrags;
 	int assoclen;
 	int seqhilen;
@@ -616,7 +615,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 	u8 *iv;
 	struct scatterlist *sg;
 
-	if (!pskb_may_pull(skb, sizeof(*esph) + ivlen)) {
+	if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen)) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -626,7 +625,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 		goto out;
 	}
 
-	assoclen = sizeof(*esph);
+	assoclen = sizeof(struct ip_esp_hdr);
 	seqhilen = 0;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH 1/1] tap: comment fix
From: Wang Jian @ 2018-08-17  8:24 UTC (permalink / raw)
  To: Jason
  Cc: David S . Miller, girish.moodalbail, mst, Willem de Bruijn, viro,
	wexu, netdev, linux-kernel
In-Reply-To: <d3c380fc-cebd-0a91-1acc-916a453c9f11@redhat.com>

Thanks for the reminder.
Because this change is trivial, I change the subject.
On Fri, Aug 17, 2018 at 12:29 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月17日 03:30, David Miller wrote:
> > From: Wang Jian <jianjian.wang1@gmail.com>
> > Date: Thu, 16 Aug 2018 21:01:27 +0800
> >
> >> The tap_queue and the "tap_dev" are loosely coupled, not "macvlan_dev".
> >>
> >> And I also change one rcu_read_lock's place, seems can reduce rcu
> >> critical section a little.
> >>
> >> Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>
> > This patch was corrupted by your email client, for example it turned
> > TAB characters into sequences of spaces.
> >
> > Please fix this, email a test patch to yourself, and do not resend the
> > patch to this mailing list until you can successfully extract and
> > cleanly apply the test patch you email to yourself.
> >
> > Thank you.
>
> Besides this, please split it into two patches. The RCU change does not
> belong to "comment fix" for sure.
>
> Thanks
>


-- 
Regards,
Wang Jian

^ permalink raw reply

* [PATCH iproute2-next] iproute_lwtunnel: allow specifying 'src' for 'encap ip' / 'encap ip6'
From: Shmulik Ladkani @ 2018-08-17  7:31 UTC (permalink / raw)
  To: stephen, dsahern; +Cc: netdev, shmulik.ladkani, Shmulik Ladkani

This allows the user to specify the LWTUNNEL_IP_SRC/LWTUNNEL_IP6_SRC
when setting an lwtunnel encapsulation route.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
 ip/iproute_lwtunnel.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 740da7c6..20d5545c 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -671,7 +671,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
 static int parse_encap_ip(struct rtattr *rta, size_t len,
 			  int *argcp, char ***argvp)
 {
-	int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
+	int id_ok = 0, dst_ok = 0, src_ok = 0, tos_ok = 0, ttl_ok = 0;
 	char **argv = *argvp;
 	int argc = *argcp;
 
@@ -694,6 +694,15 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 			get_addr(&addr, *argv, AF_INET);
 			rta_addattr_l(rta, len, LWTUNNEL_IP_DST,
 				      &addr.data, addr.bytelen);
+		} else if (strcmp(*argv, "src") == 0) {
+			inet_prefix addr;
+
+			NEXT_ARG();
+			if (src_ok++)
+				duparg2("src", *argv);
+			get_addr(&addr, *argv, AF_INET);
+			rta_addattr_l(rta, len, LWTUNNEL_IP_SRC,
+				      &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "tos") == 0) {
 			__u32 tos;
 
@@ -805,7 +814,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 static int parse_encap_ip6(struct rtattr *rta, size_t len,
 			   int *argcp, char ***argvp)
 {
-	int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
+	int id_ok = 0, dst_ok = 0, src_ok = 0, tos_ok = 0, ttl_ok = 0;
 	char **argv = *argvp;
 	int argc = *argcp;
 
@@ -828,6 +837,15 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 			get_addr(&addr, *argv, AF_INET6);
 			rta_addattr_l(rta, len, LWTUNNEL_IP6_DST,
 				      &addr.data, addr.bytelen);
+		} else if (strcmp(*argv, "src") == 0) {
+			inet_prefix addr;
+
+			NEXT_ARG();
+			if (src_ok++)
+				duparg2("src", *argv);
+			get_addr(&addr, *argv, AF_INET6);
+			rta_addattr_l(rta, len, LWTUNNEL_IP6_SRC,
+				      &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "tc") == 0) {
 			__u32 tc;
 
-- 
2.18.0

^ permalink raw reply related

* Re: [v3, net-next, 02/12] net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit
From: Jerome Brunet @ 2018-08-17  7:32 UTC (permalink / raw)
  To: Jose Abreu, netdev, open list:ARM/Amlogic Meson..., Kevin Hilman
  Cc: David S. Miller, Joao Pinto, Vitor Soares, Giuseppe Cavallaro,
	Alexandre Torgue, Bartosz Gołaszewski
In-Reply-To: <9d0be5db11478d00a9194065abcf137b4d537c0a.1526651009.git.joabreu@synopsys.com>

On Fri, 2018-05-18 at 14:55 +0100, Jose Abreu wrote:
> This is cutting down performance. Once the timer is armed it should run
> after the time expires for the first packet sent and not the last one.
> 
> After this change, running iperf, the performance gain is +/- 24%.

Hi Guys,

Since v4.18, we are getting a serious regression on Amlogic based SoCs.
I have tested this on amlogic's: 
* gxbb S905 p200 (Micrel KSZ9031 - 1GBps)
* axg A113 s400 (Realtek RTL8211F - 1GBps)

Both SoCs use the synopsys gmac with stmmac driver.

I first noticed that running NFS root filesystem became unstable but I could not
understand why. Then, running a download as simple test with iperf3 (from an
initramfs) will break the 'network' in matter of seconds.

I don't know exactly what breaks but bisect clearly assign the blame to this
change. Reverting the change solve this problem.

I'll be happy to make more tests to help understand what is happening here.

In the meantime, should we consider reverting this patch ?

Best Regards
Jerome

> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Vitor Soares <soares@synopsys.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac.h      |    1 +
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |    5 ++++-
>  2 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index 42fc76e..4d425b1 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -105,6 +105,7 @@ struct stmmac_priv {
>  	u32 tx_count_frames;
>  	u32 tx_coal_frames;
>  	u32 tx_coal_timer;
> +	bool tx_timer_armed;
>  
>  	int tx_coalesce;
>  	int hwts_tx_en;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index d9dbe13..789bc22 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3158,13 +3158,16 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>  	 * element in case of no SG.
>  	 */
>  	priv->tx_count_frames += nfrags + 1;
> -	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
> +	if (likely(priv->tx_coal_frames > priv->tx_count_frames) &&
> +	    !priv->tx_timer_armed) {
>  		mod_timer(&priv->txtimer,
>  			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
> +		priv->tx_timer_armed = true;
>  	} else {
>  		priv->tx_count_frames = 0;
>  		stmmac_set_tx_ic(priv, desc);
>  		priv->xstats.tx_set_ic_bit++;
> +		priv->tx_timer_armed = false;
>  	}
>  
>  	skb_tx_timestamp(skb);

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox