Netdev List


* Re: net/tipc: recursive locking in tipc_link_reset
From: Dmitry Vyukov @ 2018-10-11  7:59 UTC (permalink / raw)
  To: Jon Maloy, David Miller, Ying Xue, netdev, tipc-discussion, LKML
In-Reply-To: <CACT4Y+bcxEz=P2E5TRj6A1qeCOQ+WtWR44WGLzqdDttLzD393A@mail.gmail.com>

On Thu, Oct 11, 2018 at 9:55 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Hi,
>
> I am getting the following error while booting the latest kernel on
> bb2d8f2f61047cbde08b78ec03e4ebdb01ee5434 (Oct 10). Config is attached.
>
> Since this happens during boot, it makes LOCKDEP completely
> unusable: no other locking issues can be discovered, and all new
> bugs being introduced into the kernel are masked.
> Please fix asap.
> Thanks

Dropped the parthasarathy.bhuvaragan address from Cc since it bounces,
but this is highly likely due to:

commit 3f32d0be6c16b902b687453c962d17eea5b8ea19
Author: Parthasarathy Bhuvaragan
Date:   Tue Sep 25 22:09:10 2018 +0200

    tipc: lock wakeup & inputq at tipc_link_reset()


> WARNING: possible recursive locking detected
> 4.19.0-rc7+ #14 Not tainted
> --------------------------------------------
> swapper/0/1 is trying to acquire lock:
> 00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> 00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>
> but task is already holding lock:
> 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&(&list->lock)->rlock#4);
>   lock(&(&list->lock)->rlock#4);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 2 locks held by swapper/0/1:
>  #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at:
> register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
>  #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> spin_lock_bh include/linux/spinlock.h:334 [inline]
>  #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
>
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1af/0x295 lib/dump_stack.c:113
>  print_deadlock_bug kernel/locking/lockdep.c:1759 [inline]
>  check_deadlock kernel/locking/lockdep.c:1803 [inline]
>  validate_chain kernel/locking/lockdep.c:2399 [inline]
>  __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
>  lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
>  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
>  spin_lock_bh include/linux/spinlock.h:334 [inline]
>  tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>  tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
>  tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
>  tipc_init_net+0x472/0x610 net/tipc/core.c:82
>  ops_init+0xf7/0x520 net/core/net_namespace.c:129
>  __register_pernet_operations net/core/net_namespace.c:940 [inline]
>  register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
>  register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
>  tipc_init+0x83/0x104 net/tipc/core.c:140
>  do_one_initcall+0x109/0x70a init/main.c:885
>  do_initcall_level init/main.c:953 [inline]
>  do_initcalls init/main.c:961 [inline]
>  do_basic_setup init/main.c:979 [inline]
>  kernel_init_freeable+0x4bd/0x57f init/main.c:1144
>  kernel_init+0x13/0x180 init/main.c:1063
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
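
The report above comes down to one task taking two spinlocks of the same
lockdep class (the list->lock of two different queues) back to back with
no nesting annotation. A minimal userspace sketch of that kind of check
follows. It is a toy model, not lockdep itself: toy_lock, held_stack and
the toy_* helpers are hypothetical names, and the class id and subclass
are simplified stand-ins for what lockdep tracks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HELD 8

/* Toy model of a lock: the checker reasons about classes, not instances. */
struct toy_lock {
	int class_id;	/* shared by every lock initialised at the same site */
	int subclass;	/* nesting level, 0 unless explicitly annotated */
};

struct held_stack {
	const struct toy_lock *held[MAX_HELD];
	size_t depth;
};

/*
 * Returns true if taking 'lock' looks like recursive locking: a lock that
 * is already held has the same class and the same subclass.
 */
static bool toy_would_recurse(const struct held_stack *hs,
			      const struct toy_lock *lock)
{
	for (size_t i = 0; i < hs->depth; i++) {
		if (hs->held[i]->class_id == lock->class_id &&
		    hs->held[i]->subclass == lock->subclass)
			return true;
	}
	return false;
}

/* Records the lock as held; returns false if the check trips or the stack is full. */
static bool toy_acquire(struct held_stack *hs, const struct toy_lock *lock)
{
	if (hs->depth == MAX_HELD || toy_would_recurse(hs, lock))
		return false;
	hs->held[hs->depth++] = lock;
	return true;
}

/* Drops the most recently acquired lock. */
static void toy_release(struct held_stack *hs)
{
	assert(hs->depth > 0);
	hs->depth--;
}
```

In this model two distinct locks that share class_id and subclass trip the
check even though they are different objects, which matches the two
different lock addresses in the report. Giving the inner lock a different
subclass is what an annotation such as spin_lock_nested() expresses in the
kernel; whether that is the appropriate fix for tipc_link_reset() is not
established by this sketch.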


* [PATCH] virtio_net: enable tx after resuming from suspend
From: Ake Koomsin @ 2018-10-11  7:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: ake, Michael S. Tsirkin, David S. Miller, virtualization, netdev,
	linux-kernel

commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
disabled the virtio tx before going to suspend to avoid a use after free.
However, after resuming, it causes the virtio_net device to lose its
network connectivity.

To solve the issue, we need to enable tx after resuming.

Fixes: 713a98d90c5e ("virtio-net: serialize tx routine during reset")
Signed-off-by: Ake Koomsin <ake@igel.co.jp>
---
 drivers/net/virtio_net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dab504ec5e50..3453d80f5f81 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2256,6 +2256,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 	}
 
 	netif_device_attach(vi->dev);
+	netif_start_queue(vi->dev);
 	return err;
 }
 
-- 
2.19.1


* [PATCH net-next v2] net: mscc: allow extracting the FCS into the skb
From: Antoine Tenart @ 2018-10-11  7:12 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, netdev, linux-kernel, thomas.petazzoni,
	alexandre.belloni, quentin.schulz, allan.nielsen, f.fainelli,
	andrew

This patch adds support for the NETIF_F_RXFCS feature in the Mscc
Ethernet driver. This feature is disabled by default and allows a user
to request the driver not to drop the FCS and to extract it into the skb
for debugging purposes.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
---

Since v1:
  - Rebased on top of the latest net-next.

 drivers/net/ethernet/mscc/ocelot.c       | 2 +-
 drivers/net/ethernet/mscc/ocelot_board.c | 7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 8f11fdba8d0e..9b8a17ee3cb3 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1620,7 +1620,7 @@ int ocelot_probe_port(struct ocelot *ocelot, u8 port,
 	dev->ethtool_ops = &ocelot_ethtool_ops;
 	dev->switchdev_ops = &ocelot_port_switchdev_ops;
 
-	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
+	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_RXFCS;
 	dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
 	memcpy(dev->dev_addr, ocelot->base_mac, ETH_ALEN);
diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c
index 0cf0b0935b3b..4c23d18bbf44 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -128,11 +128,16 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 			len += sz;
 		} while (len < buf_len);
 
-		/* Read the FCS and discard it */
+		/* Read the FCS */
 		sz = ocelot_rx_frame_word(ocelot, grp, false, &val);
 		/* Update the statistics if part of the FCS was read before */
 		len -= ETH_FCS_LEN - sz;
 
+		if (unlikely(dev->features & NETIF_F_RXFCS)) {
+			buf = (u32 *)skb_put(skb, ETH_FCS_LEN);
+			*buf = val;
+		}
+
 		if (sz < 0) {
 			err = sz;
 			break;
-- 
2.17.1
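
The receive-path change above keeps the FCS only when the feature bit is
set and otherwise reads it and drops it. A small userspace sketch of that
decision, assuming a fixed-size frame buffer: toy_frame, toy_rx_finish and
TOY_F_RXFCS are hypothetical stand-ins for the skb, the handler logic and
NETIF_F_RXFCS, and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_FCS_LEN 4		/* same size as the kernel's ETH_FCS_LEN */
#define TOY_F_RXFCS (1u << 0)	/* hypothetical stand-in for NETIF_F_RXFCS */

struct toy_frame {
	uint8_t data[64];
	size_t len;		/* bytes of data[] currently in use */
};

/*
 * Appends the FCS word to the frame only when the feature bit is set.
 * Returns the number of bytes appended, or -1 if the buffer is too small.
 */
static int toy_rx_finish(struct toy_frame *f, uint32_t fcs,
			 unsigned int features)
{
	assert(f->len <= sizeof(f->data));

	if (!(features & TOY_F_RXFCS))
		return 0;	/* default: the FCS is read and dropped */
	if (f->len + TOY_FCS_LEN > sizeof(f->data))
		return -1;

	/* memcpy() sidesteps alignment concerns at an arbitrary offset */
	memcpy(f->data + f->len, &fcs, TOY_FCS_LEN);
	f->len += TOY_FCS_LEN;
	return TOY_FCS_LEN;
}
```

The sketch adds an explicit bounds check because it manages its own
buffer; in the patch the tailroom accounting is left to skb_put().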


* Re: [PATCH net-next] net: mscc: allow extracting the FCS into the skb
From: Antoine Tenart @ 2018-10-11  6:57 UTC (permalink / raw)
  To: David Miller
  Cc: f.fainelli, antoine.tenart, andrew, netdev, linux-kernel,
	thomas.petazzoni, alexandre.belloni, quentin.schulz,
	allan.nielsen
In-Reply-To: <20181010.102605.1416485422281413445.davem@davemloft.net>

Hi Florian, Dave,

On Wed, Oct 10, 2018 at 10:26:05AM -0700, David Miller wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
> Date: Wed, 10 Oct 2018 09:25:01 -0700
> > 
> > On October 10, 2018 7:46:31 AM PDT, Antoine Tenart <antoine.tenart@bootlin.com> wrote:
> >>
> >>@Dave, Florian: it seems to me no modification was requested after
> >>discussing those changes. Where do we stand regarding the patch?
> > 
> > Silence means agreement, thanks for your explanation, no more
> > questions or concerns on my side.
> 
> You'll probably need to resubmit the patch anew though.

I'll rebase the patch on top of net-next and send it again.

Thanks!
Antoine

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


* Re: [PATCH stable 4.9 00/29] backport of IP fragmentation fixes
From: Florian Fainelli @ 2018-10-10 23:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, davem, gregkh, stable, edumazet
In-Reply-To: <20181010160949.3cdc4fbc@xeon-e3>

On 10/10/2018 04:18 PM, Stephen Hemminger wrote:
> On Tue, 9 Oct 2018 21:15:04 -0700
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
>>>
>>> Strange, I do not see "ip: use rb trees for IP frag queue." in this list ?  
>>
>> And it was not in Stephen's backport to 4.14 either, wait, looks like it
>> was somehow squashed into "net: sk_buff rbnode reorg". Stephen, was
>> there a reason for that?
>>
>> Let me go back and add bffa72cf7f9df842f0016ba03586039296b4caaf as well
>> as eeea10b83a139451130df1594f26710c8fa390c8 to the rebase todo and see
>> how things go from there.
>>
>> Thanks for taking a look.
> 
> I don't remember, spent time doing cherry-pick and fixups. Maybe the reorg
> commit got squashed as part of one rebase.
> 

No worries, I ended up dropping these two commits because they would have
required cherry-picking "udp: copy skb->truesize in the first cache
line" which did not seem appropriate. I would appreciate it if you could
take a look at v2 and confirm that it looks good, in particular the
struct sk_buff layout.

Thanks!
-- 
Florian


* Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
From: Mauricio Vasquez @ 2018-10-10 22:50 UTC (permalink / raw)
  To: Song Liu; +Cc: Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <CAPhsuW4oDn-DiK1zAk_vcPBggB=mBJ6NO562CjVyyavYW7f_LQ@mail.gmail.com>



On 10/10/2018 05:34 PM, Song Liu wrote:
> On Wed, Oct 10, 2018 at 10:48 AM Mauricio Vasquez
> <mauricio.vasquez@polito.it> wrote:
>>
>>
>> On 10/10/2018 11:48 AM, Song Liu wrote:
>>> On Wed, Oct 10, 2018 at 7:06 AM Mauricio Vasquez B
>>> <mauricio.vasquez@polito.it> wrote:
>>>> The following patch implements bpf queue/stack maps that
>>>> provide the peek/pop/push functions.  There is no direct
>>>> relationship between those functions and the current map
>>>> syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added;
>>>> it is mapped to the pop operation in the queue/stack maps
>>>> and is still to be implemented in other kinds of maps.
>>>>
>>>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>>>> ---
>>>>    include/linux/bpf.h      |    1 +
>>>>    include/uapi/linux/bpf.h |    1 +
>>>>    kernel/bpf/syscall.c     |   82 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>    3 files changed, 84 insertions(+)
>>>>
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index 9b558713447f..5793f0c7fbb5 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>>> @@ -39,6 +39,7 @@ struct bpf_map_ops {
>>>>           void *(*map_lookup_elem)(struct bpf_map *map, void *key);
>>>>           int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
>>>>           int (*map_delete_elem)(struct bpf_map *map, void *key);
>>>> +       void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
>>>>
>>>>           /* funcs called by prog_array and perf_event_array map */
>>>>           void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index f9187b41dff6..3bb94aa2d408 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -103,6 +103,7 @@ enum bpf_cmd {
>>>>           BPF_BTF_LOAD,
>>>>           BPF_BTF_GET_FD_BY_ID,
>>>>           BPF_TASK_FD_QUERY,
>>>> +       BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>>>>    };
>>>>
>>>>    enum bpf_map_type {
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index f36c080ad356..6907d661dea5 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -980,6 +980,85 @@ static int map_get_next_key(union bpf_attr *attr)
>>>>           return err;
>>>>    }
>>>>
>>>> +#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>>>> +
>>>> +static int map_lookup_and_delete_elem(union bpf_attr *attr)
>>>> +{
>>>> +       void __user *ukey = u64_to_user_ptr(attr->key);
>>>> +       void __user *uvalue = u64_to_user_ptr(attr->value);
>>>> +       int ufd = attr->map_fd;
>>>> +       struct bpf_map *map;
>>>> +       void *key, *value, *ptr;
>>>> +       u32 value_size;
>>>> +       struct fd f;
>>>> +       int err;
>>>> +
>>>> +       if (CHECK_ATTR(BPF_MAP_LOOKUP_AND_DELETE_ELEM))
>>>> +               return -EINVAL;
>>>> +
>>>> +       f = fdget(ufd);
>>>> +       map = __bpf_map_get(f);
>>>> +       if (IS_ERR(map))
>>>> +               return PTR_ERR(map);
>>>> +
>>>> +       if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
>>>> +               err = -EPERM;
>>>> +               goto err_put;
>>>> +       }
>>>> +
>>>> +       key = __bpf_copy_key(ukey, map->key_size);
>>>> +       if (IS_ERR(key)) {
>>>> +               err = PTR_ERR(key);
>>>> +               goto err_put;
>>>> +       }
>>>> +
>>>> +       value_size = map->value_size;
>>>> +
>>>> +       err = -ENOMEM;
>>>> +       value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
>>>> +       if (!value)
>>>> +               goto free_key;
>>>> +
>>>> +       err = -EFAULT;
>>>> +       if (copy_from_user(value, uvalue, value_size) != 0)
>>>> +               goto free_value;
>>>> +
>>>> +       /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
>>>> +        * inside bpf map update or delete otherwise deadlocks are possible
>>>> +        */
>>>> +       preempt_disable();
>>>> +       __this_cpu_inc(bpf_prog_active);
>>>> +       if (map->ops->map_lookup_and_delete_elem) {
>>>> +               rcu_read_lock();
>>>> +               ptr = map->ops->map_lookup_and_delete_elem(map, key);
>>>> +               if (ptr)
>>>> +                       memcpy(value, ptr, value_size);
>>> I think we are exposed to a race condition with push and pop in parallel.
>>> map_lookup_and_delete_elem() only updates the head/tail, so it gives
>>> no protection for the buffer pointed to by ptr.
>> queue/stack maps do not use this 'ptr'; the pop operation directly
>> copies the value into the buffer allocated in map_lookup_and_delete_elem().
>> The copy from the queue/stack buffer into 'value' and the head/tail
>> update are protected by a spinlock in the queue/stack maps implementation.
>>
>> On the other hand, future implementations of the map_lookup_and_delete
>> operation in other kinds of maps should guarantee that the returned ptr is
>> rcu protected.
>>
>> Does it make sense to you?
> I reread the other patch, and found it does NOT use the following logic for
> queue and stack:
>
>                 rcu_read_lock();
>                 ptr = map->ops->map_lookup_and_delete_elem(map, key);
>                 if (ptr)
>                         memcpy(value, ptr, value_size);
>
> I guess this part is not used at all? Can we just remove it?
>
> Thanks,
> Song

This is the base code for map_lookup_and_delete support; it is not used
in queue/stack maps.

I think we can leave it there, so that whoever implements
lookup_and_delete for other maps does not also have to implement this.
>
>
>
>
>
>>>> +               rcu_read_unlock();
>>>> +               err = ptr ? 0 : -ENOENT;
>>>> +       } else {
>>>> +               err = -ENOTSUPP;
>>>> +       }
>>>> +
>>>> +       __this_cpu_dec(bpf_prog_active);
>>>> +       preempt_enable();
>>>> +
>>>> +       if (err)
>>>> +               goto free_value;
>>>> +
>>>> +       if (copy_to_user(uvalue, value, value_size) != 0)
>>>> +               goto free_value;
>>>> +
>>>> +       err = 0;
>>>> +
>>>> +free_value:
>>>> +       kfree(value);
>>>> +free_key:
>>>> +       kfree(key);
>>>> +err_put:
>>>> +       fdput(f);
>>>> +       return err;
>>>> +}
>>>> +
>>>>    static const struct bpf_prog_ops * const bpf_prog_types[] = {
>>>>    #define BPF_PROG_TYPE(_id, _name) \
>>>>           [_id] = & _name ## _prog_ops,
>>>> @@ -2453,6 +2532,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>>>>           case BPF_TASK_FD_QUERY:
>>>>                   err = bpf_task_fd_query(&attr, uattr);
>>>>                   break;
>>>> +       case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
>>>> +               err = map_lookup_and_delete_elem(&attr);
>>>> +               break;
>>>>           default:
>>>>                   err = -EINVAL;
>>>>                   break;
>>>>
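
The reply above describes pop as copying the value straight into the
caller's buffer, with the copy and the head/tail update done under one
spinlock, so no pointer into the queue storage escapes the lock. A
userspace sketch of that pattern, assuming a fixed-size ring of 64-bit
values and a C11 atomic_flag as the spinlock: the toy_* names and the
error codes chosen here are illustrative and are not taken from the
queue/stack map patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_QUEUE_SLOTS 4

struct toy_queue {
	atomic_flag lock;	/* toy spinlock */
	uint32_t head;		/* next slot to push into */
	uint32_t tail;		/* next slot to pop from */
	uint32_t count;		/* number of stored values */
	uint64_t slots[TOY_QUEUE_SLOTS];
};

static void toy_queue_init(struct toy_queue *q)
{
	q->head = q->tail = q->count = 0;
	atomic_flag_clear(&q->lock);
}

static void toy_lock(struct toy_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static int toy_push(struct toy_queue *q, uint64_t value)
{
	int err = 0;

	toy_lock(q);
	if (q->count == TOY_QUEUE_SLOTS) {
		err = -ENOSPC;
	} else {
		q->slots[q->head] = value;
		q->head = (q->head + 1) % TOY_QUEUE_SLOTS;
		q->count++;
	}
	toy_unlock(q);
	return err;
}

/* The copy-out and the tail update happen under the same lock. */
static int toy_pop(struct toy_queue *q, uint64_t *value)
{
	int err = 0;

	assert(value != NULL);
	toy_lock(q);
	if (q->count == 0) {
		err = -ENOENT;
	} else {
		memcpy(value, &q->slots[q->tail], sizeof(*value));
		q->tail = (q->tail + 1) % TOY_QUEUE_SLOTS;
		q->count--;
	}
	toy_unlock(q);
	return err;
}
```

Because toy_pop() never hands out a pointer into slots[], a concurrent
toy_push() cannot overwrite data the caller is still reading, which is the
concern raised about the generic 'ptr' path in the review.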


* Re: [PATCH net-next 0/4] Adds support of RSS to HNS3 Driver for Rev 2(=0x21) H/W
From: David Miller @ 2018-10-11  5:59 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20181010190537.20972-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Wed, 10 Oct 2018 20:05:33 +0100

> This patch-set mainly adds new additions related to RSS for the new
> hardware Revision 0x21. It also adds support to use RSS hash value
> provided by the hardware along with descriptor.

Series applied.


* Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
From: Song Liu @ 2018-10-10 22:34 UTC (permalink / raw)
  To: mauricio.vasquez; +Cc: Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <bd36a892-c73a-f257-1e97-2c9a45780659@polito.it>

On Wed, Oct 10, 2018 at 10:48 AM Mauricio Vasquez
<mauricio.vasquez@polito.it> wrote:
>
>
>
> On 10/10/2018 11:48 AM, Song Liu wrote:
> > On Wed, Oct 10, 2018 at 7:06 AM Mauricio Vasquez B
> > <mauricio.vasquez@polito.it> wrote:
> >> The following patch implements bpf queue/stack maps that
> >> provide the peek/pop/push functions.  There is no direct
> >> relationship between those functions and the current map
> >> syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added;
> >> it is mapped to the pop operation in the queue/stack maps
> >> and is still to be implemented in other kinds of maps.
> >>
> >> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> >> ---
> >>   include/linux/bpf.h      |    1 +
> >>   include/uapi/linux/bpf.h |    1 +
> >>   kernel/bpf/syscall.c     |   82 ++++++++++++++++++++++++++++++++++++++++++++++
> >>   3 files changed, 84 insertions(+)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 9b558713447f..5793f0c7fbb5 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -39,6 +39,7 @@ struct bpf_map_ops {
> >>          void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> >>          int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
> >>          int (*map_delete_elem)(struct bpf_map *map, void *key);
> >> +       void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
> >>
> >>          /* funcs called by prog_array and perf_event_array map */
> >>          void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index f9187b41dff6..3bb94aa2d408 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -103,6 +103,7 @@ enum bpf_cmd {
> >>          BPF_BTF_LOAD,
> >>          BPF_BTF_GET_FD_BY_ID,
> >>          BPF_TASK_FD_QUERY,
> >> +       BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >>   };
> >>
> >>   enum bpf_map_type {
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index f36c080ad356..6907d661dea5 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -980,6 +980,85 @@ static int map_get_next_key(union bpf_attr *attr)
> >>          return err;
> >>   }
> >>
> >> +#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> >> +
> >> +static int map_lookup_and_delete_elem(union bpf_attr *attr)
> >> +{
> >> +       void __user *ukey = u64_to_user_ptr(attr->key);
> >> +       void __user *uvalue = u64_to_user_ptr(attr->value);
> >> +       int ufd = attr->map_fd;
> >> +       struct bpf_map *map;
> >> +       void *key, *value, *ptr;
> >> +       u32 value_size;
> >> +       struct fd f;
> >> +       int err;
> >> +
> >> +       if (CHECK_ATTR(BPF_MAP_LOOKUP_AND_DELETE_ELEM))
> >> +               return -EINVAL;
> >> +
> >> +       f = fdget(ufd);
> >> +       map = __bpf_map_get(f);
> >> +       if (IS_ERR(map))
> >> +               return PTR_ERR(map);
> >> +
> >> +       if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
> >> +               err = -EPERM;
> >> +               goto err_put;
> >> +       }
> >> +
> >> +       key = __bpf_copy_key(ukey, map->key_size);
> >> +       if (IS_ERR(key)) {
> >> +               err = PTR_ERR(key);
> >> +               goto err_put;
> >> +       }
> >> +
> >> +       value_size = map->value_size;
> >> +
> >> +       err = -ENOMEM;
> >> +       value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
> >> +       if (!value)
> >> +               goto free_key;
> >> +
> >> +       err = -EFAULT;
> >> +       if (copy_from_user(value, uvalue, value_size) != 0)
> >> +               goto free_value;
> >> +
> >> +       /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
> >> +        * inside bpf map update or delete otherwise deadlocks are possible
> >> +        */
> >> +       preempt_disable();
> >> +       __this_cpu_inc(bpf_prog_active);
> >> +       if (map->ops->map_lookup_and_delete_elem) {
> >> +               rcu_read_lock();
> >> +               ptr = map->ops->map_lookup_and_delete_elem(map, key);
> >> +               if (ptr)
> >> +                       memcpy(value, ptr, value_size);
> > I think we are exposed to a race condition with push and pop in parallel.
> > map_lookup_and_delete_elem() only updates the head/tail, so it gives
> > no protection for the buffer pointed to by ptr.
>
> queue/stack maps do not use this 'ptr'; the pop operation directly
> copies the value into the buffer allocated in map_lookup_and_delete_elem().
> The copy from the queue/stack buffer into 'value' and the head/tail
> update are protected by a spinlock in the queue/stack maps implementation.
>
> On the other hand, future implementations of the map_lookup_and_delete
> operation in other kinds of maps should guarantee that the returned ptr is
> rcu protected.
>
> Does it make sense to you?

I reread the other patch, and found it does NOT use the following logic for
queue and stack:

               rcu_read_lock();
               ptr = map->ops->map_lookup_and_delete_elem(map, key);
               if (ptr)
                       memcpy(value, ptr, value_size);

I guess this part is not used at all? Can we just remove it?

Thanks,
Song






> >
> >> +               rcu_read_unlock();
> >> +               err = ptr ? 0 : -ENOENT;
> >> +       } else {
> >> +               err = -ENOTSUPP;
> >> +       }
> >> +
> >> +       __this_cpu_dec(bpf_prog_active);
> >> +       preempt_enable();
> >> +
> >> +       if (err)
> >> +               goto free_value;
> >> +
> >> +       if (copy_to_user(uvalue, value, value_size) != 0)
> >> +               goto free_value;
> >> +
> >> +       err = 0;
> >> +
> >> +free_value:
> >> +       kfree(value);
> >> +free_key:
> >> +       kfree(key);
> >> +err_put:
> >> +       fdput(f);
> >> +       return err;
> >> +}
> >> +
> >>   static const struct bpf_prog_ops * const bpf_prog_types[] = {
> >>   #define BPF_PROG_TYPE(_id, _name) \
> >>          [_id] = & _name ## _prog_ops,
> >> @@ -2453,6 +2532,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >>          case BPF_TASK_FD_QUERY:
> >>                  err = bpf_task_fd_query(&attr, uattr);
> >>                  break;
> >> +       case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> >> +               err = map_lookup_and_delete_elem(&attr);
> >> +               break;
> >>          default:
> >>                  err = -EINVAL;
> >>                  break;
> >>
>


* Re: [PATCH -next] phy: phy-ocelot-serdes: fix return value check in serdes_probe()
From: David Miller @ 2018-10-11  5:54 UTC (permalink / raw)
  To: weiyongjun1; +Cc: kishon, quentin.schulz, linux-kernel, kernel-janitors, netdev
In-Reply-To: <1539136824-44940-1-git-send-email-weiyongjun1@huawei.com>

From: Wei Yongjun <weiyongjun1@huawei.com>
Date: Wed, 10 Oct 2018 02:00:24 +0000

> In case of error, the function syscon_node_to_regmap() returns ERR_PTR()
> and never returns NULL. The NULL test in the return value check should
> be replaced with IS_ERR().
> 
> Fixes: 51f6b410fc22 ("phy: add driver for Microsemi Ocelot SerDes muxing")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

Applied.
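
The fix quoted above rests on the error-pointer idiom: a function that
reports failure through ERR_PTR() never returns NULL, so a NULL test
lets the error through. A userspace sketch of the idiom, modelled on the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers: the toy_* names are
hypothetical, and toy_lookup() merely stands in for a function with the
same return convention as syscon_node_to_regmap().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *toy_err_ptr(long error)
{
	assert(error < 0 && error >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical lookup that returns an error pointer, never NULL, on failure. */
static void *toy_lookup(bool fail)
{
	static int regmap_stub;

	return fail ? toy_err_ptr(-ENODEV) : (void *)&regmap_stub;
}

/* Probe-style caller using the check the patch switches to. */
static int toy_probe(bool fail)
{
	void *map = toy_lookup(fail);

	if (toy_is_err(map))	/* a "!map" test would miss this failure */
		return (int)toy_ptr_err(map);
	return 0;
}
```

With the NULL test, the failing lookup would be treated as success and the
error pointer would be dereferenced later.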


* KASAN: use-after-free Read in __llc_lookup_established
From: syzbot @ 2018-10-11  5:50 UTC (permalink / raw)
  To: davem, keescook, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    3d647e62686f Merge tag 's390-4.19-4' of git://git.kernel.o..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1707d809400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
dashboard link: https://syzkaller.appspot.com/bug?extid=11e05f04c15e03be5254
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+11e05f04c15e03be5254@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in llc_estab_match net/llc/llc_conn.c:494 [inline]
BUG: KASAN: use-after-free in __llc_lookup_established+0xc80/0xe10 net/llc/llc_conn.c:522
Read of size 1 at addr ffff8801c5794a7f by task syz-executor3/10277

CPU: 0 PID: 10277 Comm: syz-executor3 Not tainted 4.19.0-rc7+ #55
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
net_ratelimit: 9 callbacks suppressed
openvswitch: netlink: Key type 12288 is out of range max 29
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  llc_estab_match net/llc/llc_conn.c:494 [inline]
  __llc_lookup_established+0xc80/0xe10 net/llc/llc_conn.c:522
openvswitch: netlink: Key type 12288 is out of range max 29
  llc_lookup_established+0x36/0x60 net/llc/llc_conn.c:554
  llc_ui_bind+0x810/0xdd0 net/llc/af_llc.c:381
  __sys_bind+0x331/0x440 net/socket.c:1483
  __do_sys_bind net/socket.c:1494 [inline]
  __se_sys_bind net/socket.c:1492 [inline]
  __x64_sys_bind+0x73/0xb0 net/socket.c:1492
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457579
Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2a18100c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579
RDX: 0000000000000010 RSI: 0000000020000040 RDI: 0000000000000006
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2a181016d4
R13: 00000000004bd718 R14: 00000000004cbfe0 R15: 00000000ffffffff

Allocated by task 10278:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
  __do_kmalloc mm/slab.c:3718 [inline]
  __kmalloc+0x14e/0x760 mm/slab.c:3727
  kmalloc include/linux/slab.h:518 [inline]
  sk_prot_alloc+0x1b0/0x2e0 net/core/sock.c:1468
  sk_alloc+0x10d/0x1690 net/core/sock.c:1522
  llc_sk_alloc+0x35/0x4b0 net/llc/llc_conn.c:949
  llc_ui_create+0x142/0x520 net/llc/af_llc.c:173
  __sock_create+0x536/0x930 net/socket.c:1277
  sock_create net/socket.c:1317 [inline]
  __sys_socket+0x106/0x260 net/socket.c:1347
  __do_sys_socket net/socket.c:1356 [inline]
  __se_sys_socket net/socket.c:1354 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1354
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 10276:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xcf/0x230 mm/slab.c:3813
  sk_prot_free net/core/sock.c:1505 [inline]
  __sk_destruct+0x797/0xa80 net/core/sock.c:1587
  sk_destruct+0x78/0x90 net/core/sock.c:1595
  __sk_free+0xcf/0x300 net/core/sock.c:1606
  sk_free+0x42/0x50 net/core/sock.c:1617
  sock_put include/net/sock.h:1691 [inline]
  llc_sk_free+0x9d/0xb0 net/llc/llc_conn.c:1017
  llc_ui_release+0x161/0x2a0 net/llc/af_llc.c:218
  __sock_release+0xd7/0x250 net/socket.c:579
  sock_close+0x19/0x20 net/socket.c:1141
  __fput+0x385/0xa30 fs/file_table.c:278
  ____fput+0x15/0x20 fs/file_table.c:309
  task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:193 [inline]
  exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
  do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801c5794600
  which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 1151 bytes inside of
  2048-byte region [ffff8801c5794600, ffff8801c5794e00)
The buggy address belongs to the page:
page:ffffea000715e500 count:1 mapcount:0 mapping:ffff8801da800c40 index:0x0 compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffffea00075c0d88 ffffea0006f48908 ffff8801da800c40
raw: 0000000000000000 ffff8801c5794600 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801c5794900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801c5794980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801c5794a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                 ^
  ffff8801c5794a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801c5794b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.


* Re: fore200e DMA cleanups and fixes
From: David Miller @ 2018-10-11  5:39 UTC (permalink / raw)
  To: hch; +Cc: 3chas3, netdev, linux-atm-general, linux-kernel
In-Reply-To: <20181009145720.32578-1-hch@lst.de>

From: Christoph Hellwig <hch@lst.de>
Date: Tue,  9 Oct 2018 16:57:13 +0200

> The fore200e driver came up during some dma-related audits, so
> here is the fallout.  Compile tested (x86 & sparc) only.

Series applied to net-next.


* Re: [PATCH net-next v7 28/28] net: WireGuard secure network tunnel
From: Jiri Pirko @ 2018-10-11  5:36 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML, Netdev, David Miller, Greg Kroah-Hartman
In-Reply-To: <CAHmME9o_XjDXqRTGSxAjdNj9RXaHwgC_ZFYLUB8uD6mWvj=aEw@mail.gmail.com>

Wed, Oct 10, 2018 at 10:27:46PM CEST, Jason@zx2c4.com wrote:
>Hey Jiri,
>
>Actually, in the end I went with the suggestion from Andrew and Lukas,
>which is to follow Dan's guideline:
>https://lkml.org/lkml/2016/8/22/374 . It looks like this:
>
>https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/drivers/net/wireguard/device.c?h=jd/wireguard#n280

I prefer:
                 err = do_something();
                 if (err)
                         goto err_do_something;

But your style is also quite common. Up to you, I guess.
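
A minimal sketch of the style preferred above, where each label is named after the call that failed. The names demo_dev, demo_alloc_ring and demo_dev_init are hypothetical and are not taken from the WireGuard patch; the sketch only shows how the labels unwind in reverse order.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resource, used only for illustration. */
struct demo_dev {
	int *rx_ring;
	int *tx_ring;
};

/* Hypothetical setup step: fails with -ENOMEM when asked to. */
static int demo_alloc_ring(int **ring, int fail)
{
	if (fail)
		return -ENOMEM;
	*ring = calloc(16, sizeof(**ring));
	return *ring ? 0 : -ENOMEM;
}

/* Each label is named after the step that failed and undoes only the
 * steps that completed before it.
 */
static int demo_dev_init(struct demo_dev *dev, int fail_rx, int fail_tx)
{
	int err;

	assert(dev);
	dev->rx_ring = NULL;
	dev->tx_ring = NULL;

	err = demo_alloc_ring(&dev->rx_ring, fail_rx);
	if (err)
		goto err_alloc_rx;

	err = demo_alloc_ring(&dev->tx_ring, fail_tx);
	if (err)
		goto err_alloc_tx;

	return 0;

err_alloc_tx:
	free(dev->rx_ring);
	dev->rx_ring = NULL;
err_alloc_rx:
	return err;
}
```

With this naming, the goto target can be checked against the call directly above it, at the cost of labels that do not say what they free.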

>
>Jason

^ permalink raw reply

* Re: [PATCH net-next V3] virtio_net: ethtool tx napi configuration
From: David Miller @ 2018-10-11  5:34 UTC (permalink / raw)
  To: jasowang; +Cc: mst, virtualization, netdev, linux-kernel, willemb
In-Reply-To: <20181009020626.31723-1-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue,  9 Oct 2018 10:06:26 +0800

> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> Interrupt moderation is currently not supported, so these accept and
> display the default settings of 0 usec and 1 frame.
> 
> Toggle tx napi through setting tx-frames. So as to not interfere
> with possible future interrupt moderation, value 1 means tx napi while
> value 0 means not.
> 
> Only allow the switching when device is down for simplicity.
> 
> Link: https://patchwork.ozlabs.org/patch/948149/
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from V2:
> - only allow the switching when device is down
> - remove unnecessary global variable and initialization
> Changes from V1:
> - try to synchronize with datapath to allow changing mode when
>   interface is up.
> - use tx-frames 0 to disable tx napi and tx-frames 1 to enable it

Applied, with...

> +	bool running = netif_running(dev);

this unused variable removed.
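
The toggle rule from the quoted commit message can be modelled in user space roughly as follows. This is a sketch under assumptions, not the virtio_net code: struct demo_nic, struct demo_coalesce and demo_set_coalesce are hypothetical, and the choice of -EINVAL and -EBUSY as error codes is a guess at what a driver would plausibly return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver state; not the virtio_net structures. */
struct demo_coalesce {
	unsigned int rx_max_frames;
	unsigned int tx_max_frames;
};

struct demo_nic {
	bool running;	/* interface is up */
	bool tx_napi;	/* current tx napi mode */
};

static int demo_set_coalesce(struct demo_nic *nic,
			     const struct demo_coalesce *ec)
{
	bool want_napi;

	assert(nic && ec);

	/* Assumption: rx must stay at the default of 1 frame, and
	 * tx-frames may only be 0 (no tx napi) or 1 (tx napi).
	 */
	if (ec->rx_max_frames != 1 || ec->tx_max_frames > 1)
		return -EINVAL;

	want_napi = ec->tx_max_frames == 1;
	if (want_napi == nic->tx_napi)
		return 0;	/* nothing to change */

	/* Switching is only allowed while the device is down. */
	if (nic->running)
		return -EBUSY;

	nic->tx_napi = want_napi;
	return 0;
}
```

Reserving values above 1 leaves room for real interrupt moderation later without changing the meaning of 0 and 1.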

^ permalink raw reply

* Re: netconsole warning in 4.19.0-rc7
From: Cong Wang @ 2018-10-11  5:34 UTC (permalink / raw)
  To: Meelis Roos; +Cc: LKML, Linux Kernel Network Developers, Dave Jones
In-Reply-To: <e0611057-1b5b-9990-b9fa-060a9b5eba40@linux.ee>

(Cc'ing Dave)

On Wed, Oct 10, 2018 at 5:14 AM Meelis Roos <mroos@linux.ee> wrote:
>
> Tried 4.19-rc7 on a bunch of test machines and got this warning from one.
> It is reproducible and I have not noticed it before.
>
[...]
> [    9.914805] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:168 __local_bh_enable_ip+0x2e/0x44
> [    9.914806] Modules linked in:
> [    9.914808] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc7 #210
> [    9.914810] Hardware name: MicroLink                       /D850MV                         , BIOS MV85010A.86A.0067.P24.0304081124 04/08/2003
> [    9.914811] EIP: __local_bh_enable_ip+0x2e/0x44
> [    9.914813] Code: cc 02 5f c8 a9 00 00 0f 00 75 1f 83 ea 01 f7 da 01 15 cc 02 5f c8 a1 cc 02 5f c8 a9 00 ff 1f 00 74 0c ff 0d cc 02 5f c8 5d c3 <0f> 0b eb dd 66 a1 80 cd 5e c8 66 85 c0 74 e9 e8 87 ff ff ff eb e2
> [    9.914814] EAX: 80010200 EBX: f602b000 ECX: 36346270 EDX: 00000200
> [    9.914815] ESI: f620ecc0 EDI: f620ebac EBP: f600de40 ESP: f600de40
> [    9.914816] DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 EFLAGS: 00010006
> [    9.914817] CR0: 80050033 CR2: b7f5f000 CR3: 36389000 CR4: 000006d0
> [    9.914818] Call Trace:
> [    9.914819]  <IRQ>
> [    9.914820]  netpoll_send_skb_on_dev+0xa5/0x1b0

This is exactly what I mentioned in my review here:
https://marc.info/?l=linux-netdev&m=153816136624679&w=2

"But irq is disabled here, so not sure if rcu_read_lock_bh()
could cause trouble... "

^ permalink raw reply

* Re: [PATCH] isdn/hisax: amd7930_fn: Remove unnecessary parentheses
From: David Miller @ 2018-10-11  5:29 UTC (permalink / raw)
  To: natechancellor; +Cc: isdn, netdev, linux-kernel
In-Reply-To: <20181008225905.24214-1-natechancellor@gmail.com>

From: Nathan Chancellor <natechancellor@gmail.com>
Date: Mon,  8 Oct 2018 15:59:05 -0700

> Clang warns when multiple sets of parentheses are used for a single
> conditional statement.
> 
> drivers/isdn/hisax/amd7930_fn.c:628:32: warning: equality comparison
> with extraneous parentheses [-Wparentheses-equality]
>                 if ((cs->dc.amd7930.ph_state == 8)) {
>                      ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
> drivers/isdn/hisax/amd7930_fn.c:628:32: note: remove extraneous
> parentheses around the comparison to silence this warning
>                 if ((cs->dc.amd7930.ph_state == 8)) {
>                     ~                        ^   ~
> drivers/isdn/hisax/amd7930_fn.c:628:32: note: use '=' to turn this
> equality comparison into an assignment
>                 if ((cs->dc.amd7930.ph_state == 8)) {
>                                              ^~
>                                              =
> 1 warning generated.
> 
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 00/10] rxrpc: Fix packet reception code
From: David Miller @ 2018-10-11  5:28 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, pabeni, eric.dumazet, linux-afs, linux-kernel
In-Reply-To: <153903883882.17944.17642727588248415623.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Mon, 08 Oct 2018 23:47:18 +0100

> Here is a set of patches that prepare for and fix problems in rxrpc's
> packet reception code.  The serious problems are:
 ...
> The second patch fixes (A) - (C); the third patch renders (B) and (C)
> non-issues by using the encap_rcv hook instead of data_ready - and the
> final patch fixes (D).  That last is the most complex.
> 
> The preparatory patches are:
 ...
> And then there are three main patches - note that these are mixed in with
> the preparatory patches somewhat:
 ...
> The patches are tagged here:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-fixes-20181008

Pulled, thanks David.

^ permalink raw reply

* Re: [PATCH] net: aquantia: remove some redundant variable initializations
From: David Miller @ 2018-10-11  5:18 UTC (permalink / raw)
  To: colin.king; +Cc: igor.russkikh, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20181008133558.5841-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Mon,  8 Oct 2018 14:35:58 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> There are several variables being initialized that are being set later
> and hence the initialization is redundant and can be removed. Remove
> them.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net-next v5] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command
From: Vijay Khemka @ 2018-10-10 21:30 UTC (permalink / raw)
  To: Justin.Lee1@Dell.com, sam@mendozajonas.com, joel@jms.id.au
  Cc: linux-aspeed@lists.ozlabs.org, netdev@vger.kernel.org,
	openbmc@lists.ozlabs.org, Amithash Prasad, christian@cmd.nu
In-Reply-To: <245b32d9a3dd463f9334b9704b639ec4@AUSX13MPS302.AMER.DELL.COM>



On 10/10/18, 11:12 AM, "Justin.Lee1@Dell.com" <Justin.Lee1@Dell.com> wrote:

    The new command (NCSI_CMD_SEND_CMD) is added to allow user space application
    to send NC-SI command to the network card.
    Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and response.
    
    The work flow is as below. 
    
    Request:
    User space application
    	-> Netlink interface (msg)
    	-> new Netlink handler - ncsi_send_cmd_nl()
    	-> ncsi_xmit_cmd()
    
    Response:
    Response received - ncsi_rcv_rsp()
    	-> internal response handler - ncsi_rsp_handler_xxx()
    	-> ncsi_rsp_handler_netlink()
    	-> ncsi_send_netlink_rsp ()
    	-> Netlink interface (msg)
    	-> user space application
    
    Command timeout - ncsi_request_timeout()
    	-> ncsi_send_netlink_timeout ()
    	-> Netlink interface (msg with zero data length)
    	-> user space application
    
    Error:
    Error detected
    	-> ncsi_send_netlink_err ()
    	-> Netlink interface (err msg)
    	-> user space application
    
    
    Signed-off-by: Justin Lee <justin.lee1@dell.com> 

Reviewed-by: Vijay Khemka <vijaykhemka@fb.com>
  


^ permalink raw reply

* [PATCH net v3] net/sched: cls_api: add missing validation of netlink attributes
From: Davide Caratti @ 2018-10-10 20:00 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Jamal Hadi Salim; +Cc: netdev

Similarly to what has been done in 8b4c3cdd9dd8 ("net: sched: Add policy
validation for tc attributes"), fix classifier code to add validation of
TCA_CHAIN and TCA_KIND netlink attributes.

tested with:
 # ./tdc.py -c filter

v2: Let sch_api and cls_api share nla_policy they have in common, thanks
    to David Ahern.
v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
    by TC modules, thanks to Cong Wang.
    While at it, restore the 'Delete / get qdisc' comment to its original
    position, just above tc_get_qdisc() function prototype.

Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/cls_api.c | 13 ++++++++-----
 net/sched/sch_api.c |  8 ++++----
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0a75cb2e5e7b..70f144ac5e1d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -31,6 +31,8 @@
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
 
+extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
+
 /* The list of all installed classifier types */
 static LIST_HEAD(tcf_proto_base);
 
@@ -1211,7 +1213,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 replay:
 	tp_created = 0;
 
-	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
@@ -1360,7 +1362,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
-	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
@@ -1475,7 +1477,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	void *fh = NULL;
 	int err;
 
-	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
@@ -1838,7 +1840,7 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 		return -EPERM;
 
 replay:
-	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
@@ -1949,7 +1951,8 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 	if (nlmsg_len(cb->nlh) < sizeof(*tcm))
 		return skb->len;
 
-	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
+	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, rtm_tca_policy,
+			  NULL);
 	if (err)
 		return err;
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 85e73f48e48f..6684641ea344 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1307,10 +1307,6 @@ check_loop_fn(struct Qdisc *q, unsigned long cl, struct qdisc_walker *w)
 	return 0;
 }
 
-/*
- * Delete/get qdisc.
- */
-
 const struct nla_policy rtm_tca_policy[TCA_MAX + 1] = {
 	[TCA_KIND]		= { .type = NLA_STRING },
 	[TCA_OPTIONS]		= { .type = NLA_NESTED },
@@ -1323,6 +1319,10 @@ const struct nla_policy rtm_tca_policy[TCA_MAX + 1] = {
 	[TCA_EGRESS_BLOCK]	= { .type = NLA_U32 },
 };
 
+/*
+ * Delete/get qdisc.
+ */
+
 static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
-- 
2.17.1
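
For readers unfamiliar with policy-driven attribute validation, the effect of replacing the NULL policy argument with rtm_tca_policy can be sketched with a simplified user-space model. Everything here is hypothetical (demo_attr, demo_policy, demo_validate, the DEMO_* constants and the -ERANGE return); it does not reproduce struct nla_policy or the exact checks nlmsg_parse() performs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the kernel's struct nla_policy. */
enum demo_type { DEMO_UNSPEC, DEMO_U32, DEMO_STRING };
enum { DEMO_ATTR_KIND = 1, DEMO_ATTR_CHAIN = 2, DEMO_ATTR_MAX = 2 };

struct demo_attr {
	unsigned int type;
	size_t len;
	const void *data;
};

/* One expected type per attribute, indexed by attribute number. */
static const enum demo_type demo_policy[DEMO_ATTR_MAX + 1] = {
	[DEMO_ATTR_KIND]  = DEMO_STRING,
	[DEMO_ATTR_CHAIN] = DEMO_U32,
};

static int demo_validate(const struct demo_attr *attrs, size_t n,
			 const enum demo_type *policy)
{
	size_t i;

	assert(attrs || n == 0);
	for (i = 0; i < n; i++) {
		const struct demo_attr *a = &attrs[i];

		if (a->type > DEMO_ATTR_MAX)
			continue;	/* unknown attributes are skipped */
		if (!policy)
			continue;	/* no policy: nothing is checked */
		switch (policy[a->type]) {
		case DEMO_U32:
			if (a->len < sizeof(uint32_t))
				return -ERANGE;
			break;
		case DEMO_STRING:
			if (a->len == 0)
				return -ERANGE;
			break;
		default:
			break;
		}
	}
	return 0;
}
```

In this model a two-byte chain attribute passes when the policy is NULL and is rejected once a policy is supplied, which is the class of malformed input the patch is meant to catch before later code reads the attribute as a full u32.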

^ permalink raw reply related

* Re: [PATCH bpf-next] bpf: emit audit messages upon successful prog load and unload
From: Alexei Starovoitov @ 2018-10-10 19:53 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, ast, netdev, Jiri Olsa,
	acme
In-Reply-To: <20181008115740.GA17355@krava>

On Mon, Oct 08, 2018 at 01:57:40PM +0200, Jiri Olsa wrote:
> 
> I check that discussion and it's related only to bpf program load/unload,
> is there any plan to also notify about bpf program attachment?
> 
> in the step 2 you described:
> 
>   step 2 (future work)
>   single event for bpf prog_load with prog_id only.
>   Either via perf ring buffer or ftrace or tracepoints or some
>   other notification mechanism.
> 
> would you see this to be feasible also for bpf prog attachment notification?

I agree that at first glance ring buffer notifications
for program attachment sound useful, but I have a hard time
seeing how we can build a complete solution on top of them.
Progs can be attached via netlink, perf_event ioctl, or the bpf syscall.
Theoretically we can insert ring buffer events in all those places,
but there is no common format we can use. In networking cases
ifindex alone won't be enough, since there is xdp vs tc vs lwt vs etc.
Furthermore, a program can run without being attached to anything.
For example, folks using xdp often load a mini bpf prog that dispatches
all other progs via prog_array using the tail_call mechanism.
So to execute a newly loaded program, user space only needs to store its FD
into prog_array.
Would you want to add notifications for all map updates then as well?
What would be the format of such a notification? map_id and slot index?
But how will the audit daemon know that this particular map is used
for running and under what conditions?
A single bpf dispatcher program can use multiple prog_arrays.
It seems to me that attach notifications are not a practical way
to introspect the "bpf program execution graph" in the kernel.
I suggest taking a look at bpftool. It can already inspect networking,
cgroup, and tracing progs and show what programs are loaded and where
they are attached. I think improving the bpftool style of introspection
is more practical than inventing notifications for everything.

^ permalink raw reply

* [PATCH stable 4.9 v2 29/29] ipv4: frags: precedence bug in ip_expire()
From: Florian Fainelli @ 2018-10-10 19:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, gregkh, stable, edumazet, sthemmin, Dan Carpenter
In-Reply-To: <20181010193017.25221-1-f.fainelli@gmail.com>

From: Dan Carpenter <dan.carpenter@oracle.com>

(commit 70837ffe3085c9a91488b52ca13ac84424da1042 upstream)

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 09565b14ba6b..cc8c6ac84d08 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -203,7 +203,7 @@ static void ip_expire(unsigned long arg)
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!qp->q.flags & INET_FRAG_FIRST_IN)
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN))
 		goto out;
 
 	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
-- 
2.17.1
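
The precedence issue fixed above can be reproduced in isolation. In this sketch the flag values are assumed for illustration and the demo_* names are hypothetical; the first helper spells out with explicit parentheses how the unparenthesized test is parsed.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; assumed values, named after the kernel's flags. */
#define DEMO_FRAG_FIRST_IN	0x1u
#define DEMO_FRAG_LAST_IN	0x2u

/* How the unparenthesized test parses: '!' applies to flags first, so
 * the result depends on whether *any* flag is set.
 */
static bool demo_first_missing_buggy(unsigned int flags)
{
	return ((!flags) & DEMO_FRAG_FIRST_IN) != 0;
}

/* The intended test: is the FIRST_IN bit clear? */
static bool demo_first_missing_fixed(unsigned int flags)
{
	assert((flags & ~(DEMO_FRAG_FIRST_IN | DEMO_FRAG_LAST_IN)) == 0);
	return !(flags & DEMO_FRAG_FIRST_IN);
}
```

The two helpers agree when no flag is set, and disagree when only the last fragment has arrived: the buggy form then reports that the first fragment is present.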

^ permalink raw reply related

* [PATCH stable 4.9 v2 28/29] ip: frags: fix crash in ip_do_fragment()
From: Florian Fainelli @ 2018-10-10 19:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, gregkh, stable, edumazet, sthemmin, Taehee Yoo
In-Reply-To: <20181010193017.25221-1-f.fainelli@gmail.com>

From: Taehee Yoo <ap420073@gmail.com>

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set, but skb->sk and
skb->ip_defrag_offset are members of the same union, so
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
a defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_fragment.c                  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9961b2102555..09565b14ba6b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -597,6 +597,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			nextp = &fp->next;
 			fp->prev = NULL;
 			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			fp->sk = NULL;
 			head->data_len += fp->len;
 			head->len += fp->len;
 			if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 907c2d5753dd..b9147558a8f2 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -452,6 +452,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
+		fp->sk = NULL;
 	}
 	sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.17.1
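
The aliasing described in the commit message can be shown with a miniature, hypothetical layout. struct demo_skb and the demo_* helpers are illustration only and are far smaller than the real sk_buff; reading sk after writing ip_defrag_offset relies on the union-based reinterpretation that mainstream compilers provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sock {
	int unused;
};

/* Hypothetical miniature of the relevant part of the layout: the socket
 * pointer and the defrag offset share storage.
 */
struct demo_skb {
	union {
		struct demo_sock *sk;
		int ip_defrag_offset;
	};
};

/* Models the defrag path: the skb is orphaned, then the offset is stored
 * in the same storage that held sk.
 */
static void demo_defrag_queue(struct demo_skb *skb, int offset)
{
	assert(skb);
	skb->sk = NULL;
	skb->ip_defrag_offset = offset;
}

/* Models the fix: clear sk again when the fragment is reassembled. */
static void demo_reasm_clear(struct demo_skb *skb)
{
	skb->sk = NULL;
}

/* Models the kind of check that trips in the fragmentation path. */
static bool demo_has_owner(const struct demo_skb *skb)
{
	return skb->sk != NULL;
}
```

A non-zero offset left in the union makes the fragment look as if it had an owning socket until it is cleared.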

^ permalink raw reply related

* [PATCH stable 4.9 v2 27/29] ip: process in-order fragments efficiently
From: Florian Fainelli @ 2018-10-10 19:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, gregkh, stable, edumazet, sthemmin, Peter Oskolkov,
	Florian Westphal
In-Reply-To: <20181010193017.25221-1-f.fainelli@gmail.com>

From: Peter Oskolkov <posk@google.com>

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ++++++++++++++++++++++++---------------
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 535fa57af51e..8323d33c0ce2 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 			fp = xp;
 		} while (fp);
 	} else {
-		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+		sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5e2121b82588..9961b2102555 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -125,8 +125,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+			 struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -217,7 +217,12 @@ static void ip_expire(unsigned long arg)
 		head = skb_rb_first(&qp->q.rb_fragments);
 		if (!head)
 			goto out;
-		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		if (FRAG_CB(head)->next_frag)
+			rb_replace_node(&head->rbnode,
+					&FRAG_CB(head)->next_frag->rbnode,
+					&qp->q.rb_fragments);
+		else
+			rb_erase(&head->rbnode, &qp->q.rb_fragments);
 		memset(&head->rbnode, 0, sizeof(head->rbnode));
 		barrier();
 	}
@@ -318,7 +323,7 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
@@ -327,6 +332,7 @@ static int ip_frag_reinit(struct ipq *qp)
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
 
@@ -338,7 +344,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct rb_node **rbn, *parent;
-	struct sk_buff *skb1;
+	struct sk_buff *skb1, *prev_tail;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -416,38 +422,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 */
 
 	/* Find out where to put this fragment.  */
-	skb1 = qp->q.fragments_tail;
-	if (!skb1) {
-		/* This is the first fragment we've received. */
-		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-		qp->q.fragments_tail = skb;
-	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-		/* This is the common/special case: skb goes to the end. */
+	prev_tail = qp->q.fragments_tail;
+	if (!prev_tail)
+		ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+	else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
-		if (offset < (skb1->ip_defrag_offset + skb1->len))
+		if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
 			goto discard_qp;
-		/* Insert after skb1. */
-		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-		qp->q.fragments_tail = skb;
+		if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+			ip4_frag_append_to_last_run(&qp->q, skb);
+		else
+			ip4_frag_create_run(&qp->q, skb);
 	} else {
-		/* Binary search. Note that skb can become the first fragment, but
-		 * not the last (covered above). */
+		/* Binary search. Note that skb can become the first fragment,
+		 * but not the last (covered above).
+		 */
 		rbn = &qp->q.rb_fragments.rb_node;
 		do {
 			parent = *rbn;
 			skb1 = rb_to_skb(parent);
 			if (end <= skb1->ip_defrag_offset)
 				rbn = &parent->rb_left;
-			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+			else if (offset >= skb1->ip_defrag_offset +
+						FRAG_CB(skb1)->frag_run_len)
 				rbn = &parent->rb_right;
 			else /* Found an overlap with skb1. */
 				goto discard_qp;
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
-		 * one of its NULL left/right children. Insert skb. */
+		 * one of its NULL left/right children. Insert skb.
+		 */
+		ip4_frag_init_run(skb);
 		rb_link_node(&skb->rbnode, parent, rbn);
+		rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 	}
-	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
 	if (dev)
 		qp->iif = dev->ifindex;
@@ -474,7 +483,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, skb, dev);
+		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -493,7 +502,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
-			 struct net_device *dev)
+			 struct sk_buff *prev_tail, struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -517,10 +526,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
+		if (RB_EMPTY_NODE(&skb->rbnode))
+			FRAG_CB(prev_tail)->next_frag = fp;
+		else
+			rb_replace_node(&skb->rbnode, &fp->rbnode,
+					&qp->q.rb_fragments);
 		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
 		skb_morph(skb, head);
+		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
 		rb_replace_node(&head->rbnode, &skb->rbnode,
 				&qp->q.rb_fragments);
 		consume_skb(head);
@@ -556,7 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		skb->truesize += clone->truesize;
+		head->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
@@ -569,24 +584,36 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	skb_push(head, head->data - skb_network_header(head));
 
 	/* Traverse the tree in order, to build frag_list. */
+	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &qp->q.rb_fragments);
-	while (rbn) {
-		struct rb_node *rbnext = rb_next(rbn);
-		fp = rb_to_skb(rbn);
-		rb_erase(rbn, &qp->q.rb_fragments);
-		rbn = rbnext;
-		*nextp = fp;
-		nextp = &fp->next;
-		fp->prev = NULL;
-		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-		head->data_len += fp->len;
-		head->len += fp->len;
-		if (head->ip_summed != fp->ip_summed)
-			head->ip_summed = CHECKSUM_NONE;
-		else if (head->ip_summed == CHECKSUM_COMPLETE)
-			head->csum = csum_add(head->csum, fp->csum);
-		head->truesize += fp->truesize;
+	while (rbn || fp) {
+		/* fp points to the next sk_buff in the current run;
+		 * rbn points to the next run.
+		 */
+		/* Go through the current run. */
+		while (fp) {
+			*nextp = fp;
+			nextp = &fp->next;
+			fp->prev = NULL;
+			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			head->data_len += fp->len;
+			head->len += fp->len;
+			if (head->ip_summed != fp->ip_summed)
+				head->ip_summed = CHECKSUM_NONE;
+			else if (head->ip_summed == CHECKSUM_COMPLETE)
+				head->csum = csum_add(head->csum, fp->csum);
+			head->truesize += fp->truesize;
+			fp = FRAG_CB(fp)->next_frag;
+		}
+		/* Move to the next run. */
+		if (rbn) {
+			struct rb_node *rbnext = rb_next(rbn);
+
+			fp = rb_to_skb(rbn);
+			rb_erase(rbn, &qp->q.rb_fragments);
+			rbn = rbnext;
+		}
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
@@ -622,6 +649,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	return 0;
 
 out_nomem:
-- 
2.17.1
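
The tail-handling decision in ip_frag_queue() above can be followed more easily in a stripped-down model. This sketch is hypothetical throughout: demo_frag and demo_queue stand in for sk_buff plus skb->cb and for inet_frag_queue, nr_runs stands in for the number of rb-tree nodes, and the out-of-order case simply returns a code where the kernel performs the tree search.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fragment record; the kernel keeps this in sk_buff/skb->cb. */
struct demo_frag {
	int offset;
	int len;
	struct demo_frag *next_frag;	/* next fragment in the same run */
	int frag_run_len;		/* meaningful on the head of a run */
};

struct demo_queue {
	struct demo_frag *fragments_tail;
	struct demo_frag *last_run_head;
	int nr_runs;			/* stands in for rb-tree nodes */
};

enum demo_result {
	DEMO_NEW_RUN, DEMO_APPENDED, DEMO_OVERLAP, DEMO_OUT_OF_ORDER
};

static void demo_create_run(struct demo_queue *q, struct demo_frag *f)
{
	f->next_frag = NULL;
	f->frag_run_len = f->len;
	q->last_run_head = f;
	q->fragments_tail = f;
	q->nr_runs++;			/* only a new run adds a tree node */
}

static void demo_append_to_last_run(struct demo_queue *q, struct demo_frag *f)
{
	f->next_frag = NULL;
	f->frag_run_len = f->len;
	q->fragments_tail->next_frag = f;
	q->last_run_head->frag_run_len += f->len;
	q->fragments_tail = f;
}

static enum demo_result demo_queue_frag(struct demo_queue *q,
					struct demo_frag *f)
{
	struct demo_frag *tail;
	int end, tail_end;

	assert(q && f);
	tail = q->fragments_tail;
	end = f->offset + f->len;
	if (!tail) {
		demo_create_run(q, f);	/* first fragment */
		return DEMO_NEW_RUN;
	}
	tail_end = tail->offset + tail->len;
	if (tail_end < end) {
		if (f->offset < tail_end)
			return DEMO_OVERLAP;
		if (f->offset == tail_end) {
			demo_append_to_last_run(q, f);
			return DEMO_APPENDED;
		}
		demo_create_run(q, f);	/* gap after the tail */
		return DEMO_NEW_RUN;
	}
	return DEMO_OUT_OF_ORDER;	/* kernel: binary search in the tree */
}
```

In the model, a stream of adjacent fragments keeps nr_runs at 1, which corresponds to the in-order case avoiding a tree insertion per fragment.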

^ permalink raw reply related

* [PATCH stable 4.9 v2 26/29] ip: add helpers to process in-order fragments faster.
From: Florian Fainelli @ 2018-10-10 19:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, gregkh, stable, edumazet, sthemmin, Peter Oskolkov,
	Florian Westphal
In-Reply-To: <20181010193017.25221-1-f.fainelli@gmail.com>

From: Peter Oskolkov <posk@google.com>

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 ++++
 net/ipv4/ip_fragment.c  | 73 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1ff0433d94a7..a3812e9c8fee 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -56,7 +56,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -77,6 +79,7 @@ struct inet_frag_queue {
 	struct sk_buff		*fragments;  /* Used in IPv6. */
 	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
+	struct sk_buff		*last_run_head;
 	ktime_t			stamp;
 	int			len;
 	int			meat;
@@ -112,6 +115,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
 	if (atomic_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 11d3dc649ef0..5e2121b82588 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -56,6 +56,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+	struct inet_skb_parm	h;
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
+};
+
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+					struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	ip4_frag_init_run(skb);
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
 	struct inet_frag_queue q;
@@ -652,6 +703,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+			sum += skb->truesize;
+			kfree_skb(skb);
+			skb = next;
+		}
+	}
+	return sum;
+}
+EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
 #ifdef CONFIG_SYSCTL
 static int dist_min;
 
-- 
2.17.1


* [PATCH stable 4.9 v2 25/29] ip: use rb trees for IP frag queue.
From: Florian Fainelli @ 2018-10-10 19:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, gregkh, stable, edumazet, sthemmin, Peter Oskolkov,
	Florian Westphal
In-Reply-To: <20181010193017.25221-1-f.fainelli@gmail.com>

From: Peter Oskolkov <posk@google.com>

(commit fa0f527358bd900ef92f925878ed6bfbd51305cc upstream)

Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.
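
For readers unfamiliar with the approach, here is a minimal standalone
sketch of an offset-ordered tree insert that also detects overlaps. It is
an assumption-laden model, not the kernel implementation: struct frag and
frag_insert() are hypothetical names, and a plain unbalanced binary search
tree stands in for the kernel rb-tree (no rebalancing step is shown).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued fragment covering
 * [offset, offset + len).
 */
struct frag {
	int offset;
	int len;
	struct frag *left, *right;
};

/* Insert f into the tree rooted at *root, ordered by offset.
 * Returns 0 on success, or -1 if f overlaps an existing fragment
 * (in which case the tree is left unchanged).
 */
static int frag_insert(struct frag **root, struct frag *f)
{
	struct frag **link = root;
	int end = f->offset + f->len;

	assert(f->len > 0);
	f->left = f->right = NULL;

	while (*link) {
		struct frag *cur = *link;

		if (end <= cur->offset)
			link = &cur->left;	/* entirely before cur */
		else if (f->offset >= cur->offset + cur->len)
			link = &cur->right;	/* entirely after cur */
		else
			return -1;		/* overlaps cur */
	}
	*link = f;	/* link points at the empty child slot */
	return 0;
}
```

In a balanced tree the search costs O(log n) comparisons per out-of-order
fragment, compared with a linear walk of a linked list.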

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h                  |   4 +-
 include/net/inet_frag.h                 |   3 +-
 net/ipv4/inet_fragment.c                |  16 ++-
 net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 6 files changed, 117 insertions(+), 90 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7e7e12aeaf82..e90fe6b83e00 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -643,14 +643,14 @@ struct sk_buff {
 				struct skb_mstamp skb_mstamp;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
 	};
 
 	union {
+		struct sock		*sk;
 		int			ip_defrag_offset;
 	};
 
-	struct sock		*sk;
 	struct net_device	*dev;
 
 	/*
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f47678d2ccc2..1ff0433d94a7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -74,7 +74,8 @@ struct inet_frag_queue {
 	struct timer_list	timer;
 	spinlock_t		lock;
 	atomic_t		refcnt;
-	struct sk_buff		*fragments;
+	struct sk_buff		*fragments;  /* Used in IPv6. */
+	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
 	int			len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47c240f50b99..535fa57af51e 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 	fp = q->fragments;
 	nf = q->net;
 	f = nf->f;
-	while (fp) {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
+	if (fp) {
+		do {
+			struct sk_buff *xp = fp->next;
+
+			sum_truesize += fp->truesize;
+			kfree_skb(fp);
+			fp = xp;
+		} while (fp);
+	} else {
+		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8bfb34e9ea32..11d3dc649ef0 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -134,7 +134,7 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(unsigned long arg)
 {
 	const struct iphdr *iph;
-	struct sk_buff *head;
+	struct sk_buff *head = NULL;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -150,14 +150,31 @@ static void ip_expire(unsigned long arg)
 
 	ipq_kill(qp);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-	head = qp->q.fragments;
-
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN))
 		goto out;
 
+	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
+	 * pull the head out of the tree in order to be able to
+	 * deal with head->dev.
+	 */
+	if (qp->q.fragments) {
+		head = qp->q.fragments;
+		qp->q.fragments = head->next;
+	} else {
+		head = skb_rb_first(&qp->q.rb_fragments);
+		if (!head)
+			goto out;
+		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		memset(&head->rbnode, 0, sizeof(head->rbnode));
+		barrier();
+	}
+	if (head == qp->q.fragments_tail)
+		qp->q.fragments_tail = NULL;
+
+	sub_frag_mem_limit(qp->q.net, head->truesize);
+
 	head->dev = dev_get_by_index_rcu(net, qp->iif);
 	if (!head->dev)
 		goto out;
@@ -177,16 +194,16 @@ static void ip_expire(unsigned long arg)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	skb_get(head);
 	spin_unlock(&qp->q.lock);
 	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
-	kfree_skb(head);
 	goto out_rcu_unlock;
 
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
 	rcu_read_unlock();
+	if (head)
+		kfree_skb(head);
 	ipq_put(qp);
 }
 
@@ -229,7 +246,7 @@ static int ip_frag_too_far(struct ipq *qp)
 	end = atomic_inc_return(&peer->rid);
 	qp->rid = end;
 
-	rc = qp->q.fragments && (end - start) > max;
+	rc = qp->q.fragments_tail && (end - start) > max;
 
 	if (rc) {
 		struct net *net;
@@ -243,7 +260,6 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	struct sk_buff *fp;
 	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
@@ -251,20 +267,14 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	fp = qp->q.fragments;
-	do {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
-	} while (fp);
+	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
 	qp->q.meat = 0;
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
@@ -276,7 +286,8 @@ static int ip_frag_reinit(struct ipq *qp)
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
-	struct sk_buff *prev, *next;
+	struct rb_node **rbn, *parent;
+	struct sk_buff *skb1;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -339,58 +350,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	if (err)
 		goto err;
 
-	/* Find out which fragments are in front and at the back of us
-	 * in the chain of fragments so far.  We must know where to put
-	 * this fragment, right?
-	 */
-	prev = qp->q.fragments_tail;
-	if (!prev || prev->ip_defrag_offset < offset) {
-		next = NULL;
-		goto found;
-	}
-	prev = NULL;
-	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (next->ip_defrag_offset >= offset)
-			break;	/* bingo! */
-		prev = next;
-	}
+	/* Note : skb->rbnode and skb->dev share the same location. */
+	dev = skb->dev;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
 
-found:
 	/* RFC5722, Section 4, amended by Errata ID : 3089
 	 *                          When reassembling an IPv6 datagram, if
 	 *   one or more its constituent fragments is determined to be an
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4.
+	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
 
-	/* Is there an overlap with the previous fragment? */
-	if (prev &&
-	    (prev->ip_defrag_offset + prev->len) > offset)
-		goto discard_qp;
-
-	/* Is there an overlap with the next fragment? */
-	if (next && next->ip_defrag_offset < end)
-		goto discard_qp;
+	/* Find out where to put this fragment.  */
+	skb1 = qp->q.fragments_tail;
+	if (!skb1) {
+		/* This is the first fragment we've received. */
+		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
+		qp->q.fragments_tail = skb;
+	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
+		/* This is the common/special case: skb goes to the end. */
+		/* Detect and discard overlaps. */
+		if (offset < (skb1->ip_defrag_offset + skb1->len))
+			goto discard_qp;
+		/* Insert after skb1. */
+		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
+		qp->q.fragments_tail = skb;
+	} else {
+		/* Binary search. Note that skb can become the first fragment, but
+		 * not the last (covered above). */
+		rbn = &qp->q.rb_fragments.rb_node;
+		do {
+			parent = *rbn;
+			skb1 = rb_to_skb(parent);
+			if (end <= skb1->ip_defrag_offset)
+				rbn = &parent->rb_left;
+			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+				rbn = &parent->rb_right;
+			else /* Found an overlap with skb1. */
+				goto discard_qp;
+		} while (*rbn);
+		/* Here we have parent properly set, and rbn pointing to
+		 * one of its NULL left/right children. Insert skb. */
+		rb_link_node(&skb->rbnode, parent, rbn);
+	}
+	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
-	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
-	dev = skb->dev;
 	if (dev)
 		qp->iif = dev->ifindex;
-	/* Makes sure compiler wont do silly aliasing games */
-	barrier();
 	skb->ip_defrag_offset = offset;
 
-	/* Insert this fragment in the chain of fragments. */
-	skb->next = next;
-	if (!next)
-		qp->q.fragments_tail = skb;
-	if (prev)
-		prev->next = skb;
-	else
-		qp->q.fragments = skb;
-
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -412,7 +423,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, skb, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -429,15 +440,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }
 
-
 /* Build a new IP datagram from all its fragments. */
-
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
-	struct sk_buff *fp, *head = qp->q.fragments;
+	struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
+	struct sk_buff **nextp; /* To build frag_list. */
+	struct rb_node *rbn;
 	int len;
 	int ihlen;
 	int err;
@@ -451,25 +462,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		goto out_fail;
 	}
 	/* Make the one we just received the head. */
-	if (prev) {
-		head = prev->next;
-		fp = skb_clone(head, GFP_ATOMIC);
+	if (head != skb) {
+		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-
-		fp->next = head->next;
-		if (!fp->next)
+		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
-		prev->next = fp;
-
-		skb_morph(head, qp->q.fragments);
-		head->next = qp->q.fragments->next;
-
-		consume_skb(qp->q.fragments);
-		qp->q.fragments = head;
+		skb_morph(skb, head);
+		rb_replace_node(&head->rbnode, &skb->rbnode,
+				&qp->q.rb_fragments);
+		consume_skb(head);
+		head = skb;
 	}
 
-	WARN_ON(!head);
 	WARN_ON(head->ip_defrag_offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
@@ -494,24 +500,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
 			goto out_nomem;
-		clone->next = head->next;
-		head->next = clone;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		head->data_len -= clone->len;
-		head->len -= clone->len;
+		skb->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
+		skb_shinfo(head)->frag_list = clone;
+		nextp = &clone->next;
+	} else {
+		nextp = &skb_shinfo(head)->frag_list;
 	}
 
-	skb_shinfo(head)->frag_list = head->next;
 	skb_push(head, head->data - skb_network_header(head));
 
-	for (fp=head->next; fp; fp = fp->next) {
+	/* Traverse the tree in order, to build frag_list. */
+	rbn = rb_next(&head->rbnode);
+	rb_erase(&head->rbnode, &qp->q.rb_fragments);
+	while (rbn) {
+		struct rb_node *rbnext = rb_next(rbn);
+		fp = rb_to_skb(rbn);
+		rb_erase(rbn, &qp->q.rb_fragments);
+		rbn = rbnext;
+		*nextp = fp;
+		nextp = &fp->next;
+		fp->prev = NULL;
+		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
 		head->data_len += fp->len;
 		head->len += fp->len;
 		if (head->ip_summed != fp->ip_summed)
@@ -522,7 +539,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
+	*nextp = NULL;
 	head->next = NULL;
+	head->prev = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
 	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
@@ -550,6 +569,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	return 0;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b81541701346..907c2d5753dd 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -470,6 +470,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 
 	return true;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 78656bbe50e7..74ffbcb306a6 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -466,6 +466,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
 	rcu_read_unlock();
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 	return 1;
 
-- 
2.17.1
