Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Alexei Starovoitov @ 2018-04-09 16:47 UTC (permalink / raw)
  To: Yonghong Song, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180409161819.4024793-1-yhs@fb.com>

On 4/9/18 9:18 AM, Yonghong Song wrote:
> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
...
> @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  		return -EINVAL;
>  	if (copy_from_user(&query, uquery, sizeof(query)))
>  		return -EFAULT;
> -	if (query.ids_len > BPF_TRACE_MAX_PROGS)
> +
> +	ids_len = query.ids_len;
> +	if (ids_len > BPF_TRACE_MAX_PROGS)
>  		return -E2BIG;
> +	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
> +	if (!ids)
> +		return -ENOMEM;
>
>  	mutex_lock(&bpf_event_mutex);
>  	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
> -				       uquery->ids,
> -				       query.ids_len,
> -				       &uquery->prog_cnt);
> +				       ids,
> +				       ids_len,
> +				       &prog_cnt);
>  	mutex_unlock(&bpf_event_mutex);
>
> +	if (!ret || ret == -ENOSPC) {
> +		if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
> +		    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +
> +out:
> +	kfree(ids);

alloc/free just to avoid this locking dependency feels suboptimal.

may be we can get rid of bpf_event_mutex in some cases?
the perf event itself is locked via perf_event_ctx_lock() when we're
calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
I forgot what was the motivation for us to introduce it in the
first place.

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-09 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <alpine.DEB.2.21.1803280808490.3247@nanos.tec.linutronix.de>

Hi Thomas,


On 03/28/2018 12:48 AM, Thomas Gleixner wrote:

(...)

>
> There are two modes:
>
>       1) Send at the given TX time (Explicit mode)
>
>       2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.


Ok, I see the difference now, thanks. I have just two more questions about the
deadline mode, please see below.

(...)


>
>>> Coming back to the overall scheme. If you start upfront with a time slice
>>> manager which is designed to:
>>>
>>>   - Handle multiple channels
>>>
>>>   - Expose the time constraints, properties per channel
>>>
>>> then you can fit all kind of use cases, whether designed by committee or
>>> not. You can configure that thing per node or network wide. It does not
>>> make a difference. The only difference are the resulting constraints.
>>
>>
>> Ok, and I believe the above was covered by what we had proposed before, unless
>> what you meant by time constraints is beyond the configured port schedule.
>>
>> Are you suggesting that we'll need to have a kernel entity that is not only
>> aware of the current traffic classes 'schedule', but also of the resources that
>> are still available for new streams to be accommodated into the classes? Putting
>> it differently, is the TAS you envision just an entity that runs a schedule, or
>> is it a time-aware 'orchestrator'?
>
> In the first place its something which runs a defined schedule.
>
> The accomodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's not longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as its designed in a way that it can be
> added later on.


Ok, agreed.


>
>>> So lets look once more at the picture in an abstract way:
>>>
>>>      	       [ NIC ]
>>> 	          |
>>> 	 [ Time slice manager ]
>>> 	    |           |
>>>          [ Ch 0 ] ... [ Ch N ]
>>>
>>> So you have a bunch of properties here:
>>>
>>> 1) Number of Channels ranging from 1 to N
>>>
>>> 2) Start point, slice period and slice length per channel
>>
>> Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
>> classes, do you have something else in mind other than a new root qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule how simple or how complex it may be.


Ok, and I've finally understood the nuance between the above and what we had
planned initially.


(...)


>>
>> * TAS:
>>
>>    The idea we are currently exploring is to add a "time-aware", priority based
>>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
>>
>>    $ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4    \
>>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
>> 	   queues 0 1 2 3                                              \
>>      	   sched-file gates.sched [base-time <interval>]               \
>>            [cycle-time <interval>] [extension-time <interval>]
>>
>>    <file> is multi-line, with each line being of the following format:
>>    <cmd> <gate mask> <interval in nanoseconds>
>>
>>    Qbv only defines one <cmd>: "S" for 'SetGates'
>>
>>    For example:
>>
>>    S 0x01 300
>>    S 0x03 500
>>
>>    This means that there are two intervals, the first will have the gate
>>    for traffic class 0 open for 300 nanoseconds, the second will have
>>    both traffic classes open for 500 nanoseconds.
>
> To accomodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.


Sure, and the [base-time <interval>] on the command line above was actually
wrong. It should have been expressed as [base-time <timestamp>].



>> It would handle multiple channels and expose their constraints / properties.
>> Each channel also becomes a traffic class, so other qdiscs can be attached to
>> them separately.
>
> Right.
>
>> So, in summary, because our entire design is based on qdisc interfaces, what we
>> had proposed was a root qdisc (the time slice manager, as you put) that allows
>> for other qdiscs to be attached to each channel. The inner qdiscs define the
>> queueing modes for each channel, and tbs is just one of those modes. I
>> understand now that you want to allow for fully dynamic use-cases to be
>> supported as well, which we hadn't covered with our TAS proposal before because
>> we hadn't envisioned it being used for these systems' design.
>
> Yes, you have the root qdisc, which is in charge of the overall scheduling
> plan, how complex or not it is defined does not matter. It exposes traffic
> classes which have properties defined by the configuration.


Perfect. Let's see if we can agree on an overall plan, then. Hopefully I'm not
missing anything.

For the above we'll develop a new qdisc, designed along the 'taprio' ideas, thus
a Qbv style scheduler, to be used as root qdisc. It can run the schedule inside
the kernel or just offload it to the NIC if supported. Similarly to the other
multiqueue qdiscs, it will expose the HW Tx queues.

What is new here from the ideas we shared last year is that this new root qdisc
will be responsible for calling the attached qdiscs' dequeue functions during
their timeslices, making it the only entity capable of enqueueing packets into
the NIC.

This is the "global scheduler", but we still need the txtime aware qdisc. For
that, we'll modify tbs to accommodate the feedback from this thread. More below.


>
> The qdiscs which are attached to those traffic classes can be anything
> including:
>
>  - Simple feed through (Applications are time contraints aware and set the
>    exact schedule). qdisc has admission control.


This will be provided by the tbs qdisc. It will still provide a txtime sorted
list and hw offload, but now there will be a per-socket option that tells the
qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
called it) or a deadline. The drop_if_late flag will be removed.

When in explicit mode, packets from that socket are dequeued from the qdisc
during its time slice if their [(txtime - delta) < now].


>
>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>    of time constraints and provide the packet deadline. qdisc has admission
>    control. This can be a simple first comes, first served scheduler or
>    something like EDF which allows optimized utilization. The qdisc sets
>    the TX time depending on the deadline and feeds into the root.


This will be provided by tbs if the socket which is transmitting packets is
configured for deadline mode.

For the deadline -> txtime conversion, what I have in mind is: when dequeue is
called tbs will just change the skbuff's timestamp from the deadline to 'now'
(i.e. as soon as possible) and dequeue the packet. Would that be enough or
should we use the delta parameter of the qdisc on this case add make [txtime =
now + delta]? The only benefit of doing so would be to provide a configurable
'fudge' factor.

Another question for this mode (but perhaps that applies to both modes) is, what
if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
the packet during dequeue.


Putting it all together, we end up with:

1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
$ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting

2) a new cmsg-interface for setting a per-packet timestamp that will be used
either as a txtime or as deadline by tbs (and further the NIC driver for the
offlaod case): SCM_TXTIME.

3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
socket, and will have as parameters a clockid and a txtime mode (deadline or
explicit), that defines the semantics of the timestamp set on packets using
SCM_TXTIME.

4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .

5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
will look like what was proposed for taprio (base time being an absolute timestamp).



If we all agree with the above, we will start by closing on 1-4 asap and will
focus on 5 next.

How does that sound?

Thanks,
Jesus



>
>  - FIFO/PRIO/XXX for general traffic. Applications do not know anything
>    about timing constraints. These qdiscs obviously have neither admission
>    control nor do they set a TX time.  The root qdisc just pulls from there
>    when the assigned time slot is due or if it (optionally) decides to use
>    underutilized time slots from other classes.
>
>  - .... Add your favourite scheduling mode(s).
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply

* [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-09 16:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
  ======================================================
  WARNING: possible circular locking dependency detected
  4.16.0-rc7+ #3 Not tainted
  ------------------------------------------------------
  syz-executor7/24531 is trying to acquire lock:
   (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854

  but task is already holding lock:
   (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&mm->mmap_sem){++++}:
       __might_fault+0x13a/0x1d0 mm/memory.c:4571
       _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
       copy_to_user include/linux/uaccess.h:155 [inline]
       bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
       perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
       _perf_ioctl kernel/events/core.c:4750 [inline]
       perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
       vfs_ioctl fs/ioctl.c:46 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
       SYSC_ioctl fs/ioctl.c:701 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  -> #0 (bpf_event_mutex){+.+.}:
       lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
       __mutex_lock_common kernel/locking/mutex.c:756 [inline]
       __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
       perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
       perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
       _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
       put_event+0x24/0x30 kernel/events/core.c:4204
       perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
       remove_vma+0xb4/0x1b0 mm/mmap.c:172
       remove_vma_list mm/mmap.c:2490 [inline]
       do_munmap+0x82a/0xdf0 mm/mmap.c:2731
       mmap_region+0x59e/0x15a0 mm/mmap.c:1646
       do_mmap+0x6c0/0xe00 mm/mmap.c:1483
       do_mmap_pgoff include/linux/mm.h:2223 [inline]
       vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
       SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
       SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
       SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
       SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_sem);
                                 lock(bpf_event_mutex);
                                 lock(&mm->mmap_sem);
    lock(bpf_event_mutex);

   *** DEADLOCK ***
  ======================================================

The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.

As suggested by Danial, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.

Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  4 ++--
 kernel/bpf/core.c        | 25 +++++++++++++++++++------
 kernel/trace/bpf_trace.c | 24 ++++++++++++++++++++----
 3 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
 				struct bpf_prog *old_prog);
 int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
-			     __u32 __user *prog_ids, u32 request_cnt,
-			     __u32 __user *prog_cnt);
+			     u32 *prog_ids, u32 request_cnt,
+			     u32 *prog_cnt);
 int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 			struct bpf_prog *exclude_prog,
 			struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..a95a7de 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1683,22 +1683,35 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 }
 
 int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
-			     __u32 __user *prog_ids, u32 request_cnt,
-			     __u32 __user *prog_cnt)
+			     u32 *prog_ids, u32 request_cnt,
+			     u32 *prog_cnt)
 {
-	u32 cnt = 0;
+	struct bpf_prog **prog;
+	u32 i = 0, cnt = 0;
 
 	if (array)
 		cnt = bpf_prog_array_length(array);
 
-	if (copy_to_user(prog_cnt, &cnt, sizeof(cnt)))
-		return -EFAULT;
+	*prog_cnt = cnt;
 
 	/* return early if user requested only program count or nothing to copy */
 	if (!request_cnt || !cnt)
 		return 0;
 
-	return bpf_prog_array_copy_to_user(array, prog_ids, request_cnt);
+	/* this function is called under trace/bpf_trace.c: bpf_event_mutex */
+	prog = rcu_dereference_check(array, 1)->progs;
+	for (; *prog; prog++) {
+		if (*prog == &dummy_bpf_prog.prog)
+			continue;
+		prog_ids[i] = (*prog)->aux->id;
+		if (++i == request_cnt) {
+			prog++;
+			break;
+		}
+	}
+	if (!!(*prog))
+		return -ENOSPC;
+	return 0;
 }
 
 static void bpf_prog_free_deferred(struct work_struct *work)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..bd1e458 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	struct perf_event_query_bpf __user *uquery = info;
 	struct perf_event_query_bpf query = {};
+	u32 *ids, prog_cnt, ids_len;
 	int ret;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 		return -EINVAL;
 	if (copy_from_user(&query, uquery, sizeof(query)))
 		return -EFAULT;
-	if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+	ids_len = query.ids_len;
+	if (ids_len > BPF_TRACE_MAX_PROGS)
 		return -E2BIG;
+	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+	if (!ids)
+		return -ENOMEM;
 
 	mutex_lock(&bpf_event_mutex);
 	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-				       uquery->ids,
-				       query.ids_len,
-				       &uquery->prog_cnt);
+				       ids,
+				       ids_len,
+				       &prog_cnt);
 	mutex_unlock(&bpf_event_mutex);
 
+	if (!ret || ret == -ENOSPC) {
+		if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+		    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	kfree(ids);
 	return ret;
 }
 
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-09 15:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <87in953ryi.fsf@xmission.com>

On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >>>
> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >>>
> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >>> network devices. Uevents for network devices only show up in the network
> >> >>> namespaces these devices are moved to or created in.
> >> >>>
> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >>> into all network namespaces.
> >> >>>
> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >>> partially isolated as they were filtered by user namespaces:
> >> >>>
> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >>>
> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >>>                              CAP_NET_BROADCAST))
> >> >>>         j       return;
> >> >>> }
> >> >>>
> >> >>> The file_ns_capable() check will check whether the caller had
> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >>> been the intention all along. But there's one case where this breaks,
> >> >>> namely if a new user namespace is created by root on the host and an
> >> >>> identity mapping is established between root on the host and root in the
> >> >>> new user namespace. Here's a reproducer:
> >> >>>
> >> >>>  sudo unshare -U --map-root
> >> >>>  udevadm monitor -k
> >> >>>  # Now change to initial user namespace and e.g. do
> >> >>>  modprobe kvm
> >> >>>  # or
> >> >>>  rmmod kvm
> >> >>>
> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >>> host. This seems very anecdotal given that in the general case user
> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >>> with them.
> >> >>>
> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >>> make a decision what uevents should be sent.
> >> >>>
> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >>> in kobj_bcast_filter(). Specifically:
> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >>>   event will always be filtered.
> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >>>   the event will be filtered as well.
> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >>> always only sent to the initial user namespace. The regression potential
> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >>> anything with interesting devices.
> >> >>>
> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >>> ---
> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >>>
> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >>> --- a/lib/kobject_uevent.c
> >> >>> +++ b/lib/kobject_uevent.c
> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >>>  		return sock_ns != ns;
> >> >>>  	}
> >> >>>  
> >> >>> -	return 0;
> >> >>> +	/*
> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >>> +	 * namespace below.
> >> >>> +	 */
> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >>> +		return 1;
> >> >>> +
> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >>>  }
> >> >>>  #endif
> >> >>
> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> > 
> >> > No, this is not correct: As it is right now *without my patch* no
> >> > non-initial user namespace is receiving *any uevents* but those
> >> > specifically namespaced such as those for network devices. This patch
> >> > doesn't change that at all. The commit message outlines this in detail
> >> > how this comes about.
> >> > There is only one case where this currently breaks and this is as I
> >> > outlined explicitly in my commit message when you create a new user
> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> 
> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> Now it will return 1 sometimes.
> >
> > Oh sure, it's in the commit message though. The callchain is
> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> > net/netlink/af_netlink.c:do_one_broadcast():
> >
> > This codepiece will check whether the openened socket holds
> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> > which it won't because we don't have device namespaces and all devices
> > belong to the initial set of namespaces.
> >
> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >                              CAP_NET_BROADCAST))
> >         j       return;
> >
> 
> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> on their socket and has had someone with the appropriate privileges
> assign a peerrnetid.
> 
> All of which is to say that file_ns_capable is not nearly as applicable
> as it might be, and if you can pass the other two checks I think it is
> pointless (because the peernet attributes are not generated for
> kobj_uevents) but valid to receive events from outside your network
> namespace.
> 
> 
> I might be missing something but I don't see anything excluding network
> namespaces owned by !init_user_ns excluded from the kobject_uevent
> logic.
> 
> The uevent_sock_list has one entry per network namespace. And
> kobject_uevent_net_broacast appears to walk each one.
> 
> I had a memory of filtering uevent messages and I had a memory
> that the netlink_has_listeners had a per network namespace effect.
> Neither seems true from my inspection of the code tonight.
> 
> If we are not filtering ordinary uevents at least at the user namespace
> level that does seem to be at least a nuisance.
> 
> 
> Christian can you dig a little deeper into this.  I have a feeling that
> there are some real efficiency improvements that we could make to
> kobject_uevent_net_broadcast if nothing else.
> 
> Perhaps you could see where uevents are broadcast by poking
> the sysfs uevent of an existing device, and triggering a retransmit.

@Eric, so I did some intensive testing over the weekend and forget everything I
said before. Uevents are not filtered by the kernel at the moment. This is
currently - apart from network devices - a pure userspace thing. Specifically,
anyone  on the host can listen to all uevents from everywhere. It's neither
filtered by user nor by network namespace. The fact that I used

udevadm --debug monitor

to test my prior hypothesis was what led me to believe that uevents are already
correctly filtered.
Instead, what is actually happening is that every udev implementation out there
is discarding uevents that were send by uids != 0 in the CMSG_DATA.
Specifically,

- legacy standalone udevd:
  https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
- eudevd
  https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
- systemd-udevd
  https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
- ueventd (Android)
  https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81

For all of those I could trace this behavior back to the first released
version. (To be precise, for legacy udevd that eventually became systemd-udevd
I could trace it back to the first version that is still available on
git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
trivially true that it has the same behavior from the beginning.)
Android filters uevents in the same way but removed that check on January 8
2018 for what I think is invalid reasoning. The good news is that this is only
in their master branch. It hasn't even made it into an rc release for Android 8
yet. I filed a bug against Android and offered them a fix if they agree.

In any case, userspace udev is not making use of uevents at all right now since
any uid != 0 events are **explicitly** discarded.
The fact that you receive uevents for

sudo unshare -U --map-root -n
udevadm --debug monitor

is simply explained by the fact that container(0) <=> host(0) at which point
the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
discard it.
The use case for receiving uevents in containers/user namespaces is definitely
there but that's what the uevent injection patch series was for that we merged.
This is a much safer and saner solution.
The fact that all udev implementations filter uevents by uid != 0 very much
seems like a security mechanism in userspace that we probably should provide by
isolating uevents based on user and/or network namespaces.

Christian

^ permalink raw reply

* Re: Enable and configure storm prevention in a network device
From: Andrew Lunn @ 2018-04-09 15:18 UTC (permalink / raw)
  To: Murali Karicheri; +Cc: Florian Fainelli, David Miller, netdev
In-Reply-To: <51604140-bafe-2fd6-65b1-0a3b732c83bd@ti.com>

> > The Marvell switches have leaky buckets, which can be used for
> > limiting broadcast and multicast packets, as well as traffic shaping
> > in general. Storm prevention is just a form of traffic shaping, so if
> > we have generic traffic shaping, it can be used for storm prevention.
> > 
> TI's CPSW hardware as well has similar capability to limit broadcast and
> multicast packets at the ingress. Isn't it a traffic policing at the Ingress
> rather than traffic shaping as the hardware drops the frames at the ingress
> if the rate exceeds a limit?

Hi Murali

It depends on the generation of Marvell switches. Older ones have just
egress traffic shaping. Newer ones also have ingress rate limiting.

       Andrew

^ permalink raw reply

* Re: Enable and configure storm prevention in a network device
From: Murali Karicheri @ 2018-04-09 15:13 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli; +Cc: David Miller, netdev
In-Reply-To: <20180406143021.GK17495@lunn.ch>

Andrew,

On 04/06/2018 10:30 AM, Andrew Lunn wrote:
> On Thu, Apr 05, 2018 at 03:35:06PM -0700, Florian Fainelli wrote:
>> On 04/05/2018 01:20 PM, David Miller wrote:
>>> From: Murali Karicheri <m-karicheri2@ti.com>
>>> Date: Thu, 5 Apr 2018 16:14:49 -0400
>>>
>>>> Is there a standard way to implement and configure storm prevention
>>>> in a Linux network device?
>>>
>>> What kind of "storm", an interrupt storm?
>>>
>>
>> I would assume Murali is referring to L2 broadcast storms which is
>> common in switches. There is not an API for that AFAICT and I am not
>> sure what a proper API would look like.
> 
> tc?
> 
> The Marvell switches have leaky buckets, which can be used for
> limiting broadcast and multicast packets, as well as traffic shaping
> in general. Storm prevention is just a form of traffic shaping, so if
> we have generic traffic shaping, it can be used for storm prevention.
> 
TI's CPSW hardware as well has similar capability to limit broadcast and
multicast packets at the ingress. Isn't it a traffic policing at the Ingress
rather than traffic shaping as the hardware drops the frames at the ingress
if the rate exceeds a limit?

I haven't done any work on TC, but with my limited knowledge of TC, it
seems we might be able to use TC to offload the TC policing to the hardware
through ndo_setup_tc()? Could someone shed some light on how to do this
with tc? Some tc filer command and then some hook in kernel to offload this
to hardware?

Murali

>    Andrew
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer
From: Eric Dumazet @ 2018-04-09 15:08 UTC (permalink / raw)
  To: David Miller, edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180409.105839.445341256424332032.davem@davemloft.net>



On 04/09/2018 07:58 AM, David Miller wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon,  9 Apr 2018 06:43:27 -0700
> 
>> syzbot/KMSAN reported that p->dtime was read while it was
>> not yet initialized in :
>>
>> 	delta = (__u32)jiffies - p->dtime;
>> 	if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
>> 		gc_stack[i] = NULL;
>>
>> This is a false positive, because the inetpeer wont be erased
>> from rb-tree if the refcount_dec_if_one(&p->refcnt) does not
>> succeed. And this wont happen before first inet_putpeer() call
>> for this inetpeer has been done, and ->dtime field is written
>> exactly before the refcount_dec_and_test(&p->refcnt).
>>
>> The KMSAN report was :
>  ...
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> Applied, but it looks like we are just adding assignments simply
> to placate these reports when the tools and facilities cannot
> see through the logic properly.
> 

To be fair, this is because the check on ->dtime should be done a second time after
the refcount_dec_if_one(&p->refcnt)

It is a tiny race, and we do not really care given nature of inetpeer cache, best effort,
and DDOS candidate anyway.

If we purge one entry too soon, this is not a big deal.

I believe tool is fine.

^ permalink raw reply

* Re: [PATCH] slip: Check if rstate is initialized before uncompressing
From: David Miller @ 2018-04-09 15:04 UTC (permalink / raw)
  To: tejaswit; +Cc: netdev
In-Reply-To: <20180409085345.GA5972@tejaswit-linux.qualcomm.com>

From: Tejaswi Tanikella <tejaswit@codeaurora.org>
Date: Mon, 9 Apr 2018 14:23:49 +0530

> @@ -673,6 +677,7 @@ struct slcompress *
>  	if (cs->cs_tcp.doff > 5)
>  	  memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4);
>  	cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
> +	cs->initialized = 1;
>  	/* Put headers back on packet
 ...
>  struct cstate {
>  	byte_t	cs_this;	/* connection id number (xmit) */
> +	byte_t	initialized;	/* non-zero if initialized */

Please use 'bool' and true/false for 'initialized'.

^ permalink raw reply

* Re: [PATCH RESEND v2] vhost-net: set packet weight of tx polling to 2 * vq size
From: David Miller @ 2018-04-09 15:02 UTC (permalink / raw)
  To: haibinzhang
  Cc: mst, jasowang, kvm, virtualization, netdev, linux-kernel,
	lidongchen, yunfangtai
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC4A32@EXMBX-SZMAIL011.tencent.com>

From: haibinzhang(张海斌) <haibinzhang@tencent.com>
Date: Mon, 9 Apr 2018 07:22:17 +0000

> handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> polling udp packets with small length(e.g. 1byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
> 
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
 ...
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v5] net: thunderx: rework mac addresses list to u64 array
From: David Miller @ 2018-04-09 15:00 UTC (permalink / raw)
  To: Vadim.Lomovtsev
  Cc: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel, dnelson, robin.murphy, hch, gustavo,
	Vadim.Lomovtsev
In-Reply-To: <20180409132448.22278-1-Vadim.Lomovtsev@caviumnetworks.com>

From: Vadim Lomovtsev <Vadim.Lomovtsev@caviumnetworks.com>
Date: Mon,  9 Apr 2018 06:24:48 -0700

> From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> 
> It is too expensive to pass u64 values via linked list, instead
> allocate array for them by overall number of mac addresses from netdev.
> 
> This eventually removes multiple kmalloc() calls, aviod memory
> fragmentation and allow to put single null check on kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer
From: David Miller @ 2018-04-09 14:58 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180409134327.22367-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon,  9 Apr 2018 06:43:27 -0700

> syzbot/KMSAN reported that p->dtime was read while it was
> not yet initialized in :
> 
> 	delta = (__u32)jiffies - p->dtime;
> 	if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
> 		gc_stack[i] = NULL;
> 
> This is a false positive, because the inetpeer wont be erased
> from rb-tree if the refcount_dec_if_one(&p->refcnt) does not
> succeed. And this wont happen before first inet_putpeer() call
> for this inetpeer has been done, and ->dtime field is written
> exactly before the refcount_dec_and_test(&p->refcnt).
> 
> The KMSAN report was :
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied, but it looks like we are just adding assignments simply
to placate these reports when the tools and facilities cannot
see through the logic properly.

^ permalink raw reply

* Re: [PATCH 2/2] alx: add disable_wol paramenter
From: David Miller @ 2018-04-09 14:50 UTC (permalink / raw)
  To: andrew; +Cc: acelan.kao, jcliburn, chris.snook, rakesh, netdev, linux-kernel
In-Reply-To: <20180409123910.GB31060@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 9 Apr 2018 14:39:10 +0200

> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
>> The WoL feature was reported broken and will lead to
>> the system resume immediately after suspending.
>> This symptom is not happening on every system, so adding
>> disable_wol option and disable WoL by default to prevent the issue from
>> happening again.
> 
>>  const char alx_drv_name[] = "alx";
>>  
>> +/* disable WoL by default */
>> +bool disable_wol = 1;
>> +module_param(disable_wol, bool, 0);
>> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
>> +
> 
> Hi AceLan
> 
> This seems like you are papering over the cracks. And module
> parameters are not liked.
> 
> Please try to find the real problem.

Agreed.

^ permalink raw reply

* Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
From: Jia-Ju Bai @ 2018-04-09 14:48 UTC (permalink / raw)
  To: Eric Dumazet, davem, kafai, weiwan, dsa, johannes.berg, fw
  Cc: linux-decnet-user, netdev, linux-kernel
In-Reply-To: <945ed4f2-e1e2-a038-2808-f18a79a35732@gmail.com>



On 2018/4/9 22:44, Eric Dumazet wrote:
>
> On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
>> dn_route_init() is never called in atomic context.
>>
>> The call chain ending up at dn_route_init() is:
>> [1] dn_route_init() <- decnet_init()
>> decnet_init() is only set as a parameter of module_init().
>>
>> Despite never getting called from atomic context,
>> dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
>> which waits busily for allocation.
>> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
>> to avoid busy waiting and improve the possibility of sucessful allocation.
>>
>> This is found by a static analysis tool named DCNS written by myself.
>> And I also manually check it.
>>
>> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
>> ---
>>   net/decnet/dn_route.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
>> index 0bd3afd..59ed12a 100644
>> --- a/net/decnet/dn_route.c
>> +++ b/net/decnet/dn_route.c
>> @@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
>>   		while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
>>   			dn_rt_hash_mask--;
>>   		dn_rt_hash_table = (struct dn_rt_hash_bucket *)
>> -			__get_free_pages(GFP_ATOMIC, order);
>> +			__get_free_pages(GFP_KERNEL, order);
>>   	} while (dn_rt_hash_table == NULL && --order > 0);
>>   
>>   	if (!dn_rt_hash_table)
>>
> This might OOM under pressure.

Sorry, I do not understand this.
Could you please explain the reason?

> This would need __GFP_NOWARN | __GFP_NORETRY  I guess, and would target net-next
>

Do you mean
__get_free_pages(__GFP_NOWARN | __GFP_NORETRY) or
__get_free_pages(__GFP_NOWARN | __GFP_NORETRY | GFP_KERNEL)?


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
From: Jia-Ju Bai @ 2018-04-09 14:46 UTC (permalink / raw)
  To: Eric Dumazet, gerrit, davem; +Cc: dccp, netdev, linux-kernel
In-Reply-To: <8216dee8-5e52-421e-026a-4c75d654be54@gmail.com>



On 2018/4/9 22:42, Eric Dumazet wrote:
>
> On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
>> dccp_init() is never called in atomic context.
>> This function is only set as a parameter of module_init().
>>
>> Despite never getting called from atomic context,
>> dccp_init() calls __get_free_pages() with GFP_ATOMIC,
>> which waits busily for allocation.
> What do you mean by "waits busily" ?
>
> GFP_ATOMIC does not sleep, does not wait.

Sorry, I should modify it to "does not sleep".
Do you think it is okay?


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
From: Eric Dumazet @ 2018-04-09 14:44 UTC (permalink / raw)
  To: Jia-Ju Bai, davem, kafai, weiwan, dsa, johannes.berg, fw
  Cc: linux-decnet-user, netdev, linux-kernel
In-Reply-To: <1523283035-25639-1-git-send-email-baijiaju1990@gmail.com>



On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
> dn_route_init() is never called in atomic context.
> 
> The call chain ending up at dn_route_init() is:
> [1] dn_route_init() <- decnet_init()
> decnet_init() is only set as a parameter of module_init().
> 
> Despite never getting called from atomic context,
> dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
> which waits busily for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> to avoid busy waiting and improve the possibility of sucessful allocation.
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---
>  net/decnet/dn_route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
> index 0bd3afd..59ed12a 100644
> --- a/net/decnet/dn_route.c
> +++ b/net/decnet/dn_route.c
> @@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
>  		while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
>  			dn_rt_hash_mask--;
>  		dn_rt_hash_table = (struct dn_rt_hash_bucket *)
> -			__get_free_pages(GFP_ATOMIC, order);
> +			__get_free_pages(GFP_KERNEL, order);
>  	} while (dn_rt_hash_table == NULL && --order > 0);
>  
>  	if (!dn_rt_hash_table)
> 

This might OOM under pressure.

This would need __GFP_NOWARN | __GFP_NORETRY  I guess, and would target net-next

^ permalink raw reply

* Re: WARNING in ip_rt_bug
From: David Miller @ 2018-04-09 14:43 UTC (permalink / raw)
  To: dvyukov
  Cc: syzbot+b09ac67a2af842b12eab, kuznet, linux-kernel, netdev,
	syzkaller-bugs, yoshfuji, edumazet
In-Reply-To: <CACT4Y+YfM=SL3-cgeJ_7JCvAwX2KsyMWV=HHpGfLV=-RKo7TQQ@mail.gmail.com>

From: Dmitry Vyukov <dvyukov@google.com>
Date: Mon, 9 Apr 2018 08:06:20 +0200

> +Eric said that perhaps we just need to revert:
> 
> commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc
> Date:   Sat May 21 07:16:42 2011 +0000
>     ipv4: Give backtrace in ip_rt_bug().

And I replied to him that we shouldn't.

Reverting makes the backtrace, and all the useful debugging
information, go away.  It won't fix the actual bug, which seems
to be that ICMP's route lookup tried to use an input route
for sending a packet.

^ permalink raw reply

* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Arnd Bergmann @ 2018-04-09 14:43 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, Networking, Linux Kernel Mailing List
In-Reply-To: <20180409143741.dzooqrzcgi2bnvd5@salvia>

On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Arnd,
>
> On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
>
> I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> and CONFIG_NF_REJECT_IPV6=m.
>
> I mean, just like we do with NFT_FIB_INET.

That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
again, so that code gets built as a loadable module if
CONFIG_NF_REJECT_IPV6=m.

> BTW, I think this problem has been is not related to the recent patch,
> but something older that kbuild robot has triggered more easily for
> some reason?

02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
restricted to a loadable module with IPV6=m, but can now be
built-in, which causes that link error.

      Arnd

^ permalink raw reply

* Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
From: Eric Dumazet @ 2018-04-09 14:42 UTC (permalink / raw)
  To: Jia-Ju Bai, gerrit, davem; +Cc: dccp, netdev, linux-kernel
In-Reply-To: <1523283004-25581-1-git-send-email-baijiaju1990@gmail.com>



On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
> dccp_init() is never called in atomic context.
> This function is only set as a parameter of module_init().
> 
> Despite never getting called from atomic context,
> dccp_init() calls __get_free_pages() with GFP_ATOMIC,
> which waits busily for allocation.

What do you mean by "waits busily" ?

GFP_ATOMIC does not sleep, does not wait.

> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> to avoid busy waiting and improve the possibility of sucessful allocation.
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---

^ permalink raw reply

* Re: [PATCH net] arp: fix arp_filter on l3slave devices
From: David Ahern @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Sasha Levin, Miguel Fadon Perlines, netdev@vger.kernel.org
  Cc: stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB1032EF9C8A7A3A1356D870CFFBBF0@DM5PR2101MB1032.namprd21.prod.outlook.com>

On 4/8/18 9:36 PM, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed by the -stable helper bot and determined
> to be a high probability candidate for -stable trees. (score: 33.5930)
> 
> The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, v4.4.126.
> 
> v4.16: Build OK!
> v4.15.15: Build OK!
> v4.14.32: Build OK!
> v4.9.92: Build OK!
> v4.4.126: Build OK!

All of those would be good for this patch. Thanks

^ permalink raw reply

* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Pablo Neira Ayuso @ 2018-04-09 14:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <20180409105322.2247296-1-arnd@arndb.de>

[-- Attachment #1: Type: text/plain, Size: 549 bytes --]

Hi Arnd,

On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m

I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
and CONFIG_NF_REJECT_IPV6=m.

I mean, just like we do with NFT_FIB_INET.

BTW, I think this problem has been is not related to the recent patch,
but something older that kbuild robot has triggered more easily for
some reason?

Thanks for your patch!

[-- Attachment #2: x.patch --]
[-- Type: text/x-diff, Size: 358 bytes --]

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index d3220b43c832..b48c57bb9aaf 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -601,7 +601,8 @@ config NFT_REJECT
 
 config NFT_REJECT_INET
 	depends on NF_TABLES_INET
-	default NFT_REJECT
+	depends on NFT_REJECT_IPV4
+	depends on NFT_REJECT_IPV6
 	tristate
 
 config NFT_COMPAT

^ permalink raw reply related

* Re: Enable and configure storm prevention in a network device
From: Murali Karicheri @ 2018-04-09 14:33 UTC (permalink / raw)
  To: Florian Fainelli, David Miller; +Cc: netdev, andrew
In-Reply-To: <75eb56a4-ded2-5ed5-116c-776312f93cf3@gmail.com>

On 04/05/2018 06:35 PM, Florian Fainelli wrote:
> On 04/05/2018 01:20 PM, David Miller wrote:
>> From: Murali Karicheri <m-karicheri2@ti.com>
>> Date: Thu, 5 Apr 2018 16:14:49 -0400
>>
>>> Is there a standard way to implement and configure storm prevention
>>> in a Linux network device?
>>
>> What kind of "storm", an interrupt storm?
>>
> 
> I would assume Murali is referring to L2 broadcast storms which is
> common in switches. There is not an API for that AFAICT and I am not
> sure what a proper API would look like.
> 
Thanks Florian for adding more details. Yes, I am referring to
L2 broadcast or multicast storm. At this point I don't see a reason
why to exclude unicast storm as well if the hardware has the capability.


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.
From: Richard Cochran @ 2018-04-09 14:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Andrew Lunn, David Miller, Florian Fainelli,
	Vivien Didelot
In-Reply-To: <20180409070314.12409-1-richardcochran@gmail.com>

On Mon, Apr 09, 2018 at 12:03:14AM -0700, Richard Cochran wrote:
> This patch fixes the race by moving the queue onto a list on the stack
> before reading out the latched time stamp value.
> 
> Fixes: c6fe0ad2c3499 ("net: dsa: mv88e6xxx: add rx/tx timestamping support")

Dave, please hold off on this patch.  I am seeing new problems in my
testing with this applied.  I still need to get to the bottom of
this.

Thanks,
Richard

^ permalink raw reply

* [PATCH] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create
From: Jia-Ju Bai @ 2018-04-09 14:11 UTC (permalink / raw)
  To: jon.maloy, ying.xue, davem
  Cc: netdev, tipc-discussion, linux-kernel, Jia-Ju Bai

tipc_mon_create() is never called in atomic context.

The call chain ending up at dn_route_init() is:
[1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable()
tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this function
is not called in atomic context.

Despite never getting called from atomic context,
tipc_mon_create() calls kzalloc() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of sucessful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 net/tipc/monitor.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 9e109bb..9714d80 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id)
 	if (tn->monitors[bearer_id])
 		return 0;
 
-	mon = kzalloc(sizeof(*mon), GFP_ATOMIC);
-	self = kzalloc(sizeof(*self), GFP_ATOMIC);
-	dom = kzalloc(sizeof(*dom), GFP_ATOMIC);
+	mon = kzalloc(sizeof(*mon), GFP_KERNEL);
+	self = kzalloc(sizeof(*self), GFP_KERNEL);
+	dom = kzalloc(sizeof(*dom), GFP_KERNEL);
 	if (!mon || !self || !dom) {
 		kfree(mon);
 		kfree(self);
-- 
1.9.1

^ permalink raw reply related

* [PATCH] net: nsci: Replace GFP_ATOMIC with GFP_KERNEL in ncsi_register_dev
From: Jia-Ju Bai @ 2018-04-09 14:11 UTC (permalink / raw)
  To: davem, arnd, sam; +Cc: netdev, linux-kernel, Jia-Ju Bai

ncsi_register_dev() is never called in atomic context.
This function is only called by ftgmac100_probe() in 
drivers/net/ethernet/faraday/ftgmac100.c.
And ftgmac100_probe() is only set as ".probe" in "struct platform_driver".

Despite never getting called from atomic context,
ncsi_register_dev() calls kzalloc() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of sucessful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 net/ncsi/ncsi-manage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3fd3c39..6b5b5a0 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1508,7 +1508,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
 		return nd;
 
 	/* Create NCSI device */
-	ndp = kzalloc(sizeof(*ndp), GFP_ATOMIC);
+	ndp = kzalloc(sizeof(*ndp), GFP_KERNEL);
 	if (!ndp)
 		return NULL;
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
From: Jia-Ju Bai @ 2018-04-09 14:10 UTC (permalink / raw)
  To: davem, kafai, weiwan, dsa, johannes.berg, fw
  Cc: linux-decnet-user, netdev, linux-kernel, Jia-Ju Bai

dn_route_init() is never called in atomic context.

The call chain ending up at dn_route_init() is:
[1] dn_route_init() <- decnet_init()
decnet_init() is only set as a parameter of module_init().

Despite never getting called from atomic context,
dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of sucessful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 net/decnet/dn_route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd..59ed12a 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
 		while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
 			dn_rt_hash_mask--;
 		dn_rt_hash_table = (struct dn_rt_hash_bucket *)
-			__get_free_pages(GFP_ATOMIC, order);
+			__get_free_pages(GFP_KERNEL, order);
 	} while (dn_rt_hash_table == NULL && --order > 0);
 
 	if (!dn_rt_hash_table)
-- 
1.9.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox