* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Yonghong Song @ 2018-04-10 16:58 UTC (permalink / raw)
To: Quentin Monnet, daniel, ast
Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
Josef Bacik, Andrey Ignatov
In-Reply-To: <20180410144157.4831-8-quentin.monnet@netronome.com>
On 4/10/18 7:41 AM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
>
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
>
> This patch contains descriptions for the following helper functions:
>
> Helpers from Lawrence:
> - bpf_setsockopt()
> - bpf_getsockopt()
> - bpf_sock_ops_cb_flags_set()
>
> Helpers from Yonghong:
> - bpf_perf_event_read_value()
> - bpf_perf_prog_read_value()
>
> Helper from Josef:
> - bpf_override_return()
>
> Helper from Andrey:
> - bpf_bind()
>
> Cc: Lawrence Brakmo <brakmo@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Andrey Ignatov <rdna@fb.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
> include/uapi/linux/bpf.h | 184 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 184 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 15d9ccafebbe..7343af4196c8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1208,6 +1208,28 @@ union bpf_attr {
> * Return
> * 0
> *
> + * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * Description
> + * Emulate a call to **setsockopt()** on the socket associated to
> + * *bpf_socket*, which must be a full socket. The *level* at
> + * which the option resides and the name *optname* of the option
> + * must be specified, see **setsockopt(2)** for more information.
> + * The option value of length *optlen* is pointed by *optval*.
> + *
> + * This helper actually implements a subset of **setsockopt()**.
> + * It supports the following *level*\ s:
> + *
> + * * **SOL_SOCKET**, which supports the following *optname*\ s:
> + * **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
> + * **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**.
> + * * **IPPROTO_TCP**, which supports the following *optname*\ s:
> + * **TCP_CONGESTION**, **TCP_BPF_IW**,
> + * **TCP_BPF_SNDCWND_CLAMP**.
> + * * **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * Return
> + * 0 on success, or a negative error in case of failure.
> + *
> * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
> * Description
> * Grow or shrink the room for data in the packet associated to
> @@ -1255,6 +1277,168 @@ union bpf_attr {
> * performed again.
> * Return
> * 0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
> + * Description
> + * Read the value of a perf event counter, and store it into *buf*
> + * of size *buf_size*. This helper relies on a *map* of type
> + * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
> + * event counter is selected at the creation of the *map*. The
The nature of the perf event counter is selected when *map* is updated
with perf_event fd's.
> + * *map* is an array whose size is the number of available CPU
> + * cores, and each cell contains a value relative to one core. The
It is confusing to mix core/cpu here. Maybe just use perf_event
convention, always using cpu?
> + * value to retrieve is indicated by *flags*, that contains the
> + * index of the core to look up, masked with
> + * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
> + * **BPF_F_CURRENT_CPU** to indicate that the value for the
> + * current CPU core should be retrieved.
> + *
> + * This helper behaves in a way close to
> + * **bpf_perf_event_read**\ () helper, save that instead of
> + * just returning the value observed, it fills the *buf*
> + * structure. This allows for additional data to be retrieved: in
> + * particular, the enabled and running times (in *buf*\
> + * **->enabled** and *buf*\ **->running**, respectively) are
> + * copied.
> + *
> + * These values are interesting, because hardware PMU (Performance
> + * Monitoring Unit) counters are limited resources. When there are
> + * more PMU based perf events opened than available counters,
> + * kernel will multiplex these events so each event gets certain
> + * percentage (but not all) of the PMU time. In case that
> + * multiplexing happens, the number of samples or counter value
> + * will not reflect the case compared to when no multiplexing
> + * occurs. This makes comparison between different runs difficult.
> + * Typically, the counter value should be normalized before
> + * comparing to other experiments. The usual normalization is done
> + * as follows.
> + *
> + * ::
> + *
> + * normalized_counter = counter * t_enabled / t_running
> + *
> + * Where t_enabled is the time enabled for event and t_running is
> + * the time running for event since last normalization. The
> + * enabled and running times are accumulated since the perf event
> + * open. To achieve scaling factor between two invocations of an
> + * eBPF program, users can can use CPU id as the key (which is
> + * typical for perf array usage model) to remember the previous
> + * value and do the calculation inside the eBPF program.
> + * Return
> + * 0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
> + * Description
> + * For en eBPF program attached to a perf event, retrieve the
> + * value of the event counter associated to *ctx* and store it in
> + * the structure pointed by *buf* and of size *buf_size*. Enabled
> + * and running times are also stored in the structure (see
> + * description of helper **bpf_perf_event_read_value**\ () for
> + * more details).
> + * Return
> + * 0 on success, or a negative error in case of failure.
> + *
> + * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * Description
> + * Emulate a call to **getsockopt()** on the socket associated to
> + * *bpf_socket*, which must be a full socket. The *level* at
> + * which the option resides and the name *optname* of the option
> + * must be specified, see **getsockopt(2)** for more information.
> + * The retrieved value is stored in the structure pointed by
> + * *opval* and of length *optlen*.
> + *
> + * This helper actually implements a subset of **getsockopt()**.
> + * It supports the following *level*\ s:
> + *
> + * * **IPPROTO_TCP**, which supports *optname*
> + * **TCP_CONGESTION**.
> + * * **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * Return
> + * 0 on success, or a negative error in case of failure.
> + *
> + * int bpf_override_return(struct pt_reg *regs, u64 rc)
> + * Description
> + * Used for error injection, this helper uses kprobes to override
> + * the return value of the probed function, and to set it to *rc*.
> + * The first argument is the context *regs* on which the kprobe
> + * works.
> + *
> + * This helper works by setting setting the PC (program counter)
> + * to an override function which is run in place of the original
> + * probed function. This means the probed function is not run at
> + * all. The replacement function just returns with the required
> + * value.
> + *
> + * This helper has security implications, and thus is subject to
> + * restrictions. It is only available if the kernel was compiled
> + * with the **CONFIG_BPF_KPROBE_OVERRIDE** configuration
> + * option, and in this case it only works on functions tagged with
> + * **ALLOW_ERROR_INJECTION** in the kernel code.
> + *
> + * Also, the helper is only available for the architectures having
> + * the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing,
> + * x86 architecture is the only one to support this feature.
> + * Return
> + * 0
> + *
> + * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops_kern *bpf_sock, int argval)
> + * Description
> + * Attempt to set the value of the **bpf_sock_ops_cb_flags** field
> + * for the full TCP socket associated to *bpf_sock_ops* to
> + * *argval*.
> + *
> + * The primary use of this field is to determine if there should
> + * be calls to eBPF programs of type
> + * **BPF_PROG_TYPE_SOCK_OPS** at various points in the TCP
> + * code. A program of the same type can change its value, per
> + * connection and as necessary, when the connection is
> + * established. This field is directly accessible for reading, but
> + * this helper must be used for updates in order to return an
> + * error if an eBPF program tries to set a callback that is not
> + * supported in the current kernel.
> + *
> + * The supported callback values that *argval* can combine are:
> + *
> + * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
> + * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
> + * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
> + *
> + * Here are some examples of where one could call such eBPF
> + * program:
> + *
> + * * When RTO fires.
> + * * When a packet is retransmitted.
> + * * When the connection terminates.
> + * * When a packet is sent.
> + * * When a packet is received.
> + * Return
> + * Code **-EINVAL** if the socket is not a full TCP socket;
> + * otherwise, a positive number containing the bits that could not
> + * be set is returned (which comes down to 0 if all bits were set
> + * as required).
> + *
> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
> + * Description
> + * Bind the socket associated to *ctx* to the address pointed by
> + * *addr*, of length *addr_len*. This allows for making outgoing
> + * connection from the desired IP address, which can be useful for
> + * example when all processes inside a cgroup should use one
> + * single IP address on a host that has multiple IP configured.
> + *
> + * This helper works for IPv4 and IPv6, TCP and UDP sockets. The
> + * domain (*addr*\ **->sa_family**) must be **AF_INET** (or
> + * **AF_INET6**). Looking for a free port to bind to can be
> + * expensive, therefore binding to port is not permitted by the
> + * helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
> + * must be set to zero.
> + *
> + * As for the remote end, both parts of it can be overridden,
> + * remote IP and remote port. This can be useful if an application
> + * inside a cgroup wants to connect to another application inside
> + * the same cgroup or to itself, but knows nothing about the IP
> + * address assigned to the cgroup.
> + * Return
> + * 0 on success, or a negative error in case of failure.
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
>
^ permalink raw reply
* [PATCH bpf v4] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-10 16:37 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
======================================================
WARNING: possible circular locking dependency detected
4.16.0-rc7+ #3 Not tainted
------------------------------------------------------
syz-executor7/24531 is trying to acquire lock:
(bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
but task is already holding lock:
(&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++}:
__might_fault+0x13a/0x1d0 mm/memory.c:4571
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
_perf_ioctl kernel/events/core.c:4750 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> #0 (bpf_event_mutex){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
_free_event+0xbdb/0x10f0 kernel/events/core.c:4116
put_event+0x24/0x30 kernel/events/core.c:4204
perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
remove_vma+0xb4/0x1b0 mm/mmap.c:172
remove_vma_list mm/mmap.c:2490 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2731
mmap_region+0x59e/0x15a0 mm/mmap.c:1646
do_mmap+0x6c0/0xe00 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2223 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
*** DEADLOCK ***
======================================================
The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.
As suggested by Daniel, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.
Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/bpf.h | 4 ++--
kernel/bpf/core.c | 45 +++++++++++++++++++++++++++++----------------
kernel/trace/bpf_trace.c | 25 +++++++++++++++++++++----
3 files changed, 52 insertions(+), 22 deletions(-)
Changelog:
v3 -> v4:
. Add comments to clarify that ZERO_SIZE_PTR is handled properly,
suggested by Daniel.
v2 -> v3:
. Remove the redundant label in function perf_event_query_prog_array.
v1 -> v2:
. Use the common core function for prog_id copying for two
different prog_array copy routines, suggested by Alexei.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt);
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt);
int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..ba03ec3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1572,13 +1572,32 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs)
return cnt;
}
+static bool bpf_prog_array_copy_core(struct bpf_prog **prog,
+ u32 *prog_ids,
+ u32 request_cnt)
+{
+ int i = 0;
+
+ for (; *prog; prog++) {
+ if (*prog == &dummy_bpf_prog.prog)
+ continue;
+ prog_ids[i] = (*prog)->aux->id;
+ if (++i == request_cnt) {
+ prog++;
+ break;
+ }
+ }
+
+ return !!(*prog);
+}
+
int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
__u32 __user *prog_ids, u32 cnt)
{
struct bpf_prog **prog;
unsigned long err = 0;
- u32 i = 0, *ids;
bool nospc;
+ u32 *ids;
/* users of this function are doing:
* cnt = bpf_prog_array_length();
@@ -1595,16 +1614,7 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
return -ENOMEM;
rcu_read_lock();
prog = rcu_dereference(progs)->progs;
- for (; *prog; prog++) {
- if (*prog == &dummy_bpf_prog.prog)
- continue;
- ids[i] = (*prog)->aux->id;
- if (++i == cnt) {
- prog++;
- break;
- }
- }
- nospc = !!(*prog);
+ nospc = bpf_prog_array_copy_core(prog, ids, cnt);
rcu_read_unlock();
err = copy_to_user(prog_ids, ids, cnt * sizeof(u32));
kfree(ids);
@@ -1683,22 +1693,25 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
}
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt)
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt)
{
+ struct bpf_prog **prog;
u32 cnt = 0;
if (array)
cnt = bpf_prog_array_length(array);
- if (copy_to_user(prog_cnt, &cnt, sizeof(cnt)))
- return -EFAULT;
+ *prog_cnt = cnt;
/* return early if user requested only program count or nothing to copy */
if (!request_cnt || !cnt)
return 0;
- return bpf_prog_array_copy_to_user(array, prog_ids, request_cnt);
+ /* this function is called under trace/bpf_trace.c: bpf_event_mutex */
+ prog = rcu_dereference_check(array, 1)->progs;
+ return bpf_prog_array_copy_core(prog, prog_ids, request_cnt) ? -ENOSPC
+ : 0;
}
static void bpf_prog_free_deferred(struct work_struct *work)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..56ba0f2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
{
struct perf_event_query_bpf __user *uquery = info;
struct perf_event_query_bpf query = {};
+ u32 *ids, prog_cnt, ids_len;
int ret;
if (!capable(CAP_SYS_ADMIN))
@@ -985,16 +986,32 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
return -EINVAL;
if (copy_from_user(&query, uquery, sizeof(query)))
return -EFAULT;
- if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+ ids_len = query.ids_len;
+ if (ids_len > BPF_TRACE_MAX_PROGS)
return -E2BIG;
+ ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+ if (!ids)
+ return -ENOMEM;
+ /*
+ * The above kcalloc returns ZERO_SIZE_PTR when ids_len = 0, which
+ * is required when user only wants to check for uquery->prog_cnt.
+ * There is no need to check for it since the case is handled
+ * gracefully in bpf_prog_array_copy_info.
+ */
mutex_lock(&bpf_event_mutex);
ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
- uquery->ids,
- query.ids_len,
- &uquery->prog_cnt);
+ ids,
+ ids_len,
+ &prog_cnt);
mutex_unlock(&bpf_event_mutex);
+ if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+ copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
+ ret = -EFAULT;
+
+ kfree(ids);
return ret;
}
--
2.9.5
^ permalink raw reply related
* Re: [PATCH] make net_gso_ok return false when gso_type is zero(invalid)
From: Marcelo Ricardo Leitner @ 2018-04-10 16:32 UTC (permalink / raw)
To: Wenhua Shi; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <CAN6D2npQd9AowLCz5CXGNPPib+10nABh1dOvLWV15r1z6FvF8w@mail.gmail.com>
On Sun, Apr 08, 2018 at 08:41:21PM +0200, Wenhua Shi wrote:
> 2018-04-08 18:51 GMT+02:00 David Miller <davem@davemloft.net>:
> >
> > From: Wenhua Shi <march511@gmail.com>
> > Date: Fri, 6 Apr 2018 03:43:39 +0200
> >
> > > Signed-off-by: Wenhua Shi <march511@gmail.com>
> >
> > This precondition should be made impossible instead of having to do
> > an extra check everywhere that this helper is invoked, many of which
> > are in fast paths.
>
> I believe the precondition you said is quite true. In my situation, I
> have to disable GSO for some packet and I notice that it leads to a
> worse performance (slower than 1Mbps, was almost 800Mbps).
>
> Here's the hook I use on debian 9.4, kernel version 4.9:
There is quite a distance between 4.9 and net/net-next. Did you test
on a more recent kernel too?
Note that TCP stack now works with GSO being always on.
0a6b2a1dc2a2 ("tcp: switch to GSO being always on")
^ permalink raw reply
* Re: [PATCH bpf v3] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-10 16:04 UTC (permalink / raw)
To: Daniel Borkmann, ast, netdev; +Cc: kernel-team
In-Reply-To: <1469ff7e-4fe4-b25f-25f1-273fdd3d68c2@iogearbox.net>
On 4/10/18 1:54 AM, Daniel Borkmann wrote:
> On 04/10/2018 09:21 AM, Yonghong Song wrote:
>> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
>> The error details:
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 4.16.0-rc7+ #3 Not tainted
>> ------------------------------------------------------
>> syz-executor7/24531 is trying to acquire lock:
>> (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
>>
>> but task is already holding lock:
>> (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&mm->mmap_sem){++++}:
>> __might_fault+0x13a/0x1d0 mm/memory.c:4571
>> _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
>> copy_to_user include/linux/uaccess.h:155 [inline]
>> bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
>> perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
>> _perf_ioctl kernel/events/core.c:4750 [inline]
>> perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
>> vfs_ioctl fs/ioctl.c:46 [inline]
>> do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>> SYSC_ioctl fs/ioctl.c:701 [inline]
>> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> -> #0 (bpf_event_mutex){+.+.}:
>> lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>> __mutex_lock_common kernel/locking/mutex.c:756 [inline]
>> __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
>> mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>> perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
>> perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
>> _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
>> put_event+0x24/0x30 kernel/events/core.c:4204
>> perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
>> remove_vma+0xb4/0x1b0 mm/mmap.c:172
>> remove_vma_list mm/mmap.c:2490 [inline]
>> do_munmap+0x82a/0xdf0 mm/mmap.c:2731
>> mmap_region+0x59e/0x15a0 mm/mmap.c:1646
>> do_mmap+0x6c0/0xe00 mm/mmap.c:1483
>> do_mmap_pgoff include/linux/mm.h:2223 [inline]
>> vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
>> SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
>> SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
>> SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
>> SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
>> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> other info that might help us debug this:
>>
>> Possible unsafe locking scenario:
>>
>> CPU0 CPU1
>> ---- ----
>> lock(&mm->mmap_sem);
>> lock(bpf_event_mutex);
>> lock(&mm->mmap_sem);
>> lock(bpf_event_mutex);
>>
>> *** DEADLOCK ***
>> ======================================================
>>
>> The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
>> user space to query prog array on the same tp") where copy_to_user,
>> which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
>> At the same time, during perf_event file descriptor close,
>> mm->mmap_sem is held first and then subsequent
>> perf_event_detach_bpf_prog needs bpf_event_mutex lock.
>> Such a senario caused a deadlock.
>>
>> As suggested by Danial, moving copy_to_user out of the
>
> Nit: typo :)
Oh, my bad. will correct.
>
>> bpf_event_mutex lock should fix the problem.
>>
>> Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
>> Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> include/linux/bpf.h | 4 ++--
>> kernel/bpf/core.c | 45 +++++++++++++++++++++++++++++----------------
>> kernel/trace/bpf_trace.c | 19 +++++++++++++++----
>> 3 files changed, 46 insertions(+), 22 deletions(-)
>>
>> Changelog:
>> v2 -> v3:
>> . Remove the redundant label in function perf_event_query_prog_array.
>> v1 -> v2:
>> . Use the common core function for prog_id copying for two
>> different prog_array copy routines, suggested by Alexei.
>>
> [...]
>> static void bpf_prog_free_deferred(struct work_struct *work)
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index d88e96d..f505d43 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>> {
>> struct perf_event_query_bpf __user *uquery = info;
>> struct perf_event_query_bpf query = {};
>> + u32 *ids, prog_cnt, ids_len;
>> int ret;
>>
>> if (!capable(CAP_SYS_ADMIN))
>> @@ -985,16 +986,26 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>> return -EINVAL;
>> if (copy_from_user(&query, uquery, sizeof(query)))
>> return -EFAULT;
>> - if (query.ids_len > BPF_TRACE_MAX_PROGS)
>> +
>> + ids_len = query.ids_len;
>> + if (ids_len > BPF_TRACE_MAX_PROGS)
>> return -E2BIG;
>> + ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
>> + if (!ids)
>> + return -ENOMEM;
>
> Fix looks good to me, but could you still add a comment stating that we don't
> need to check for ZERO_SIZE_PTR above since we handle this gracefully in the
> bpf_prog_array_copy_info() plus it's also required for the case where the user
> only wants to check for the uquery->prog_cnt, but nothing else.
Will add the comment and send another version.
Thanks.
>
>> mutex_lock(&bpf_event_mutex);
>> ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
>> - uquery->ids,
>> - query.ids_len,
>> - &uquery->prog_cnt);
>> + ids,
>> + ids_len,
>> + &prog_cnt);
>> mutex_unlock(&bpf_event_mutex);
>>
>> + if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
>> + copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
>> + ret = -EFAULT;
>> +
>> + kfree(ids);
>> return ret;
>> }
>
> Thanks,
> Daniel
>
^ permalink raw reply
* Re: [RFC] connector: add group_exit_code and signal_flags fields to exit_proc_event
From: Stefan Strogin @ 2018-04-10 16:01 UTC (permalink / raw)
To: Evgeniy Polyakov, matthltc@us.ibm.com
Cc: netdev@vger.kernel.org, xe-linux-external@cisco.com,
Victor Kamensky, Taras Kondratiuk, Ruslan Bilovol
In-Reply-To: <8933031523231387@web29j.yandex.ru>
Hi Evgeniy,
On 09/04/18 02:49, Evgeniy Polyakov wrote:
> Hi everyone
>
> Sorry for that late reply
>
> 01.03.2018, 21:58, "Stefan Strogin" <sstrogin@cisco.com>:
>> So I was thinking to add these two fields to union event_data:
>> task->signal->group_exit_code
>> task->signal->flags
>> This won't increase size of struct proc_event (because of comm_proc_event)
>> and shouldn't break backward compatibility for the user-space. But it will
>> add some useful information about what caused the process death.
>> What do you think, is it an acceptable approach?
>
> As I saw in other discussion, doesn't it break userspace API, or you are sure that no sizes has been increased?
> You are using the same structure as used for plain signals and add group status there, how will userspace react,
> if it was compiled with older headers? What if it uses zero-field alignment, i.e. allocating exactly the size of structure with byte precision?
>
Please ignore this RFC, I was wrong about the fields I need for the problem.
I have sent this patch: https://lkml.org/lkml/2018/3/29/531,
would be grateful for a review.
As for breaking UAPI and structure sizes, look:
> struct proc_event {
...
> union { /* must be last field of proc_event struct */
...
> struct comm_proc_event {
> __kernel_pid_t process_pid;
> __kernel_pid_t process_tgid;
> char comm[16];
> } comm;
__kernel_pid_t is int that is always 4 bytes in Linux, then
sizeof(event_data.comm) == 24 on all platforms.
>
> struct coredump_proc_event {
> __kernel_pid_t process_pid;
> __kernel_pid_t process_tgid;
> + __kernel_pid_t parent_pid;
> + __kernel_pid_t parent_tgid;
> } coredump;
sizeof(event_data.coredump) == 16
>
> struct exit_proc_event {
> __kernel_pid_t process_pid;
> __kernel_pid_t process_tgid;
> __u32 exit_code, exit_signal;
> + __kernel_pid_t parent_pid;
> + __kernel_pid_t parent_tgid;
> } exit;
sizeof(event_data.exit) == 24
>
> } event_data;
> };
Therefore, sizeof(event_data) is always = 24 - with old headers and new ones.
sizeof(struct proc_event) is the same as well.
Hence user-space software with old and new headers will allocate the same size.
If the user-space program somehow allocates space only for an internal structure,
e.g. for event_data.exit, I still don't see any problems if it allocates and
handles only first 16 bytes of the structure using old headers.
^ permalink raw reply
* Re: [PATCH net] sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
From: Marcelo Ricardo Leitner @ 2018-04-10 16:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, Eric Dumazet, Vlad Yasevich,
Neil Horman
In-Reply-To: <20180408145208.228542-1-edumazet@google.com>
On Sun, Apr 08, 2018 at 07:52:08AM -0700, Eric Dumazet wrote:
> Check must happen before call to ipv6_addr_v4mapped()
Please don't forget to also Cc linux-sctp@vger.kernel.org on sctp
patches.
Thanks!
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:59 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <20180410154304.GG2028@nanopsycho>
On 4/10/2018 8:43 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>>>> + struct net_device *child_netdev)
>>>>>>>>>> +{
>>>>>>>>>> + struct virtnet_bypass_info *vbi;
>>>>>>>>>> + bool backup;
>>>>>>>>>> +
>>>>>>>>>> + vbi = netdev_priv(bypass_netdev);
>>>>>>>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>>>>>>>> + netdev_info(bypass_netdev,
>>>>>>>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>>>>>>>> + child_netdev->name, backup ? "backup" : "active");
>>>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>>>> enslaved and refuse right there.
>>>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>>>> for 3 netdev scenario.
>>>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>>>> between 2netdev and3 netdev model is this:
>>>>>>> 2netdev:
>>>>>>> bypass_master
>>>>>>> /
>>>>>>> /
>>>>>>> VF_slave
>>>>>>>
>>>>>>> 3netdev:
>>>>>>> bypass_master
>>>>>>> / \
>>>>>>> / \
>>>>>>> VF_slave backup_slave
>>>>>>>
>>>>>>> Is that correct? If not, how does it look like?
>>>>>>>
>>>>>>>
>>>>>> Looks correct.
>>>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>>>> marked as the 2 slaves of this new netdev.
>>>>> You say it looks correct and in another sentence you provide completely
>>>>> different description. Could you please look again?
>>>>>
>>>> To be exact, 2 netdev model with netvsc looks like this.
>>>>
>>>> netvsc_netdev
>>>> /
>>>> /
>>>> VF_slave
>>>>
>>>> With virtio_net, 3 netdev model
>>>>
>>>> bypass_netdev
>>>> / \
>>>> / \
>>>> VF_slave virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>> bypass_netdev
>> / \
>> / \
>> VF_slave virtio_net netdev (original)
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
> netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
> configured on it (well you could, but the rx_handler would eat every
> incoming packet). So you will break the user bacause he would have to
> move the configuration to the new master device.
> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.
Forgot to mention that bypass_netdev takes over the name of the original
netdev and
virtio_net netdev will get the backup name.
So the userspace network configuration doesn't need to change.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2018-04-10 15:55 UTC (permalink / raw)
To: Jiri Pirko
Cc: Samudrala, Sridhar, Michael S. Tsirkin, Stephen Hemminger,
David Miller, Netdev, virtualization, virtio-dev,
Brandeburg, Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180410154304.GG2028@nanopsycho>
On Tue, Apr 10, 2018 at 8:43 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>> > > > > [...]
>>> > > > >
>>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>> > > > > > > > + struct net_device *child_netdev)
>>> > > > > > > > +{
>>> > > > > > > > + struct virtnet_bypass_info *vbi;
>>> > > > > > > > + bool backup;
>>> > > > > > > > +
>>> > > > > > > > + vbi = netdev_priv(bypass_netdev);
>>> > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>> > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>> > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>>> > > > > > > > + netdev_info(bypass_netdev,
>>> > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>>> > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>>> > > > > > > Bypass module should check if there is already some other netdev
>>> > > > > > > enslaved and refuse right there.
>>> > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>> > > > > > as its bypass_netdev is same as the backup_netdev.
>>> > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>> > > > > > for 3 netdev scenario.
>>> > > > > Just let me undestand it clearly. What I expect the difference would be
>>> > > > > between 2netdev and3 netdev model is this:
>>> > > > > 2netdev:
>>> > > > > bypass_master
>>> > > > > /
>>> > > > > /
>>> > > > > VF_slave
>>> > > > >
>>> > > > > 3netdev:
>>> > > > > bypass_master
>>> > > > > / \
>>> > > > > / \
>>> > > > > VF_slave backup_slave
>>> > > > >
>>> > > > > Is that correct? If not, how does it look like?
>>> > > > >
>>> > > > >
>>> > > > Looks correct.
>>> > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>> > > > marked as the 2 slaves of this new netdev.
>>> > > You say it looks correct and in another sentence you provide completely
>>> > > different description. Could you please look again?
>>> > >
>>> > To be exact, 2 netdev model with netvsc looks like this.
>>> >
>>> > netvsc_netdev
>>> > /
>>> > /
>>> > VF_slave
>>> >
>>> > With virtio_net, 3 netdev model
>>> >
>>> > bypass_netdev
>>> > / \
>>> > / \
>>> > VF_slave virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>>
>> bypass_netdev
>> / \
>> / \
>>VF_slave virtio_net netdev (original)
>
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
> netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
> configured on it (well you could, but the rx_handler would eat every
> incoming packet). So you will break the user bacause he would have to
> move the configuration to the new master device.
That's exactly the point why I need to hide the lower netdev slaves
and trying the align the naming of the bypass with where IP was
configured on the original netdev. The current 3-netdev model is not
"transparent" by any means.
-Siwei
> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 15:43 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <d08ed1e6-678a-7b34-cd2e-52788ceec919@intel.com>
Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > [...]
>> > > > >
>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > > > > > + struct net_device *child_netdev)
>> > > > > > > > +{
>> > > > > > > > + struct virtnet_bypass_info *vbi;
>> > > > > > > > + bool backup;
>> > > > > > > > +
>> > > > > > > > + vbi = netdev_priv(bypass_netdev);
>> > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > + netdev_info(bypass_netdev,
>> > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>> > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > enslaved and refuse right there.
>> > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > > > > > for 3 netdev scenario.
>> > > > > Just let me undestand it clearly. What I expect the difference would be
>> > > > > between 2netdev and3 netdev model is this:
>> > > > > 2netdev:
>> > > > > bypass_master
>> > > > > /
>> > > > > /
>> > > > > VF_slave
>> > > > >
>> > > > > 3netdev:
>> > > > > bypass_master
>> > > > > / \
>> > > > > / \
>> > > > > VF_slave backup_slave
>> > > > >
>> > > > > Is that correct? If not, how does it look like?
>> > > > >
>> > > > >
>> > > > Looks correct.
>> > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> > > > marked as the 2 slaves of this new netdev.
>> > > You say it looks correct and in another sentence you provide completely
>> > > different description. Could you please look again?
>> > >
>> > To be exact, 2 netdev model with netvsc looks like this.
>> >
>> > netvsc_netdev
>> > /
>> > /
>> > VF_slave
>> >
>> > With virtio_net, 3 netdev model
>> >
>> > bypass_netdev
>> > / \
>> > / \
>> > VF_slave virtio_net netdev
>> Could you also mark the original netdev which is there now? is it
>> bypass_netdev or virtio_net_netdev ?
>
> bypass_netdev
> / \
> / \
>VF_slave virtio_net netdev (original)
That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
configured on it (well you could, but the rx_handler would eat every
incoming packet). So you will break the user bacause he would have to
move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:27 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <20180410152205.GF2028@nanopsycho>
On 4/10/2018 8:22 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>> [...]
>>>>>
>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>> + struct net_device *child_netdev)
>>>>>>>> +{
>>>>>>>> + struct virtnet_bypass_info *vbi;
>>>>>>>> + bool backup;
>>>>>>>> +
>>>>>>>> + vbi = netdev_priv(bypass_netdev);
>>>>>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>>>>>> + netdev_info(bypass_netdev,
>>>>>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>>>>>> + child_netdev->name, backup ? "backup" : "active");
>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>> enslaved and refuse right there.
>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>> for 3 netdev scenario.
>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>> between 2netdev and3 netdev model is this:
>>>>> 2netdev:
>>>>> bypass_master
>>>>> /
>>>>> /
>>>>> VF_slave
>>>>>
>>>>> 3netdev:
>>>>> bypass_master
>>>>> / \
>>>>> / \
>>>>> VF_slave backup_slave
>>>>>
>>>>> Is that correct? If not, how does it look like?
>>>>>
>>>>>
>>>> Looks correct.
>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>> marked as the 2 slaves of this new netdev.
>>> You say it looks correct and in another sentence you provide completely
>>> different description. Could you please look again?
>>>
>> To be exact, 2 netdev model with netvsc looks like this.
>>
>> netvsc_netdev
>> /
>> /
>> VF_slave
>>
>> With virtio_net, 3 netdev model
>>
>> bypass_netdev
>> / \
>> / \
>> VF_slave virtio_net netdev
> Could you also mark the original netdev which is there now? is it
> bypass_netdev or virtio_net_netdev ?
bypass_netdev
/ \
/ \
VF_slave virtio_net netdev (original)
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 15:22 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <2ccfad76-589d-9dca-7e4b-9bafee41f844@intel.com>
Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> > > [...]
>> > >
>> > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > > > + struct net_device *child_netdev)
>> > > > > > +{
>> > > > > > + struct virtnet_bypass_info *vbi;
>> > > > > > + bool backup;
>> > > > > > +
>> > > > > > + vbi = netdev_priv(bypass_netdev);
>> > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> > > > > > + netdev_info(bypass_netdev,
>> > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> > > > > > + child_netdev->name, backup ? "backup" : "active");
>> > > > > Bypass module should check if there is already some other netdev
>> > > > > enslaved and refuse right there.
>> > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > > > as its bypass_netdev is same as the backup_netdev.
>> > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > > > for 3 netdev scenario.
>> > > Just let me undestand it clearly. What I expect the difference would be
>> > > between 2netdev and3 netdev model is this:
>> > > 2netdev:
>> > > bypass_master
>> > > /
>> > > /
>> > > VF_slave
>> > >
>> > > 3netdev:
>> > > bypass_master
>> > > / \
>> > > / \
>> > > VF_slave backup_slave
>> > >
>> > > Is that correct? If not, how does it look like?
>> > >
>> > >
>> > Looks correct.
>> > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> > marked as the 2 slaves of this new netdev.
>> You say it looks correct and in another sentence you provide completely
>> different description. Could you please look again?
>>
>To be exact, 2 netdev model with netvsc looks like this.
>
> netvsc_netdev
> /
> /
> VF_slave
>
>With virtio_net, 3 netdev model
>
> bypass_netdev
> / \
> / \
>VF_slave virtio_net netdev
Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-10 15:13 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <20180410105504.GA2028@nanopsycho>
On 4/10/2018 3:55 AM, Jiri Pirko wrote:
> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>> [...]
>>>
>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>> + struct net_device *child_netdev)
>>>>>> +{
>>>>>> + struct virtnet_bypass_info *vbi;
>>>>>> + bool backup;
>>>>>> +
>>>>>> + vbi = netdev_priv(bypass_netdev);
>>>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>>>> + netdev_info(bypass_netdev,
>>>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>>>> + child_netdev->name, backup ? "backup" : "active");
>>>>> Bypass module should check if there is already some other netdev
>>>>> enslaved and refuse right there.
>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>> as its bypass_netdev is same as the backup_netdev.
>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>> for 3 netdev scenario.
>>> Just let me undestand it clearly. What I expect the difference would be
>>> between 2netdev and3 netdev model is this:
>>> 2netdev:
>>> bypass_master
>>> /
>>> /
>>> VF_slave
>>>
>>> 3netdev:
>>> bypass_master
>>> / \
>>> / \
>>> VF_slave backup_slave
>>>
>>> Is that correct? If not, how does it look like?
>>>
>>>
>> Looks correct.
>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> marked as the 2 slaves of this new netdev.
> You say it looks correct and in another sentence you provide completely
> different description. Could you please look again?
>
To be exact, 2 netdev model with netvsc looks like this.
netvsc_netdev
/
/
VF_slave
With virtio_net, 3 netdev model
bypass_netdev
/ \
/ \
VF_slave virtio_net netdev
^ permalink raw reply
* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-10 15:04 UTC (permalink / raw)
To: Christian Brauner
Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180410143515.GA14186@gmail.com>
Christian Brauner <christian.brauner@canonical.com> writes:
> On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>>
>> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >>
>> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >> >>>
>> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >> >>>
>> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >>>
>> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >> >>> into all network namespaces.
>> >> >> >>>
>> >> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >>>
>> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >>>
>> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >>> return;
>> >> >> >>>
>> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >>> return;
>> >> >> >>>
>> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >>> CAP_NET_BROADCAST))
>> >> >> >>> j return;
>> >> >> >>> }
>> >> >> >>>
>> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >> >>> identity mapping is established between root on the host and root in the
>> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >>>
>> >> >> >>> sudo unshare -U --map-root
>> >> >> >>> udevadm monitor -k
>> >> >> >>> # Now change to initial user namespace and e.g. do
>> >> >> >>> modprobe kvm
>> >> >> >>> # or
>> >> >> >>> rmmod kvm
>> >> >> >>>
>> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >> >>> with them.
>> >> >> >>>
>> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >> >>> make a decision what uevents should be sent.
>> >> >> >>>
>> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >> >>> event will always be filtered.
>> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >> >>> initial user namespace but was opened from a non-initial user namespace
>> >> >> >>> the event will be filtered as well.
>> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >> >>> anything with interesting devices.
>> >> >> >>>
>> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >> >>> ---
>> >> >> >>> lib/kobject_uevent.c | 10 +++++++++-
>> >> >> >>> 1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> >>>
>> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >> >>> --- a/lib/kobject_uevent.c
>> >> >> >>> +++ b/lib/kobject_uevent.c
>> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >> >>> return sock_ns != ns;
>> >> >> >>> }
>> >> >> >>>
>> >> >> >>> - return 0;
>> >> >> >>> + /*
>> >> >> >>> + * The kobject does not carry a namespace tag so filter by user
>> >> >> >>> + * namespace below.
>> >> >> >>> + */
>> >> >> >>> + if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >> >>> + return 1;
>> >> >> >>> +
>> >> >> >>> + /* Check if socket was opened from non-initial user namespace. */
>> >> >> >>> + return sk_user_ns(dsk) != &init_user_ns;
>> >> >> >>> }
>> >> >> >>> #endif
>> >> >> >>
>> >> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> >> >
>> >> >> > No, this is not correct: As it is right now *without my patch* no
>> >> >> > non-initial user namespace is receiving *any uevents* but those
>> >> >> > specifically namespaced such as those for network devices. This patch
>> >> >> > doesn't change that at all. The commit message outlines this in detail
>> >> >> > how this comes about.
>> >> >> > There is only one case where this currently breaks and this is as I
>> >> >> > outlined explicitly in my commit message when you create a new user
>> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> >>
>> >> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> >> Now it will return 1 sometimes.
>> >> >
>> >> > Oh sure, it's in the commit message though. The callchain is
>> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> >> > net/netlink/af_netlink.c:do_one_broadcast():
>> >> >
>> >> > This codepiece will check whether the openened socket holds
>> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> >> > which it won't because we don't have device namespaces and all devices
>> >> > belong to the initial set of namespaces.
>> >> >
>> >> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> > CAP_NET_BROADCAST))
>> >> > j return;
>> >> >
>> >>
>> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> >> on their socket and has had someone with the appropriate privileges
>> >> assign a peerrnetid.
>> >>
>> >> All of which is to say that file_ns_capable is not nearly as applicable
>> >> as it might be, and if you can pass the other two checks I think it is
>> >> pointless (because the peernet attributes are not generated for
>> >> kobj_uevents) but valid to receive events from outside your network
>> >> namespace.
>> >>
>> >>
>> >> I might be missing something but I don't see anything excluding network
>> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> >> logic.
>> >>
>> >> The uevent_sock_list has one entry per network namespace. And
>> >> kobject_uevent_net_broacast appears to walk each one.
>> >>
>> >> I had a memory of filtering uevent messages and I had a memory
>> >> that the netlink_has_listeners had a per network namespace effect.
>> >> Neither seems true from my inspection of the code tonight.
>> >>
>> >> If we are not filtering ordinary uevents at least at the user namespace
>> >> level that does seem to be at least a nuisance.
>> >>
>> >>
>> >> Christian can you dig a little deeper into this. I have a feeling that
>> >> there are some real efficiency improvements that we could make to
>> >> kobject_uevent_net_broadcast if nothing else.
>> >>
>> >> Perhaps you could see where uevents are broadcast by poking
>> >> the sysfs uevent of an existing device, and triggering a retransmit.
>> >
>> > @Eric, so I did some intensive testing over the weekend and forget everything I
>> > said before. Uevents are not filtered by the kernel at the moment. This is
>> > currently - apart from network devices - a pure userspace thing. Specifically,
>> > anyone on the host can listen to all uevents from everywhere. It's neither
>> > filtered by user nor by network namespace. The fact that I used
>> >
>> > udevadm --debug monitor
>> >
>> > to test my prior hypothesis was what led me to believe that uevents are already
>> > correctly filtered.
>> > Instead, what is actually happening is that every udev implementation out there
>> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
>> > Specifically,
>>
>> Yes. I remember something of the sort. I believe udev also filters to
>> ensure that the netlink port id == 0 to ensure the message came from
>> the kernel and was not spoofed in any way.
>>
>> > - legacy standalone udevd:
>> > https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
>> > - eudevd
>> > https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
>> > - systemd-udevd
>> > https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
>> > - ueventd (Android)
>> > https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>> >
>> > For all of those I could trace this behavior back to the first released
>> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
>> > I could trace it back to the first version that is still available on
>> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
>> > trivially true that it has the same behavior from the beginning.)
>> > Android filters uevents in the same way but removed that check on January 8
>> > 2018 for what I think is invalid reasoning. The good news is that this is only
>> > in their master branch. It hasn't even made it into an rc release for Android 8
>> > yet. I filed a bug against Android and offered them a fix if they agree.
>> >
>> > In any case, userspace udev is not making use of uevents at all right now since
>> > any uid != 0 events are **explicitly** discarded.
>> > The fact that you receive uevents for
>> >
>> > sudo unshare -U --map-root -n
>> > udevadm --debug monitor
>> >
>> > is simply explained by the fact that container(0) <=> host(0) at which point
>> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
>> > discard it.
>> > The use case for receiving uevents in containers/user namespaces is definitely
>> > there but that's what the uevent injection patch series was for that we merged.
>> > This is a much safer and saner solution.
>> > The fact that all udev implementations filter uevents by uid != 0 very much
>> > seems like a security mechanism in userspace that we probably should provide by
>> > isolating uevents based on user and/or network namespaces.
>>
>> So in summary. Uevents are filtered in a user namespace (by userspace)
>> because the received uid != 0. It instead == 65534 == "nobody" because
>> the global root uid is not mapped.
>
> Exactly.
>
>>
>> Which means that we can modify the kernel to not send to all network
>> namespaces whose user_ns != &init_user_ns because we know that userspace
>> will ignore the message because of the uid anyway. Which means when
>
> Yes.
>
>> net-next reopens you can send that patch. But please base it on just
>> not including network namespaces in the list, as that is much more
>> efficient than adding more conditions to the filter.
>
> I'll send a patch out once net-next reopens. I'll also make sure to
> inform all udev userspace implementations of the change. It won't affect
> them but it is nice for them to know that they're safer now.
The real danger is in a user namespace or in a container really is too
many daemons responding to events will generate a thundering hurd of
activity when there is really nothing to do.
> Something like this (Proper commit message and so on will be added once
> I sent this out.):
Exactly.
I would make the comment say something like: "ignore all but the initial
user namespace".
Eric
> From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001
> From: Christian Brauner <christian.brauner@ubuntu.com>
> Date: Tue, 10 Apr 2018 11:56:49 +0200
> Subject: [PATCH] netns: restrict uevents to initial user namespace
>
> /* Here'll be a useful commit message describing in detail what's
> * happening once I sent it to net-next. */
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> lib/kobject_uevent.c | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..22a2c1a98b8f 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net)
>
> net->uevent_sock = ue_sk;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_add_tail(&ue_sk->list, &uevent_sock_list);
> - mutex_unlock(&uevent_sock_mutex);
> + /*
> + * Only sent uevents to uevent sockets that are located in network
> + * namespaces owned by the initial user namespace.
> + */
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_add_tail(&ue_sk->list, &uevent_sock_list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
> +
> return 0;
> }
>
> @@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net)
> {
> struct uevent_sock *ue_sk = net->uevent_sock;
>
> - mutex_lock(&uevent_sock_mutex);
> - list_del(&ue_sk->list);
> - mutex_unlock(&uevent_sock_mutex);
> + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> + mutex_lock(&uevent_sock_mutex);
> + list_del(&ue_sk->list);
> + mutex_unlock(&uevent_sock_mutex);
> + }
>
> netlink_kernel_release(ue_sk->sk);
> kfree(ue_sk);
^ permalink raw reply
* Re: [PATCH net] ip_gre: clear feature flags when incompatible o_flags are set
From: David Miller @ 2018-04-10 15:04 UTC (permalink / raw)
To: sd; +Cc: netdev, lorenzo.bianconi, lucien.xin, u9012063
In-Reply-To: <a989fb36846d45587455e648ed4e8ff8707bcee2.1523357388.git.sd@queasysnail.net>
From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 10 Apr 2018 12:57:18 +0200
> Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
> netlink") added the ability to change o_flags, but missed that the
> GSO/LLTX features are disabled by default, and only enabled some gre
> features are unused. Thus we also need to disable the GSO/LLTX features
> on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.
>
> These two examples should result in the same features being set:
>
> ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0
>
> ip link set gre_none type gre seq
> ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq
>
> Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Applied and queued up for -stable, thank you.
^ permalink raw reply
* Re: [PATCH linux] net: fix deadlock while clearing neighbor proxy table
From: David Miller @ 2018-04-10 15:02 UTC (permalink / raw)
To: w.bumiller; +Cc: netdev, yoshfuji
In-Reply-To: <20180410091514.28704-1-w.bumiller@proxmox.com>
From: Wolfgang Bumiller <w.bumiller@proxmox.com>
Date: Tue, 10 Apr 2018 11:15:14 +0200
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 7b7a14abba28..601df647588c 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -292,7 +292,6 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
> write_lock_bh(&tbl->lock);
> neigh_flush_dev(tbl, dev);
> pneigh_ifdown(tbl, dev);
> - write_unlock_bh(&tbl->lock);
If we are going to fix it this way, we need to annotate the code here in some
way so that future readers understand why the tbl->lock is not being released
here.
One way is to add a comment.
Another way is to rename pneigh_ifdown() to "pneigh_ifdown_and_unlock()".
^ permalink raw reply
* Re: [PATCH] net: usb: hso: Replace GFP_ATOMIC with GFP_KERNEL in hso_create_device
From: Johan Hovold @ 2018-04-10 14:59 UTC (permalink / raw)
To: Jia-Ju Bai
Cc: davem, andreas, johan, johannes.berg, stephen, Linyu.Yuan,
linux-usb, netdev, linux-kernel
In-Reply-To: <1523370924-32425-1-git-send-email-baijiaju1990@gmail.com>
On Tue, Apr 10, 2018 at 10:35:24PM +0800, Jia-Ju Bai wrote:
> hso_create_device() is never called in atomic context.
>
> The call chains ending up at hso_create_device() are:
> [1] hso_create_device() <- hso_create_bulk_serial_device() <- hso_probe()
> [2] hso_create_device() <- hso_create_mux_serial_device() <- hso_probe()
> [3] hso_create_device() <- hso_create_net_device() <- hso_probe()
> hso_probe() is set as ".probe" in struct usb_driver,
> so it is not called in atomic context.
>
> Despite never getting called from atomic context,
> hso_create_device() calls kzalloc() with GFP_ATOMIC,
> which does not sleep for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> which can sleep and improve the possibility of sucessful allocation.
>
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
>
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Thanks for the patch. This looks good to me.
Reviewed-by: Johan Hovold <johan@kernel.org>
Johan
^ permalink raw reply
* Re: mISDN: Remove VLAs
From: David Miller @ 2018-04-10 14:57 UTC (permalink / raw)
To: labbott; +Cc: netdev
This patch puts more than 1400 bytes onto the kernel stack in
l1oip_socket_send(), which is a cure worse than the disease.
I'm not applying this, sorry.
^ permalink raw reply
* Re: [PATCH net-next 0/3] ave: fix the activation issues for some UniPhier SoCs
From: David Miller @ 2018-04-10 14:53 UTC (permalink / raw)
To: hayashi.kunihiko
Cc: netdev, andrew, f.fainelli, robh+dt, mark.rutland,
linux-arm-kernel, linux-kernel, devicetree, yamada.masahiro,
masami.hiramatsu, jaswinder.singh
In-Reply-To: <1523255925-6469-1-git-send-email-hayashi.kunihiko@socionext.com>
From: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Date: Mon, 9 Apr 2018 15:38:42 +0900
> This add the following stuffs to fix the activation issues and satisfy
> requirements for AVE ethernet driver implemented on some UniPhier SoCs.
>
> - Add support for additional necessary clocks and resets, because the kernel
> is stalled on Pro4 due to lack of them.
>
> - Check whether the SoC supports the specified phy-mode
>
> - Add DT property support indicating system controller that has the feature
> for configurating phy-mode including built-in phy on LD11.
The net-next tree is currently closed, please resubmit this when the merge
window is open and I explicitly open the net-next tree for submissions again.
Thank you.
^ permalink raw reply
* Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: David Miller @ 2018-04-10 14:50 UTC (permalink / raw)
To: jasowang
Cc: stefanha, virtualization, syzkaller-bugs, mst, torvalds, kvm,
linux-kernel, netdev
In-Reply-To: <8642c057-e49d-d95d-19e5-6d304f20064e@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 10 Apr 2018 14:40:10 +0800
> On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
>> v2:
>> * Rewrote the conditional to make the vq access check clearer [Linus]
>> * Added Patch 2 to make the return type consistent and harder to misuse
>> * [Linus]
>>
>> The first patch fixes the vhost virtqueue access check which was
>> recently
>> broken. The second patch replaces the int return type with bool to
>> prevent
>> future bugs.
>>
>> Stefan Hajnoczi (2):
>> vhost: fix vhost_vq_access_ok() log check
>> vhost: return bool from *_access_ok() functions
>>
>> drivers/vhost/vhost.h | 4 +--
>> drivers/vhost/vhost.c | 70
>> ++++++++++++++++++++++++++-------------------------
>> 2 files changed, 38 insertions(+), 36 deletions(-)
>>
>
> Acked-by: Jason Wang <jasowang@redhat.com>
This sereis doesn't apply cleanly to the net tree, please respin
and add the appropriate Acks and Fixes: tags that Michael and Jason
have provided.
Thank you.
^ permalink raw reply
* [RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast
Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man,
John Fastabend
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions, all
written by John:
- bpf_redirect_map()
- bpf_sk_redirect_map()
- bpf_sock_map_update()
- bpf_msg_redirect_map()
- bpf_msg_apply_bytes()
- bpf_msg_cork_bytes()
- bpf_msg_pull_data()
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 140 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 140 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7343af4196c8..db090ad03626 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1250,6 +1250,51 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+ * Description
+ * Redirect the packet to the endpoint referenced by *map* at
+ * index *key*. Depending on its type, his *map* can contain
+ * references to net devices (for forwarding packets through other
+ * ports), or to CPUs (for redirecting XDP frames to another CPU;
+ * but this is not fully implemented as of this writing).
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ * Return
+ * **XDP_REDIRECT** on success, or **XDP_ABORT** on error.
+ *
+ * int bpf_sk_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+ * Description
+ * Redirect the packet to the socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_SOCKMAP**) at index *key*. The only flag
+ * supported for now is **BPF_F_INGRESS**, which indicates the
+ * packet is to be redirected to the ingress side of the socket
+ * instead of (by default) egress.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ * Return
+ * **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_sock_map_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
+ * Description
+ * Add an entry to, or update a *map* referencing sockets. The
+ * *skops* is used as a new value for the entry associated to
+ * *key*. *flags* is one of:
+ *
+ * **BPF_NOEXIST**
+ * The entry for *key* must not exist in the map.
+ * **BPF_EXIST**
+ * The entry for *key* must already exist in the map.
+ * **BPF_ANY**
+ * No condition on the existence of the entry for *key*.
+ *
+ * If the *map* has eBPF programs (parser and verdict), those will
+ * be inherited by the socket being added. If the socket is
+ * already attached to eBPF programs, this results in an error.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
* Description
* Adjust the address pointed by *xdp_md*\ **->data_meta** by
@@ -1417,6 +1462,101 @@ union bpf_attr {
* be set is returned (which comes down to 0 if all bits were set
* as required).
*
+ * int bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+ * Description
+ * This helper is used in programs implementing policies at the
+ * socket level. If the message *msg* is allowed to pass (i.e. if
+ * the verdict eBPF program returns **SK_PASS**), redirect it to
+ * the socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_SOCKMAP**) at index *key*. The only flag
+ * supported for now is **BPF_F_INGRESS**, which indicates the
+ * packet is to be redirected to the ingress side of the socket
+ * instead of (by default) egress.
+ * Return
+ * **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * Description
+ * For socket policies, apply the verdict of the eBPF program to
+ * the next *bytes* (number of bytes) of message *msg*.
+ *
+ * For example, this helper can be used in the following cases:
+ *
+ * * A single **sendmsg**\ () or **sendfile**\ () system call
+ * contains multiple logical messages that the eBPF program is
+ * supposed to read and for which it should apply a verdict.
+ * * An eBPF program only cares to read the first *bytes* of a
+ * *msg*. If the message has a large payload, then setting up
+ * and calling the eBPF program repeatedly for all bytes, even
+ * though the verdict is already known, would create unnecessary
+ * overhead.
+ *
+ * When called from within an eBPF program, the helper sets a
+ * counter internal to the BPF infrastructure, that is used to
+ * apply the last verdict to the next *bytes*. If *bytes* is
+ * smaller than the current data being processed from a
+ * **sendmsg**\ () or **sendfile**\ () system call, the first
+ * *bytes* will be sent and the eBPF program will be re-run with
+ * the pointer for start of data pointing to byte number *bytes*
+ * **+ 1**. If *bytes* is larger than the current data being
+ * processed, then the eBPF verdict will be applied to multiple
+ * **sendmsg**\ () or **sendfile**\ () calls until *bytes* are
+ * consumed.
+ *
+ * Note that if a socket closes with the internal counter holding
+ * a non-zero value, this is not a problem because data is not
+ * being buffered for *bytes* and is sent as it is received.
+ * Return
+ * 0
+ *
+ * int bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+ * Description
+ * For socket policies, prevent the execution of the verdict eBPF
+ * program for message *msg* until *bytes* (byte number) have been
+ * accumulated.
+ *
+ * This can be used when one needs a specific number of bytes
+ * before a verdict can be assigned, even if the data spans
+ * multiple **sendmsg**\ () or **sendfile**\ () calls. The extreme
+ * case would be a user calling **sendmsg**\ () repeatedly with
+ * 1-byte long message segments. Obviously, this is bad for
+ * performance, but it is still valid. If the eBPF program needs
+ * *bytes* bytes to validate a header, this helper can be used to
+ * prevent the eBPF program to be called again until *bytes* have
+ * been accumulated.
+ * Return
+ * 0
+ *
+ * int bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+ * Description
+ * For socket policies, pull in non-linear data from user space
+ * for *msg* and set pointers *msg*\ **->data** and *msg*\
+ * **->data_end** to *start* and *end* bytes offsets into *msg*,
+ * respectively.
+ *
+ * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ * *msg* it can only parse data that the (**data**, **data_end**)
+ * pointers have already consumed. For **sendmsg**\ () hooks this
+ * is likely the first scatterlist element. But for calls relying
+ * on the **sendpage** handler (e.g. **sendfile**\ ()) this will
+ * be the range (**0**, **0**) because the data is shared with
+ * user space and by default the objective is to avoid allowing
+ * user space to modify data while (or after) eBPF verdict is
+ * being decided. This helper can be used to pull in data and to
+ * set the start and end pointer to given values. Data will be
+ * copied if necessary (i.e. if data was not linear and if start
+ * and end pointers do not point to the same chunk).
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
* Description
* Bind the socket associated to *ctx* to the address pointed by
--
2.14.1
^ permalink raw reply related
* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast
Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man,
Lawrence Brakmo, Yonghong Song, Josef Bacik, Andrey Ignatov
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions:
Helpers from Lawrence:
- bpf_setsockopt()
- bpf_getsockopt()
- bpf_sock_ops_cb_flags_set()
Helpers from Yonghong:
- bpf_perf_event_read_value()
- bpf_perf_prog_read_value()
Helper from Josef:
- bpf_override_return()
Helper from Andrey:
- bpf_bind()
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 184 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 15d9ccafebbe..7343af4196c8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1208,6 +1208,28 @@ union bpf_attr {
* Return
* 0
*
+ * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
+ * Description
+ * Emulate a call to **setsockopt()** on the socket associated to
+ * *bpf_socket*, which must be a full socket. The *level* at
+ * which the option resides and the name *optname* of the option
+ * must be specified, see **setsockopt(2)** for more information.
+ * The option value of length *optlen* is pointed by *optval*.
+ *
+ * This helper actually implements a subset of **setsockopt()**.
+ * It supports the following *level*\ s:
+ *
+ * * **SOL_SOCKET**, which supports the following *optname*\ s:
+ * **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
+ * **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**.
+ * * **IPPROTO_TCP**, which supports the following *optname*\ s:
+ * **TCP_CONGESTION**, **TCP_BPF_IW**,
+ * **TCP_BPF_SNDCWND_CLAMP**.
+ * * **IPPROTO_IP**, which supports *optname* **IP_TOS**.
+ * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
* Description
* Grow or shrink the room for data in the packet associated to
@@ -1255,6 +1277,168 @@ union bpf_attr {
* performed again.
* Return
* 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
+ * Description
+ * Read the value of a perf event counter, and store it into *buf*
+ * of size *buf_size*. This helper relies on a *map* of type
+ * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
+ * event counter is selected at the creation of the *map*. The
+ * *map* is an array whose size is the number of available CPU
+ * cores, and each cell contains a value relative to one core. The
+ * value to retrieve is indicated by *flags*, that contains the
+ * index of the core to look up, masked with
+ * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
+ * **BPF_F_CURRENT_CPU** to indicate that the value for the
+ * current CPU core should be retrieved.
+ *
+ * This helper behaves in a way close to
+ * **bpf_perf_event_read**\ () helper, save that instead of
+ * just returning the value observed, it fills the *buf*
+ * structure. This allows for additional data to be retrieved: in
+ * particular, the enabled and running times (in *buf*\
+ * **->enabled** and *buf*\ **->running**, respectively) are
+ * copied.
+ *
+ * These values are interesting, because hardware PMU (Performance
+ * Monitoring Unit) counters are limited resources. When there are
+ * more PMU based perf events opened than available counters,
+ * kernel will multiplex these events so each event gets certain
+ * percentage (but not all) of the PMU time. In case that
+ * multiplexing happens, the number of samples or counter value
+ * will not reflect the case compared to when no multiplexing
+ * occurs. This makes comparison between different runs difficult.
+ * Typically, the counter value should be normalized before
+ * comparing to other experiments. The usual normalization is done
+ * as follows.
+ *
+ * ::
+ *
+ * normalized_counter = counter * t_enabled / t_running
+ *
+ * Where t_enabled is the time enabled for event and t_running is
+ * the time running for event since last normalization. The
+ * enabled and running times are accumulated since the perf event
+ * open. To achieve scaling factor between two invocations of an
+ * eBPF program, users can can use CPU id as the key (which is
+ * typical for perf array usage model) to remember the previous
+ * value and do the calculation inside the eBPF program.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
+ * Description
+ * For en eBPF program attached to a perf event, retrieve the
+ * value of the event counter associated to *ctx* and store it in
+ * the structure pointed by *buf* and of size *buf_size*. Enabled
+ * and running times are also stored in the structure (see
+ * description of helper **bpf_perf_event_read_value**\ () for
+ * more details).
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
+ * Description
+ * Emulate a call to **getsockopt()** on the socket associated to
+ * *bpf_socket*, which must be a full socket. The *level* at
+ * which the option resides and the name *optname* of the option
+ * must be specified, see **getsockopt(2)** for more information.
+ * The retrieved value is stored in the structure pointed by
+ * *opval* and of length *optlen*.
+ *
+ * This helper actually implements a subset of **getsockopt()**.
+ * It supports the following *level*\ s:
+ *
+ * * **IPPROTO_TCP**, which supports *optname*
+ * **TCP_CONGESTION**.
+ * * **IPPROTO_IP**, which supports *optname* **IP_TOS**.
+ * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_override_return(struct pt_reg *regs, u64 rc)
+ * Description
+ * Used for error injection, this helper uses kprobes to override
+ * the return value of the probed function, and to set it to *rc*.
+ * The first argument is the context *regs* on which the kprobe
+ * works.
+ *
+ * This helper works by setting setting the PC (program counter)
+ * to an override function which is run in place of the original
+ * probed function. This means the probed function is not run at
+ * all. The replacement function just returns with the required
+ * value.
+ *
+ * This helper has security implications, and thus is subject to
+ * restrictions. It is only available if the kernel was compiled
+ * with the **CONFIG_BPF_KPROBE_OVERRIDE** configuration
+ * option, and in this case it only works on functions tagged with
+ * **ALLOW_ERROR_INJECTION** in the kernel code.
+ *
+ * Also, the helper is only available for the architectures having
+ * the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing,
+ * x86 architecture is the only one to support this feature.
+ * Return
+ * 0
+ *
+ * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops_kern *bpf_sock, int argval)
+ * Description
+ * Attempt to set the value of the **bpf_sock_ops_cb_flags** field
+ * for the full TCP socket associated to *bpf_sock_ops* to
+ * *argval*.
+ *
+ * The primary use of this field is to determine if there should
+ * be calls to eBPF programs of type
+ * **BPF_PROG_TYPE_SOCK_OPS** at various points in the TCP
+ * code. A program of the same type can change its value, per
+ * connection and as necessary, when the connection is
+ * established. This field is directly accessible for reading, but
+ * this helper must be used for updates in order to return an
+ * error if an eBPF program tries to set a callback that is not
+ * supported in the current kernel.
+ *
+ * The supported callback values that *argval* can combine are:
+ *
+ * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
+ * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
+ * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
+ *
+ * Here are some examples of where one could call such eBPF
+ * program:
+ *
+ * * When RTO fires.
+ * * When a packet is retransmitted.
+ * * When the connection terminates.
+ * * When a packet is sent.
+ * * When a packet is received.
+ * Return
+ * Code **-EINVAL** if the socket is not a full TCP socket;
+ * otherwise, a positive number containing the bits that could not
+ * be set is returned (which comes down to 0 if all bits were set
+ * as required).
+ *
+ * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
+ * Description
+ * Bind the socket associated to *ctx* to the address pointed by
+ * *addr*, of length *addr_len*. This allows for making outgoing
+ * connection from the desired IP address, which can be useful for
+ * example when all processes inside a cgroup should use one
+ * single IP address on a host that has multiple IP configured.
+ *
+ * This helper works for IPv4 and IPv6, TCP and UDP sockets. The
+ * domain (*addr*\ **->sa_family**) must be **AF_INET** (or
+ * **AF_INET6**). Looking for a free port to bind to can be
+ * expensive, therefore binding to port is not permitted by the
+ * helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
+ * must be set to zero.
+ *
+ * As for the remote end, both parts of it can be overridden,
+ * remote IP and remote port. This can be useful if an application
+ * inside a cgroup wants to connect to another application inside
+ * the same cgroup or to itself, but knows nothing about the IP
+ * address assigned to the cgroup.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
--
2.14.1
^ permalink raw reply related
* [RFC bpf-next v2 6/8] bpf: add documentation for eBPF helpers (42-50)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast
Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man,
Kaixu Xia, Martin KaFai Lau, Sargun Dhillon, Thomas Graf,
Gianluca Borello, Chenbo Feng
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions:
Helper from Kaixu:
- bpf_perf_event_read()
Helpers from Martin:
- bpf_skb_under_cgroup()
- bpf_xdp_adjust_head()
Helpers from Sargun:
- bpf_probe_write_user()
- bpf_current_task_under_cgroup()
Helper from Thomas:
- bpf_skb_change_head()
Helper from Gianluca:
- bpf_probe_read_str()
Helpers from Chenbo:
- bpf_get_socket_cookie()
- bpf_get_socket_uid()
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Gianluca Borello <g.borello@gmail.com>
Cc: Chenbo Feng <fengc@google.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 158 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 158 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index af429ec79f50..15d9ccafebbe 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -701,6 +701,25 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * u64 bpf_perf_event_read(struct bpf_map *map, u64 flags)
+ * Description
+ * Read the value of a perf event counter. This helper relies on a
+ * *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The
+ * nature of the perf event counter is selected at the creation of
+ * the *map*. The *map* is an array whose size is the number of
+ * available CPU cores, and each cell contains a value relative to
+ * one core. The value to retrieve is indicated by *flags*, that
+ * contains the index of the core to look up, masked with
+ * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
+ * **BPF_F_CURRENT_CPU** to indicate that the value for the
+ * current CPU core should be retrieved.
+ *
+ * Note that before Linux 4.13, only hardware perf event can be
+ * retrieved.
+ * Return
+ * The value of the perf event counter read from the map, or a
+ * negative error code in case of failure.
+ *
* int bpf_redirect(u32 ifindex, u64 flags)
* Description
* Redirect the packet to another net device of index *ifindex*.
@@ -939,6 +958,17 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
+ * Description
+ * Check whether *skb* is a descendant of the cgroup2 held by
+ * *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
+ * Return
+ * The return value depends on the result of the test, and can be:
+ *
+ * * 0, if the *skb* failed the cgroup2 descendant test.
+ * * 1, if the *skb* succeeded the cgroup2 descendant test.
+ * * A negative error code, if an error occurred.
+ *
* u32 bpf_get_hash_recalc(struct sk_buff *skb)
* Description
* Retrieve the hash of the packet, *skb*\ **->hash**. If it is
@@ -959,6 +989,37 @@ union bpf_attr {
* Return
* A pointer to the current task struct.
*
+ * int bpf_probe_write_user(void *dst, const void *src, u32 len)
+ * Description
+ * Attempt in a safe way to write *len* bytes from the buffer
+ * *src* to *dst* in memory. It only works for threads that are in
+ * user context.
+ *
+ * This helper should not be used to implement any kind of
+ * security mechanism because of TOC-TOU attacks, but rather to
+ * debug, divert, and manipulate execution of semi-cooperative
+ * processes.
+ *
+ * Keep in mind that this feature is meant for experiments, and it
+ * has a risk of crashing the system and running programs.
+ * Therefore, when an eBPF program using this helper is attached,
+ * a warning including PID and process name is printed to kernel
+ * logs.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
+ * Description
+ * Check whether the probe is being run is the context of a given
+ * subset of the cgroup2 hierarchy. The cgroup2 to test is held by
+ * *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
+ * Return
+ * The return value depends on the result of the test, and can be:
+ *
+ * * 0, if the *skb* task belongs to the cgroup2.
+ * * 1, if the *skb* task does not belong to the cgroup2.
+ * * A negative error code, if an error occurred.
+ *
* int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
* Description
* Resize (trim or grow) the packet associated to *skb* to the
@@ -1043,6 +1104,103 @@ union bpf_attr {
* Return
* The id of current NUMA node.
*
+ * int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
+ * Description
+ * Grows headroom of packet associated to *skb* and adjusts the
+ * offset of the MAC header accordingly, adding *len* bytes of
+ * space. It automatically extends and reallocates memory as
+ * required.
+ *
+ * This helper can be used on a layer 3 *skb* to push a MAC header
+ * for redirection into a layer 2 device.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
+ * Description
+ * Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that
+ * it is possible to use a negative value for *delta*. This helper
+ * can be used to prepare the packet for pushing or popping
+ * headers.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
+ * Description
+ * Copy a NUL terminated string from an unsafe address
+ * *unsafe_ptr* to *dst*. The *size* should include the
+ * terminating NUL byte. In case the string length is smaller than
+ * *size*, the target is not padded with further NUL bytes. If the
+ * string length is larger than *size*, just *size*-1 bytes are
+ * copied and the last byte is set to NUL.
+ *
+ * On success, the length of the copied string is returned. This
+ * makes this helper useful in tracing programs for reading
+ * strings, and more importantly to get its length at runtime. See
+ * the following snippet:
+ *
+ * ::
+ *
+ * SEC("kprobe/sys_open")
+ * void bpf_sys_open(struct pt_regs *ctx)
+ * {
+ * char buf[PATHLEN]; // PATHLEN is defined to 256
+ * int res = bpf_probe_read_str(buf, sizeof(buf),
+ * ctx->di);
+ *
+ * // Consume buf, for example push it to
+ * // userspace via bpf_perf_event_output(); we
+ * // can use res (the string length) as event
+ * // size, after checking its boundaries.
+ * }
+ *
+ * In comparison, using **bpf_probe_read()** helper here instead
+ * to read the string would require to estimate the length at
+ * compile time, and would often result in copying more memory
+ * than necessary.
+ *
+ * Another useful use case is when parsing individual process
+ * arguments or individual environment variables navigating
+ * *current*\ **->mm->arg_start** and *current*\
+ * **->mm->env_start**: using this helper and the return value,
+ * one can quickly iterate at the right offset of the memory area.
+ * Return
+ * On success, the strictly positive length of the string,
+ * including the trailing NUL character. On error, a negative
+ * value.
+ *
+ * u64 bpf_get_socket_cookie(struct sk_buff *skb)
+ * Description
+ * Retrieve the socket cookie generated by the kernel from a
+ * **struct sk_buff** with a known socket. If none has been set
+ * yet, generate a new cookie. This helper can be useful for
+ * monitoring per socket networking traffic statistics as it
+ * provides a unique socket identifier per namespace.
+ * Return
+ * A 8-byte long non-decreasing number on success, or 0 if the
+ * socket field is missing inside *skb*.
+ *
+ * u32 bpf_get_socket_uid(struct sk_buff *skb)
+ * Return
+ * The owner UID of the socket associated to *skb*. If the socket
+ * is **NULL**, or if it is not a full socket (i.e. if it is a
+ * time-wait or a request socket instead), **overflowuid** value
+ * is returned (note that **overflowuid** might also be the actual
+ * UID value for the socket).
+ *
* u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
* Description
* Set the full hash for *skb* (set the field *skb*\ **->hash**)
--
2.14.1
^ permalink raw reply related
* [RFC bpf-next v2 5/8] bpf: add documentation for eBPF helpers (33-41)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions, all
written by Daniel:
- bpf_get_hash_recalc()
- bpf_skb_change_tail()
- bpf_skb_pull_data()
- bpf_csum_update()
- bpf_set_hash_invalid()
- bpf_get_numa_node_id()
- bpf_set_hash()
- bpf_skb_adjust_room()
- bpf_xdp_adjust_meta()
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 155 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 155 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d147d9dd6a83..af429ec79f50 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -939,9 +939,164 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * u32 bpf_get_hash_recalc(struct sk_buff *skb)
+ * Description
+ * Retrieve the hash of the packet, *skb*\ **->hash**. If it is
+ * not set, in particular if the hash was cleared due to mangling,
+ * recompute this hash. Later accesses to the hash can be done
+ * directly with *skb*\ **->hash**.
+ *
+ * Calling **bpf_set_hash_invalid**\ (), changing a packet
+ * prototype with **bpf_skb_change_proto**\ (), or calling
+ * **bpf_skb_store_bytes**\ () with the
+ * **BPF_F_INVALIDATE_HASH** are actions susceptible to clear
+ * the hash and to trigger a new computation for the next call to
+ * **bpf_get_hash_recalc**\ ().
+ * Return
+ * The 32-bit hash.
+ *
* u64 bpf_get_current_task(void)
* Return
* A pointer to the current task struct.
+ *
+ * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
+ * Description
+ * Resize (trim or grow) the packet associated to *skb* to the
+ * new *len*. The *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * The basic idea is that the helper performs the needed work to
+ * change the size of the packet, then the eBPF program rewrites
+ * the rest via helpers like **bpf_skb_store_bytes**\ (),
+ * **bpf_l3_csum_replace**\ (), **bpf_l3_csum_replace**\ ()
+ * and others. This helper is a slow path utility intended for
+ * replies with control messages. And because it is targeted for
+ * slow path, the helper itself can afford to be slow: it
+ * implicitly linearizes, unclones and drops offloads from the
+ * *skb*.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_pull_data(struct sk_buff *skb, u32 len)
+ * Description
+ * Pull in non-linear data in case the *skb* is non-linear and not
+ * all of *len* are part of the linear section. Make *len* bytes
+ * from *skb* readable and writable. If a zero value is passed for
+ * *len*, then the whole length of the *skb* is pulled.
+ *
+ * This helper is only needed for reading and writing with direct
+ * packet access.
+ *
+ * For direct packet access, when testing that offsets to access
+ * are within packet boundaries (test on *skb*\ **->data_end**)
+ * fails, programs just bail out, or, in the direct read case, use
+ * **bpf_skb_load_bytes()** as an alternative to overcome this
+ * limitation. If such data sits in non-linear parts, it is
+ * possible to pull them in once with the new helper, retest and
+ * eventually access them.
+ *
+ * At the same time, this also makes sure the skb is uncloned,
+ * which is a necessary condition for direct write. As this needs
+ * to be an invariant for the write part only, the verifier
+ * detects writes and adds a prologue that is calling
+ * **bpf_skb_pull_data()** to effectively unclone the skb from the
+ * very beginning in case it is indeed cloned.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * s64 bpf_csum_update(struct sk_buff *skb, __wsum csum)
+ * Description
+ * Add the checksum *csum* into *skb*\ **->csum** in case the
+ * driver fed us an IP checksum. Return an error otherwise. This
+ * header is intended to be used in combination with
+ * **bpf_csum_diff()** helper, in particular when the checksum
+ * needs to be updated after data has been written into the packet
+ * through direct packet access.
+ * Return
+ * The checksum on success, or a negative error code in case of
+ * failure.
+ *
+ * void bpf_set_hash_invalid(struct sk_buff *skb)
+ * Description
+ * Invalidate the current *skb*\ **->hash**. It can be used after
+ * mangling on headers through direct packet access, in order to
+ * indicate that the hash is outdated and to trigger a
+ * recalculation the next time the kernel tries to access this
+ * hash.
+ *
+ * int bpf_get_numa_node_id(void)
+ * Description
+ * Return the id of the current NUMA node. The primary use case
+ * for this helper is the selection of sockets for the local NUMA
+ * node, when the program is attached to sockets using the
+ * **SO_ATTACH_REUSEPORT_EBPF** option (see also **socket(7)**).
+ * Return
+ * The id of current NUMA node.
+ *
+ * u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
+ * Description
+ * Set the full hash for *skb* (set the field *skb*\ **->hash**)
+ * to value *hash*.
+ * Return
+ * 0
+ *
+ * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
+ * Description
+ * Grow or shrink the room for data in the packet associated to
+ * *skb* by *len_diff*, and according to the selected *mode*.
+ *
+ * There is a single supported mode at this time:
+ *
+ * * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer
+ * (room space is added or removed below the layer 3 header).
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
+ * Description
+ * Adjust the address pointed by *xdp_md*\ **->data_meta** by
+ * *delta* (which can be positive or negative). Note that this
+ * operation modifies the address stored in *xdp_md*\ **->data**,
+ * so the latter must be loaded only after the helper has been
+ * called.
+ *
+ * The use of *xdp_md*\ **->data_meta** is optional and programs
+ * are not required to use it. The rationale is that when the
+ * packet is processed with XDP (e.g. as DoS filter), it is
+ * possible to push further meta data along with it before passing
+ * to the stack, and to give the guarantee that an ingress eBPF
+ * program attached as a TC classifier on the same device can pick
+ * this up for further post-processing. Since TC works with socket
+ * buffers, it remains possible to set from XDP the **mark** or
+ * **priority** pointers, or other pointers for the socket buffer.
+ * Having this scratch space generic and programmable allows for
+ * more flexibility as the user is free to store whatever meta
+ * data they need.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
--
2.14.1
^ permalink raw reply related
* [RFC bpf-next v2 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions, all
written by Daniel:
- bpf_get_prandom_u32()
- bpf_get_smp_processor_id()
- bpf_get_cgroup_classid()
- bpf_get_route_realm()
- bpf_skb_load_bytes()
- bpf_csum_diff()
- bpf_skb_get_tunnel_opt()
- bpf_skb_set_tunnel_opt()
- bpf_skb_change_proto()
- bpf_skb_change_type()
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 125 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f3ea8824efbc..d147d9dd6a83 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -473,6 +473,14 @@ union bpf_attr {
* The number of bytes written to the buffer, or a negative error
* in case of failure.
*
+ * u32 bpf_prandom_u32(void)
+ * Return
+ * A random 32-bit unsigned value.
+ *
+ * u32 bpf_get_smp_processor_id(void)
+ * Return
+ * The SMP (Symmetric multiprocessing) processor id.
+ *
* int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
* Description
* Store *len* bytes from address *from* into the packet
@@ -604,6 +612,13 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
+ * Description
+ * Retrieve the classid for the current task, i.e. for the
+ * net_cls (network classifier) cgroup to which *skb* belongs.
+ * Return
+ * The classid, or 0 for the default unconfigured classid.
+ *
* int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
* Description
* Push a *vlan_tci* (VLAN tag control information) of protocol
@@ -703,6 +718,14 @@ union bpf_attr {
* are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
* error.
*
+ * u32 bpf_get_route_realm(struct sk_buff *skb)
+ * Description
+ * Retrieve the realm or the route, that is to say the
+ * **tclassid** field of the destination for the *skb*.
+ * Return
+ * The realm of the route for the packet associated to *sdb*, or 0
+ * if none was found.
+ *
* int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
* Description
* Write perf raw sample into a perf event held by *map* of type
@@ -779,6 +802,21 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * int bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
+ * Description
+ * This helper was provided as an easy way to load data from a
+ * packet. It can be used to load *len* bytes from *offset* from
+ * the packet associated to *skb*, into the buffer pointed by
+ * *to*.
+ *
+ * Since Linux 4.7, this helper is deprecated in favor of
+ * "direct packet access", enabling packet data to be manipulated
+ * with *skb*\ **->data** and *skb*\ **->data_end** pointing
+ * respectively to the first byte of packet data and to the byte
+ * after the last byte of packet data.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
* Description
* Walk a user or a kernel stack and return its id. To achieve
@@ -814,6 +852,93 @@ union bpf_attr {
* The positive or null stack id on success, or a negative error
* in case of failure.
*
+ * s64 bpf_csum_diff(__be32 *from, u32 from_size, __be32 *to, u32 to_size, __wsum seed)
+ * Description
+ * Compute a checksum difference, from the raw buffer pointed by
+ * *from*, of length *from_size* (that must be a multiple of 4),
+ * towards the raw buffer pointed by *to*, of size *to_size*
+ * (same remark). An optional *seed* can be added to the value.
+ *
+ * This is flexible enough to be used in several ways:
+ *
+ * * With *from_size* == 0, *to_size* > 0 and *seed* set to
+ * checksum, it can be used when pushing new data.
+ * * With *from_size* > 0, *to_size* == 0 and *seed* set to
+ * checksum, it can be used when removing data from a packet.
+ * * With *from_size* > 0, *to_size* > 0 and *seed* set to 0, it
+ * can be used to compute a diff. Note that *from_size* and
+ * *to_size* do not need to be equal.
+ * Return
+ * The checksum result, or a negative error code in case of
+ * failure.
+ *
+ * int bpf_skb_get_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
+ * Description
+ * Retrieve tunnel options metadata for the packet associated to
+ * *skb*, and store the raw tunnel option data to the buffer *opt*
+ * of *size*.
+ * Return
+ * The size of the option data retrieved.
+ *
+ * int bpf_skb_set_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
+ * Description
+ * Set tunnel options metadata for the packet associated to *skb*
+ * to the option data contained in the raw buffer *opt* of *size*.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
+ * Description
+ * Change the protocol of the *skb* to *proto*. Currently
+ * supported are transition from IPv4 to IPv6, and from IPv6 to
+ * IPv4. The helper takes care of the groundwork for the
+ * transition, including resizing the socket buffer. The eBPF
+ * program is expected to fill the new headers, if any, via
+ * **skb_store_bytes**\ () and to recompute the checksums with
+ * **bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\
+ * ().
+ *
+ * Internally, the GSO type is marked as dodgy so that headers are
+ * checked and segments are recalculated by the GSO/GRO engine.
+ * The size for GSO target is adapted as well.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_change_type(struct sk_buff *skb, u32 type)
+ * Description
+ * Change the packet type for the packet associated to *skb*. This
+ * comes down to setting *skb*\ **->pkt_type** to *type*, except
+ * the eBPF program does not have a write access to *skb*\
+ * **->pkt_type** beside this helper. Using a helper here allows
+ * for graceful handling of errors.
+ *
+ * The major use case is to change incoming *skb*s to
+ * **PACKET_HOST** in a programmatic way instead of having to
+ * recirculate via **redirect**\ (..., **BPF_F_INGRESS**), for
+ * example.
+ *
+ * Note that *type* only allows certain values. At this time, they
+ * are:
+ *
+ * **PACKET_HOST**
+ * Packet is for us.
+ * **PACKET_BROADCAST**
+ * Send packet to all.
+ * **PACKET_MULTICAST**
+ * Send packet to group.
+ * **PACKET_OTHERHOST**
+ * Send packet to someone else.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* u64 bpf_get_current_task(void)
* Return
* A pointer to the current task struct.
--
2.14.1
^ permalink raw reply related
* [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.
The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.
This patch contains descriptions for the following helper functions, all
writter by Alexei:
- bpf_get_current_pid_tgid()
- bpf_get_current_uid_gid()
- bpf_get_current_comm()
- bpf_skb_vlan_push()
- bpf_skb_vlan_pop()
- bpf_skb_get_tunnel_key()
- bpf_skb_set_tunnel_key()
- bpf_redirect()
- bpf_perf_event_output()
- bpf_get_stackid()
- bpf_get_current_task()
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 237 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2bc653a3a20f..f3ea8824efbc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -580,6 +580,243 @@ union bpf_attr {
* performed again.
* Return
* 0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_current_pid_tgid(void)
+ * Return
+ * A 64-bit integer containing the current tgid and pid, and
+ * created as such:
+ * *current_task*\ **->tgid << 32 \|**
+ * *current_task*\ **->pid**.
+ *
+ * u64 bpf_get_current_uid_gid(void)
+ * Return
+ * A 64-bit integer containing the current GID and UID, and
+ * created as such: *current_gid* **<< 32 \|** *current_uid*.
+ *
+ * int bpf_get_current_comm(char *buf, u32 size_of_buf)
+ * Description
+ * Copy the **comm** attribute of the current task into *buf* of
+ * *size_of_buf*. The **comm** attribute contains the name of
+ * the executable (excluding the path) for the current task. The
+ * *size_of_buf* must be strictly positive. On success, the
+ * helper makes sure that the *buf* is NUL-terminated. On failure,
+ * it is filled with zeroes.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
+ * Description
+ * Push a *vlan_tci* (VLAN tag control information) of protocol
+ * *vlan_proto* to the packet associated to *skb*, then update
+ * the checksum. Note that if *vlan_proto* is different from
+ * **ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
+ * be **ETH_P_8021Q**.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_vlan_pop(struct sk_buff *skb)
+ * Description
+ * Pop a VLAN header from the packet associated to *skb*.
+ *
+ * A call to this helper is susceptible to change data from the
+ * packet. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * Description
+ * Get tunnel metadata. This helper takes a pointer *key* to an
+ * empty **struct bpf_tunnel_key** of **size**, that will be
+ * filled with tunnel metadata for the packet associated to *skb*.
+ * The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
+ * indicates that the tunnel is based on IPv6 protocol instead of
+ * IPv4.
+ *
+ * This is typically used on the receive path to perform a lookup
+ * or a packet redirection based on the value of *key*:
+ *
+ * ::
+ *
+ * struct bpf_tunnel_key key = {};
+ * bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+ * lookup or redirect based on key ...
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * Description
+ * Populate tunnel metadata for packet associated to *skb.* The
+ * tunnel metadata is set to the contents of *key*, of *size*. The
+ * *flags* can be set to a combination of the following values:
+ *
+ * **BPF_F_TUNINFO_IPV6**
+ * Indicate that the tunnel is based on IPv6 protocol
+ * instead of IPv4.
+ * **BPF_F_ZERO_CSUM_TX**
+ * For IPv4 packets, add a flag to tunnel metadata
+ * indicating that checksum computation should be skipped
+ * and checksum set to zeroes.
+ * **BPF_F_DONT_FRAGMENT**
+ * Add a flag to tunnel metadata indicating that the
+ * packet should not be fragmented.
+ * **BPF_F_SEQ_NUMBER**
+ * Add a flag to tunnel metadata indicating that a
+ * sequence number should be added to tunnel header before
+ * sending the packet. This flag was added for GRE
+ * encapsulation, but might be used with other protocols
+ * as well in the future.
+ *
+ * Here is a typical usage on the transmit path:
+ *
+ * ::
+ *
+ * struct bpf_tunnel_key key;
+ * populate key ...
+ * bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
+ * bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_redirect(u32 ifindex, u64 flags)
+ * Description
+ * Redirect the packet to another net device of index *ifindex*.
+ * This helper is somewhat similar to **bpf_clone_redirect**\
+ * (), except that the packet is not cloned, which provides
+ * increased performance.
+ *
+ * For hooks other than XDP, *flags* can be set to
+ * **BPF_F_INGRESS**, which indicates the packet is to be
+ * redirected to the ingress interface instead of (by default)
+ * egress. Currently, XDP does not support any flag.
+ * Return
+ * For XDP, the helper returns **XDP_REDIRECT** on success or
+ * **XDP_ABORT** on error. For other program types, the values
+ * are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
+ * error.
+ *
+ * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * Description
+ * Write perf raw sample into a perf event held by *map* of type
+ * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
+ * have the following attributes: **PERF_SAMPLE_RAW** as
+ * **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
+ * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ * The *flags* are used to indicate the index in *map* for which
+ * the value must be put, masked with **BPF_F_INDEX_MASK**.
+ * Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
+ * to indicate that the index of the current CPU core should be
+ * used.
+ *
+ * The value to write, of *size*, is passed through eBPF stack and
+ * pointed by *data*.
+ *
+ * The context of the program *ctx* needs also be passed to the
+ * helper, and will get interpreted as a pointer to a **struct
+ * pt_reg**.
+ *
+ * On user space, a program willing to read the values needs to
+ * call **perf_event_open**\ () on the perf event (either for
+ * one or for all CPUs) and to store the file descriptor into the
+ * *map*. This must be done before the eBPF program can send data
+ * into it. An example is available in file
+ * *samples/bpf/trace_output_user.c* in the Linux kernel source
+ * tree (the eBPF program counterpart is in
+ * *samples/bpf/trace_output_kern.c*). It looks like the
+ * following snippet:
+ *
+ * ::
+ *
+ * volatile struct perf_event_mmap_page *header;
+ * struct perf_event_attr attr = {
+ * .sample_type = PERF_SAMPLE_RAW,
+ * .type = PERF_TYPE_SOFTWARE,
+ * .config = PERF_COUNT_SW_BPF_OUTPUT,
+ * };
+ * int page_size;
+ * int mmap_size;
+ * int key = 0;
+ * int pmu_fd;
+ * void *base;
+ *
+ * if (load_bpf_file(filename))
+ * return -1;
+ *
+ * pmu_fd = sys_perf_event_open(&attr,
+ * -1, // pid
+ * 0, // cpu
+ * -1, // group_fd
+ * 0);
+ *
+ * assert(pmu_fd >= 0);
+ * assert(bpf_map_update_elem(map_fd[0], &key,
+ * &pmu_fd, BPF_ANY) == 0);
+ * assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
+ *
+ * page_size = getpagesize();
+ * mmap_size = page_size * (page_cnt + 1);
+ *
+ * base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+ * MAP_SHARED, fd, 0);
+ * if (base == MAP_FAILED)
+ * return -1;
+ *
+ * header = base;
+ *
+ * **bpf_perf_event_output**\ () achieves better performance
+ * than **bpf_trace_printk**\ () for sharing data with user
+ * space, and is much better suitable for streaming data from eBPF
+ * programs.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
+ * Description
+ * Walk a user or a kernel stack and return its id. To achieve
+ * this, the helper needs *ctx*, which is a pointer to the context
+ * on which the tracing program is executed, and a pointer to a
+ * *map* of type **BPF_MAP_TYPE_STACK_TRACE**.
+ *
+ * The last argument, *flags*, holds the number of stack frames to
+ * skip (from 0 to 255), masked with
+ * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * a combination of the following flags:
+ *
+ * **BPF_F_USER_STACK**
+ * Collect a user space stack instead of a kernel stack.
+ * **BPF_F_FAST_STACK_CMP**
+ * Compare stacks by hash only.
+ * **BPF_F_REUSE_STACKID**
+ * If two different stacks hash into the same *stackid*,
+ * discard the old one.
+ *
+ * The stack id retrieved is a 32 bit long integer handle which
+ * can be further combined with other data (including other stack
+ * ids) and used as a key into maps. This can be useful for
+ * generating a variety of graphs (such as flame graphs or off-cpu
+ * graphs).
+ *
+ * For walking a stack, this helper is an improvement over
+ * **bpf_probe_read**\ (), which can be used with unrolled loops
+ * but is not efficient and consumes a lot of eBPF instructions.
+ * Instead, **bpf_get_stackid**\ () can collect up to
+ * **PERF_MAX_STACK_DEPTH** both kernel and user frames.
+ * Return
+ * The positive or null stack id on success, or a negative error
+ * in case of failure.
+ *
+ * u64 bpf_get_current_task(void)
+ * Return
+ * A pointer to the current task struct.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
--
2.14.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox