Netdev List


* general protection fault in perf_tp_event_match (2)
From: syzbot @ 2019-08-08 17:24 UTC
  To: acme, alexander.shishkin, ast, bpf, daniel, jolsa, kafai,
	linux-kernel, mingo, namhyung, netdev, peterz, songliubraving,
	syzkaller-bugs, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    1e78030e Merge tag 'mmc-v5.3-rc1' of git://git.kernel.org/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1011831a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4c7b914a2680c9c6
dashboard link: https://syzkaller.appspot.com/bug?extid=076ba900c4a9a0f67aba
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+076ba900c4a9a0f67aba@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 22070 Comm: syz-executor.3 Not tainted 5.3.0-rc2+ #86
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:perf_tp_event_match+0x31/0x260 kernel/events/core.c:8560
Code: 89 f6 41 55 49 89 d5 41 54 53 48 89 fb e8 b7 0e ea ff 48 8d bb d0 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e cc 01 00 00 44 8b a3 d0 01 00
RSP: 0018:ffff88804ffa7790 EFLAGS: 00010007
RAX: dffffc0000000000 RBX: 00000000ffffff9f RCX: ffffffff818bcb73
RDX: 000000002000002d RSI: ffffffff818890b9 RDI: 000000010000016f
RBP: ffff88804ffa77b0 R08: ffff8880531ba640 R09: ffffed100a6374c9
R10: ffffed100a6374c8 R11: ffff8880531ba647 R12: ffff8880ae830860
R13: ffff8880ae830860 R14: ffff88804ffa7880 R15: dffffc0000000000
FS:  00005555556d7940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000738008 CR3: 000000004cad5000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  perf_tp_event+0x1ea/0x730 kernel/events/core.c:8611
  perf_trace_run_bpf_submit+0x131/0x190 kernel/events/core.c:8586
  perf_trace_sched_wakeup_template+0x42d/0x5d0 include/trace/events/sched.h:57
  trace_sched_wakeup_new include/trace/events/sched.h:103 [inline]
  wake_up_new_task+0x70f/0xbd0 kernel/sched/core.c:2848
  _do_fork+0x26c/0xfa0 kernel/fork.c:2393
  __do_sys_clone kernel/fork.c:2524 [inline]
  __se_sys_clone kernel/fork.c:2505 [inline]
  __x64_sys_clone+0x18d/0x250 kernel/fork.c:2505
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457dfa
Code: f7 d8 64 89 04 25 d4 02 00 00 64 4c 8b 0c 25 10 00 00 00 31 d2 4d 8d 91 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 f5 00 00 00 85 c0 41 89 c5 0f 85 fc 00 00
RSP: 002b:00007ffcf0b1c640 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 00007ffcf0b1c640 RCX: 0000000000457dfa
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 00007ffcf0b1c680 R08: 0000000000000001 R09: 00005555556d7940
R10: 00005555556d7c10 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcf0b1c6d0
Modules linked in:
---[ end trace 8f4efeb0ada52ec1 ]---
RIP: 0010:perf_tp_event_match+0x31/0x260 kernel/events/core.c:8560
Code: 89 f6 41 55 49 89 d5 41 54 53 48 89 fb e8 b7 0e ea ff 48 8d bb d0 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e cc 01 00 00 44 8b a3 d0 01 00
RSP: 0018:ffff88804ffa7790 EFLAGS: 00010007
RAX: dffffc0000000000 RBX: 00000000ffffff9f RCX: ffffffff818bcb73
RDX: 000000002000002d RSI: ffffffff818890b9 RDI: 000000010000016f
RBP: ffff88804ffa77b0 R08: ffff8880531ba640 R09: ffffed100a6374c9
R10: ffffed100a6374c8 R11: ffff8880531ba647 R12: ffff8880ae830860
R13: ffff8880ae830860 R14: ffff88804ffa7880 R15: dffffc0000000000
FS:  00005555556d7940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000738008 CR3: 000000004cad5000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
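
The faulting access can be checked by hand from the register dump above. The sketch below assumes the usual x86-64 KASAN inline scheme (shadow byte at `(addr >> 3) + 0xdffffc0000000000`, the constant visible in RAX) and my own reading of the Code: bytes as `lea 0x1d0(%rbx),%rdi` followed by the shadow load. That decoding is not part of syzbot's output, and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Shadow offset seen in RAX/R15 of the dump (assumed x86-64 KASAN default). */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL
/* One shadow byte describes 8 bytes of memory, hence the shift by 3. */
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Address of the shadow byte the instrumented code reads before the real load. */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* With 48-bit virtual addresses, bits 63..47 of a canonical address are all equal. */
static int is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}

/* Re-derive RDI and RDX from RBX and show why the shadow load faults. */
static void check_report_values(void)
{
	uint64_t rbx = 0x00000000ffffff9fULL;      /* bogus "event" pointer */
	uint64_t rdi = rbx + 0x1d0;                /* field offset from the lea */
	uint64_t shadow = kasan_shadow_addr(rdi);

	assert(rdi == 0x000000010000016fULL);                       /* RDI in the dump */
	assert((rdi >> KASAN_SHADOW_SCALE_SHIFT) == 0x2000002dULL); /* RDX in the dump */
	assert(!is_canonical(shadow));
	printf("shadow load from %#llx is non-canonical\n",
	       (unsigned long long)shadow);
}
```

Calling `check_report_values()` reproduces RDI and RDX from RBX. If the assumptions hold, the general protection fault comes from the KASAN shadow lookup on a garbage pointer held in RBX rather than from a plain NULL dereference.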


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-08-08 17:31 UTC
  To: Willem de Bruijn
  Cc: John Fastabend, David Miller, Network Development, davejwatson,
	borisp, aviadye, Daniel Borkmann, Eric Dumazet,
	Alexei Starovoitov, oss-drivers
In-Reply-To: <CA+FuTSc7H6X+rRnxZ5NcFiNy+pw1YCONiUr+K6g800DXzT_0EA@mail.gmail.com>

On Thu, 8 Aug 2019 11:59:18 -0400, Willem de Bruijn wrote:
> > diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> > index 7c0b2b778703..43922d86e510 100644
> > --- a/net/tls/tls_device.c
> > +++ b/net/tls/tls_device.c
> > @@ -373,9 +373,9 @@ static int tls_push_data(struct sock *sk,
> >         struct tls_context *tls_ctx = tls_get_ctx(sk);
> >         struct tls_prot_info *prot = &tls_ctx->prot_info;
> >         struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
> > -       int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> >         int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> >         struct tls_record_info *record = ctx->open_record;
> > +       int tls_push_record_flags;
> >         struct page_frag *pfrag;
> >         size_t orig_size = size;
> >         u32 max_open_record_len;
> > @@ -390,6 +390,9 @@ static int tls_push_data(struct sock *sk,
> >         if (sk->sk_err)
> >                 return -sk->sk_err;
> >
> > +       flags |= MSG_SENDPAGE_DECRYPTED;
> > +       tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> > +  
> 
> Without being too familiar with this code: can this plaintext flag be
> set once, closer to the call to do_tcp_sendpages, in tls_push_sg?
> 
> Instead of two locations with multiple non-trivial codepaths between
> them and do_tcp_sendpages.
> 
> Or are there paths where the flag is not set? Which I imagine would
> imply already passing s/w encrypted ciphertext.

tls_push_sg() is shared with sw path which doesn't have the device
validation. 

Device TLS can reach tls_push_sg() via tls_push_partial_record() and
tls_push_data(). tls_push_data() is addressed directly here,
tls_push_partial_record() is again shared with SW path, so we have to
address it by adding the flag in tls_device_write_space().

The alternative is to add a conditional to tls_push_sg(), which is
a little less nice from a performance and layering PoV but is a lot
simpler.

Should I change?
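
The two options can be sketched in plain user-space C. This is only an illustration of the trade-off, not the kernel code: the `EX_MSG_*` bit values and both function names are made up for the example, and real code would use the kernel's own `MSG_*` definitions and TLS context.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; NOT the kernel's MSG_* constants. */
#define EX_MSG_MORE               0x1
#define EX_MSG_SENDPAGE_NOTLAST   0x2
#define EX_MSG_SENDPAGE_DECRYPTED 0x4

/* Option A (this patch): the device-only caller tags the flags up front,
 * so the shared lower layer stays unaware of offload.
 */
static int device_push_data_flags(int flags)
{
	flags |= EX_MSG_SENDPAGE_DECRYPTED;
	return flags | EX_MSG_SENDPAGE_NOTLAST; /* mirrors tls_push_record_flags */
}

/* Option B (the alternative): the shared function tests the context on
 * every call, for both the software and the device path.
 */
static int shared_push_sg_flags(int flags, bool is_device_offload)
{
	if (is_device_offload)
		flags |= EX_MSG_SENDPAGE_DECRYPTED;
	return flags;
}

/* Both options must mark device traffic; only B touches the software path. */
static void check_options_agree(void)
{
	assert(device_push_data_flags(0) & EX_MSG_SENDPAGE_DECRYPTED);
	assert(shared_push_sg_flags(0, true) & EX_MSG_SENDPAGE_DECRYPTED);
	assert(!(shared_push_sg_flags(0, false) & EX_MSG_SENDPAGE_DECRYPTED));
}
```

Option A needs one such site per device-only entry point, while option B needs a single branch that every sender pays for.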


* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Andrii Nakryiko @ 2019-08-08 17:47 UTC
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov, daniel@iogearbox.net, Kernel Team,
	Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa,
	Sam Ravnborg
In-Reply-To: <89a6e282-0250-4264-128d-469be99073e9@fb.com>

On Wed, Aug 7, 2019 at 9:24 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> > Make .BTF section allocated and expose its contents through sysfs.
> >
> > /sys/kernel/btf directory is created to contain all the BTFs present
> > inside kernel. Currently there is only kernel's main BTF, represented as
> > /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> > each module will expose its BTF as /sys/kernel/btf/<module-name> file.
> >
> > Current approach relies on a few pieces coming together:
> > 1. pahole is used to take almost final vmlinux image (modulo .BTF and
> >     kallsyms) and generate .BTF section by converting DWARF info into
> >     BTF. This section is not allocated and not mapped to any segment,
> >     though, so is not yet accessible from inside kernel at runtime.
> > 2. objcopy dumps .BTF contents into binary file and subsequently
> >     converts binary file into linkable object file with automatically
> >     generated symbols _binary__btf_kernel_bin_start and
> >     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
> >     of BTF raw data.
> > 3. final vmlinux image is generated by linking this object file (and
> >     kallsyms, if necessary). sysfs_btf.c then creates
> >     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
> >     it. This allows, e.g., libbpf and bpftool to access BTF info at
> >     well-known location, without resorting to searching for vmlinux image
> >     on disk (location of which is not standardized and vmlinux image
> >     might not be even available in some scenarios, e.g., inside qemu
> >     during testing).
> >
> > Alternative approach using .incbin assembler directive to embed BTF
> > contents directly was attempted but didn't work, because sysfs_btf.o is
> > not re-compiled during link-vmlinux.sh stage. This is required, though,
> > to update embedded BTF data (initially empty data is embedded, then
> > pahole generates BTF info and we need to regenerate sysfs_btf.o with
> > updated contents, but it's too late at that point).
> >
> > If BTF couldn't be generated due to missing or too old pahole,
> > sysfs_btf.c handles that gracefully by detecting that
> > _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> > /sys/kernel/btf at all.
> >
> > v1->v2:
> > - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> >
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Sam Ravnborg <sam@ravnborg.org>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---

[...]

> > +
> > +     # dump .BTF section into raw binary file to link with final vmlinux
> > +     bin_arch=$(${OBJDUMP} -f ${1} | grep architecture | \
> > +             cut -d, -f1 | cut -d' ' -f2)
> > +     ${OBJCOPY} --dump-section .BTF=.btf.kernel.bin ${1} 2>/dev/null
> > +     ${OBJCOPY} -I binary -O ${CONFIG_OUTPUT_FORMAT} -B ${bin_arch} \
> > +             --rename-section .data=.BTF .btf.kernel.bin ${2}
>
> Currently, the binary size on my config is about 2.6MB. Do you think
> we could or need to compress it to make it smaller? I tried gzip
> and the compressed size is 0.9MB.

I'd really prefer to keep it uncompressed for two main reasons:
- by having this in uncompressed form, kernel itself can use this BTF
data from inside with almost no additional memory (except maybe for
index from type ID to actual location of type info), which opens up a
lot of new and interesting opportunities, like kernel returning its
own BTF and BTF type ID for various types (think about driver metadata,
all those special maps, etc).
- if we are doing compression, now we need to decide on the best
compression format, teach it to libbpf (which will also make libbpf
bigger and dependent on extra libraries), etc.

So basically, in exchange for saving 1-1.5MB of memory we would get a
bunch of new problems we normally don't have to deal with.
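
To illustrate how little a consumer needs when the data is uncompressed, here is a minimal sketch of validating a raw BTF blob (for instance, the bytes read from /sys/kernel/btf/kernel) before parsing it. It assumes the standard BTF header layout with magic 0xeB9F and only looks at the first four header fields; `btf_header_prefix` and `btf_blob_looks_valid()` are names made up for this example, not libbpf API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F /* magic value of the BTF format */

/* First four fields of the BTF header; the offsets/lengths that follow
 * in the real header are omitted in this sketch.
 */
struct btf_header_prefix {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
};

/* Return 1 if buf starts with a plausible native-endian BTF header. */
static int btf_blob_looks_valid(const void *buf, size_t len)
{
	struct btf_header_prefix hdr;

	if (len < sizeof(hdr))
		return 0;
	memcpy(&hdr, buf, sizeof(hdr)); /* avoids unaligned access */
	if (hdr.magic != BTF_MAGIC)
		return 0;
	return hdr.hdr_len >= sizeof(hdr) && hdr.hdr_len <= len;
}

/* Exercise the check on a synthetic 24-byte header and on garbage. */
static void check_btf_examples(void)
{
	struct {
		struct btf_header_prefix h;
		uint32_t rest[4];
	} blob = { { BTF_MAGIC, 1, 0, 24 }, { 0 } };
	unsigned char junk[8] = { 0 };

	assert(btf_blob_looks_valid(&blob, sizeof(blob)));
	assert(!btf_blob_looks_valid(junk, sizeof(junk)));
}
```

No decompression step or extra library is involved: the check runs directly on the bytes as exposed.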

>
> >   }
> >
> >   # Create ${2} .o file with all symbols from the ${1} object file
> > @@ -153,6 +164,7 @@ sortextable()
> >   # Delete output files in case of error
> >   cleanup()
> >   {

[...]


* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Carlos Antonio Neira Bustos @ 2019-08-08 17:48 UTC
  To: Yonghong Song
  Cc: netdev@vger.kernel.org, ebiederm@xmission.com, brouer@redhat.com,
	quentin.monnet@netronome.com
In-Reply-To: <96c7ea2e-7acf-e81a-61dc-a4d4562c736a@fb.com>

Yonghong,

I have modified the patch following your feedback. 
Let me know if I'm missing something.

Bests

From 70f8d5584700c9cfc82c006901d8ee9595c53f15 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@gmail.com>
Date: Wed, 7 Aug 2019 20:04:30 -0400
Subject: [PATCH] [PATCH v6 bpf-next] BPF: New helper to obtain namespace data 
 from current task

This helper obtains the active pid namespace of current and returns the pid,
tgid, device and namespace id as seen from that namespace, which makes it
possible to instrument a process inside a container.
The device is read from /proc/self/ns/pid, because in the future different
pid_ns files may belong to different devices, according to the discussion
between Eric Biederman and Yonghong at the 2017 Linux Plumbers Conference.
Currently bpf_get_current_pid_tgid() is used to do pid filtering in bcc's
scripts, but that helper returns the pid as seen by the root namespace, which
is fine only when the bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses that limitation
by returning the pid as seen by the namespace in which the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid(), as it can be
used to do pid filtering even inside a container.

For example, a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

        u32 pid = bpf_get_current_pid_tgid() >> 32;
        if (pid != <pid_arg_passed_in>)
                return 0;

could be modified to use bpf_get_current_pidns_info() as follows:

        struct bpf_pidns_info pidns;
        bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns_info));
        u32 pid = pidns.tgid;
        u32 nsid = pidns.nsid;
        if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
                return 0;
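
The filter logic can be checked outside of BPF. The sketch below relies on bpf_get_current_pid_tgid() packing the tgid into the upper 32 bits and the pid into the lower 32 bits. `should_trace()` is a hypothetical helper written for this example, and the numbers are arbitrary sample values. To select a single process, an event has to be dropped when either the tgid or the namespace id differs.

```c
#include <assert.h>
#include <stdint.h>

/* bpf_get_current_pid_tgid() returns tgid << 32 | pid. */
static uint32_t tgid_of(uint64_t pid_tgid)
{
	return (uint32_t)(pid_tgid >> 32);
}

static uint32_t pid_of(uint64_t pid_tgid)
{
	return (uint32_t)pid_tgid;
}

/* Hypothetical predicate: keep the event only if both values match. */
static int should_trace(uint32_t tgid, uint32_t nsid,
			uint32_t want_tgid, uint32_t want_nsid)
{
	return tgid == want_tgid && nsid == want_nsid;
}

/* Sample values: thread 1235 of process 1234 in namespace 4026531836. */
static void check_filter_examples(void)
{
	uint64_t packed = ((uint64_t)1234 << 32) | 1235;

	assert(tgid_of(packed) == 1234);
	assert(pid_of(packed) == 1235);
	assert(should_trace(1234, 4026531836u, 1234, 4026531836u));
	/* Same pid number in another namespace must not match. */
	assert(!should_trace(1234, 4026532200u, 1234, 4026531836u));
}
```

The last assertion is the container case: the pid number alone is ambiguous, so the namespace id has to be part of the comparison.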

To find out the PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid
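
The same lookup can be done from C with stat(2), which follows the symlink just like `ls -L`. This is a minimal sketch: `struct pidns_id` and `read_pidns_id()` are names invented for the example, and it assumes a Linux system with /proc mounted.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

/* Device and inode number identifying a namespace file. */
struct pidns_id {
	uint64_t dev;
	uint64_t ino;
};

/* Fill *out from stat(2) on a namespace file such as /proc/<pid>/ns/pid.
 * Returns 0 on success, -1 if the path cannot be stat'ed.
 */
static int read_pidns_id(const char *path, struct pidns_id *out)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	out->dev = (uint64_t)st.st_dev;
	out->ino = (uint64_t)st.st_ino; /* the value shown by "ls -Li" */
	return 0;
}

/* A missing path is reported as an error instead of leaving *out unset. */
static void check_missing_path(void)
{
	struct pidns_id id = { 0, 0 };

	assert(read_pidns_id("/nonexistent/ns/pid", &id) == -1);
}
```

On success, `ino` is the number to pass as the namespace id argument, and `dev` corresponds to the device the helper reports.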

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 fs/internal.h                                      |   2 -
 fs/namei.c                                         |   1 -
 include/linux/bpf.h                                |   1 +
 include/linux/namei.h                              |   4 +
 include/uapi/linux/bpf.h                           |  27 +++-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/helpers.c                               |  64 ++++++++++
 kernel/trace/bpf_trace.c                           |   2 +
 samples/bpf/Makefile                               |   3 +
 samples/bpf/trace_ns_info_user.c                   |  35 ++++++
 samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
 tools/include/uapi/linux/bpf.h                     |  27 +++-
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
 .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
 tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
 16 files changed, 399 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/trace_ns_info_user.c
 create mode 100644 samples/bpf/trace_ns_info_user_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns.c

diff --git a/fs/internal.h b/fs/internal.h
index 315fcd8d237c..6647e15dd419 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
 /*
  * namei.c
  */
-extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
-			   struct path *path, struct path *root);
 extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a89fc72a4a10 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -19,7 +19,6 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
-#include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/fsnotify.h>
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f9a506147c8a..e4adf5e05afd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..b45c8b6f7cb4 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -6,6 +6,7 @@
 #include <linux/path.h>
 #include <linux/fcntl.h>
 #include <linux/errno.h>
+#include <linux/fs.h>
 
 enum { MAX_NESTED_LINKS = 8 };
 
@@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
 
 extern void nd_jump_link(struct path *path);
 
+extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
+			   struct path *path, struct path *root);
+
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
 	((char *) name)[min(len, maxlen)] = '\0';
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4393bd4b2419..b0d4869fb860 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2741,6 +2741,24 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container, as the bpf_get_current_pid_tgid() helper always
+ *		returns the pid as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
+ *		or tgid of the current task.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2871,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3623,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..3159f2a0188c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..41fbf1f28a48 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,12 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+
 
 #include "../../lib/kstrtox.h"
 
@@ -312,6 +318,64 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }
 
+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	 size)
+{
+	const char *pidns_path = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	struct filename *tmp = NULL;
+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret;
+	int len;
+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		return -EINVAL;
+	pidns = task_active_pid_ns(current);
+	if (unlikely(!pidns))
+		goto clear;
+	pidns_info->nsid =  pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+	if (unlikely(!pid))
+		goto clear;
+	tgid = task_tgid_nr_ns(current, pidns);
+	if (unlikely(!tgid))
+		goto clear;
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return -ENOMEM;
+	}
+	len = strlen(pidns_path) + 1;
+	memcpy((char *)tmp->name, pidns_path, len);
+	tmp->uptr = NULL;
+	tmp->aname = NULL;
+	tmp->refcnt = 1;
+	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
+	if (ret) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return ret;
+	}
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = inode->i_sb->s_dev;
+	return 0;
+clear:
+	memset((void *)pidns_info, 0, (size_t) size);
+	return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func		= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..5e1dc22765a5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	case BPF_FUNC_get_current_pidns_info:
+		return &bpf_get_current_pidns_info_proto;
 	default:
 		return NULL;
 	}
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1d9be26b4edd..238453ff27d2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
 hostprogs-y += ibumad
 hostprogs-y += hbm
+hostprogs-y += trace_ns_info
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
 hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
+trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
 always += ibumad_kern.o
 always += hbm_out_kern.o
 always += hbm_edt_kern.o
+always += trace_ns_info_user_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
new file mode 100644
index 000000000000..e06d08db6f30
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "bpf/libbpf.h"
+#include "bpf_load.h"
+
+/* This code was taken verbatim from tracex1_user.c, it's used
+ * to exercise the bpf_get_current_pidns_info() helper call.
+ */
+int main(int ac, char **argv)
+{
+	FILE *f;
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
+	printf("loading %s\n", filename);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	f = popen("taskset 1 ping  localhost", "r");
+	(void) f;
+	read_trace_pipe();
+	return 0;
+}
diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
new file mode 100644
index 000000000000..96675e02b707
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user_kern.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+typedef __u64 u64;
+typedef __u32 u32;
+
+
+/* kprobe is NOT a stable ABI
+ * kernel functions can be removed, renamed or completely change semantics.
+ * Number of arguments and their positions can change, etc.
+ * In such case this bpf+kprobe example will no longer be meaningful
+ */
+
+/* This will call bpf_get_current_pidns_info() to display pid and ns values
+ * as seen by the current namespace; on the far left you will see the pid as
+ * seen by the root namespace.
+ */
+
+SEC("kprobe/__netif_receive_skb_core")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
+	struct bpf_pidns_info nsinfo;
+	int ok = 0;
+
+	ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
+	if (ok == 0)
+		bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
+				 (u32) nsinfo.dev, (u32)nsinfo.pid);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4393bd4b2419..b0d4869fb860 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2741,6 +2741,24 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container, as the bpf_get_current_pid_tgid() helper always
+ *		returns the pid as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
+ *		or tgid of the current task.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2871,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3623,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a0336a..1f97b571b581 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_tcp_rtt
+	test_sockopt_multi test_tcp_rtt test_pidns
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 120aa86c58d3..c96795a9d983 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
 static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
 					  int ip_len, void *tcp, int tcp_len) =
 	(void *) BPF_FUNC_tcp_gen_syncookie;
+static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
+					 unsigned int buf_size) =
+	(void *) BPF_FUNC_get_current_pidns_info;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
new file mode 100644
index 000000000000..e1d2facfa762
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/bpf.h>
+#include <errno.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") nsidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") pidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int trace(void *ctx)
+{
+	struct bpf_pidns_info nsinfo;
+	__u32 key = 0, *expected_pid, *val;
+	char fmt[] = "ERROR nspid:%d\n";
+
+	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
+		return -EINVAL;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+
+
+	if (!expected_pid || *expected_pid != nsinfo.pid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&nsidmap, &key);
+	if (val)
+		*val = nsinfo.nsid;
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
new file mode 100644
index 000000000000..a7254055f294
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_pidns.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({		\
+	int __ret = !!(condition);			\
+	if (__ret) {					\
+		printf("%s:FAIL:%s ", __func__, tag);	\
+		printf(format);				\
+	} else {					\
+		printf("%s:PASS:%s\n", __func__, tag);	\
+	}						\
+	__ret;						\
+})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "syscalls/sys_enter_nanosleep";
+	const char *file = "test_pidns_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	int pidmap_fd, nsidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 knsid = 0;
+	__u32 key = 0, pid;
+	int exit_code = 1;
+	struct stat st;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto cleanup_cgroup_env;
+
+	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
+	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  nsidmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  pidmap_fd, errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* trigger some syscalls */
+	sleep(1);
+
+	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	if (stat("/proc/self/ns/pid", &st))
+		goto close_pmu;
+
+	if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
+		  "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
+		goto close_pmu;
+
+	exit_code = 0;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+cleanup_cgroup_env:
+	return exit_code;
+}
-- 
2.11.0
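
The user-space half of the comparison made by the selftest above (st_ino
of /proc/self/ns/pid against the nsid reported by the helper) can be
sketched on its own. This is only a minimal sketch and not part of the
patch: read_pidns_id() and struct pidns_id are hypothetical names, and
treating pid <= 0 as "self" is an assumption made for convenience.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical container for the two values of interest; not from the patch. */
struct pidns_id {
	unsigned long long ino;	/* namespace id, as printed by "ls -Li" */
	unsigned long long dev;	/* device number reported for the ns file */
};

/*
 * Fill *out for the PID namespace of 'pid' (pid <= 0 means the caller).
 * Returns 0 on success, -1 if the path does not fit or stat() fails.
 */
static int read_pidns_id(pid_t pid, struct pidns_id *out)
{
	char path[64];
	struct stat st;
	int n;

	if (pid > 0)
		n = snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)pid);
	else
		n = snprintf(path, sizeof(path), "/proc/self/ns/pid");
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	/* stat() follows the link, which matches "ls -L" */
	if (stat(path, &st))
		return -1;

	out->ino = (unsigned long long)st.st_ino;
	out->dev = (unsigned long long)st.st_dev;
	return 0;
}
```

A caller would compare out->ino with the nsid value read back from the
BPF map, the same way the selftest compares knsid with st.st_ino.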






On Thu, Aug 08, 2019 at 05:09:51AM +0000, Yonghong Song wrote:
> 
> 
> On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> > The code has been modified to avoid syscalls that could sleep.
> > Please let me know if any other modification is needed.
> > 
> >  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> > From: Carlos <cneirabustos@gmail.com>
> > Date: Wed, 7 Aug 2019 20:04:30 -0400
> > Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
> >   from current task
> > 
> > This helper obtains the active namespace from current and returns pid, tgid,
> > device and namespace id as seen from that namespace, allowing to instrument
> > a process inside a container.
> > Device is read from /proc/self/ns/pid, as in the future it's possible that
> > different pid_ns files may belong to different devices, according
> > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> > conference.
> > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> > scripts but this helper returns the pid as seen by the root namespace which is
> > fine when a bcc script is not executed inside a container.
> > When the process of interest is inside a container, pid filtering will not work
> > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> > returning the pid as it's seen by the current namespace where the script is
> > executing.
> > 
> > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> > used to do pid filtering even inside a container.
> > 
> > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> > 
> >          u32 pid = bpf_get_current_pid_tgid() >> 32;
> >          if (pid != <pid_arg_passed_in>)
> >                  return 0;
> > Could be modified to use bpf_get_current_pidns_info() as follows:
> > 
> >          struct bpf_pidns_info pidns;
> >          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns_info));
> >          u32 pid = pidns.tgid;
> >          u32 nsid = pidns.nsid;
> >          if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
> >                  return 0;
> > 
> > To find out the PID namespace id of a process, you could use this command:
> > 
> > $ ps -h -o pidns -p <pid_of_interest>
> > 
> > Or this other command:
> > 
> > $ ls -Li /proc/<pid_of_interest>/ns/pid
> > 
> > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > ---
> >   fs/namei.c                                         |   2 +-
> >   include/linux/bpf.h                                |   1 +
> >   include/linux/namei.h                              |   4 +
> >   include/uapi/linux/bpf.h                           |  29 ++++-
> >   kernel/bpf/core.c                                  |   1 +
> >   kernel/bpf/helpers.c                               |  78 ++++++++++++
> >   kernel/trace/bpf_trace.c                           |   2 +
> >   samples/bpf/Makefile                               |   3 +
> >   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> >   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> >   tools/include/uapi/linux/bpf.h                     |  29 ++++-
> >   tools/testing/selftests/bpf/Makefile               |   2 +-
> >   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> >   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> >   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> >   15 files changed, 418 insertions(+), 4 deletions(-)
> >   create mode 100644 samples/bpf/trace_ns_info_user.c
> >   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> >   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..d1eca36972d2 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -19,7 +19,6 @@
> >   #include <linux/export.h>
> >   #include <linux/kernel.h>
> >   #include <linux/slab.h>
> > -#include <linux/fs.h>
> >   #include <linux/namei.h>
> >   #include <linux/pagemap.h>
> >   #include <linux/fsnotify.h>
> > @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> >   	putname(name);
> >   	return retval;
> >   }
> > +EXPORT_SYMBOL(filename_lookup);
> 
> No need to export symbols. bpf uses it and bpf is in the core, not in 
> modules.
> 
> >   
> >   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> >   static int path_parentat(struct nameidata *nd, unsigned flags,
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f9a506147c8a..e4adf5e05afd 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> >   extern const struct bpf_func_proto bpf_strtol_proto;
> >   extern const struct bpf_func_proto bpf_strtoul_proto;
> >   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> >   
> >   /* Shared helpers among cBPF and eBPF. */
> >   void bpf_user_rnd_init_once(void);
> > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > index 9138b4471dbf..2c24e8c71d46 100644
> > --- a/include/linux/namei.h
> > +++ b/include/linux/namei.h
> > @@ -6,6 +6,7 @@
> >   #include <linux/path.h>
> >   #include <linux/fcntl.h>
> >   #include <linux/errno.h>
> > +#include <linux/fs.h>
> >   
> >   enum { MAX_NESTED_LINKS = 8 };
> >   
> > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> >   
> >   extern void nd_jump_link(struct path *path);
> >   
> > +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> > +		    struct path *path, struct path *root);
> 
> The previous definition in fs/internal.h should be removed.
> 
> > +
> >   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> >   {
> >   	((char *) name)[min(len, maxlen)] = '\0';
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4393bd4b2419..6f601f7106e2 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2741,6 +2741,26 @@ union bpf_attr {
> >    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> >    *
> >    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > + *
> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > + *	Description
> > + *		Copies into *pidns* pid, namespace id and tgid as seen by the
> > + *		current namespace and also device from /proc/self/ns/pid.
> > + *		*size_of_pidns* must be the size of *pidns*
> > + *
> > + *		This helper is used when pid filtering is needed inside a
> > + *		container as bpf_get_current_tgid() helper returns always the
> > + *		pid id as seen by the root namespace.
> > + *	Return
> > + *		0 on success
> > + *
> > + *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
> > + *		Or if size_of_pidns is not valid.
> 
> Maybe reword by following the code sequence.
>     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
>     the current task.
> 
> > + *
> > + *		**-ENOMEM**  if allocation fails.
> 
> Maybe some other error codes in filename_lookup() function?
> 
> > + *
> > + *		If unable to get the inode from /proc/self/ns/pid an error code
> > + *		will be returned.
> 
> You do not need this. The description of error code cases should cover this.
> 
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)		\
> >   	FN(unspec),			\
> > @@ -2853,7 +2873,8 @@ union bpf_attr {
> >   	FN(sk_storage_get),		\
> >   	FN(sk_storage_delete),		\
> >   	FN(send_signal),		\
> > -	FN(tcp_gen_syncookie),
> > +	FN(tcp_gen_syncookie),		\
> > +	FN(get_current_pidns_info),
> >   
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >    * function eBPF program intends to call
> > @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
> >   	__s32	retval;
> >   };
> >   
> > +struct bpf_pidns_info {
> > +	__u32 dev;
> > +	__u32 nsid;
> > +	__u32 tgid;
> > +	__u32 pid;
> > +};
> >   #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 8191a7db2777..3159f2a0188c 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> >   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> >   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> >   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto __weak;
> >   
> >   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> >   {
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 5e28718928ca..571f24077db2 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -11,6 +11,12 @@
> >   #include <linux/uidgid.h>
> >   #include <linux/filter.h>
> >   #include <linux/ctype.h>
> > +#include <linux/pid_namespace.h>
> > +#include <linux/major.h>
> > +#include <linux/stat.h>
> > +#include <linux/namei.h>
> > +#include <linux/version.h>
> > +
> >   
> >   #include "../../lib/kstrtox.h"
> >   
> > @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> >   	preempt_enable();
> >   }
> >   
> > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > +	 size)
> > +{
> > +	const char *name = "/proc/self/ns/pid";
> 
> maybe rename this variable to pidns_path?
> 
> > +	struct pid_namespace *pidns = NULL;
> > +	struct filename *tmp = NULL;
> 
> Maybe rename this variable to name?
> 
> > +	int len = strlen(name) + 1;
> 
> We can delay this assignment later until it is needed.
> 
> > +	struct inode *inode;
> > +	struct path kp;
> > +	pid_t tgid = 0;
> > +	pid_t pid = 0;
> > +	int ret;
> > +
> > +	if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > +		return -EINVAL;
> > +
> > +	pidns = task_active_pid_ns(current);
> > +
> 
> we can save an empty line here.
> 
> > +	if (unlikely(!pidns))
> > +		goto clear;
> > +
> > +	pidns_info->nsid =  pidns->ns.inum;
> > +	pid = task_pid_nr_ns(current, pidns);
> > +
> 
> We can save an empty line here.
> 
> > +	if (unlikely(!pid))
> > +		goto clear;
> > +
> > +	tgid = task_tgid_nr_ns(current, pidns);
> > +
> ditto. save an empty line.
> > +	if (unlikely(!tgid))
> > +		goto clear;
> > +
> > +	pidns_info->tgid = (u32) tgid;
> > +	pidns_info->pid = (u32) pid;
> > +
> > +	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> > +	if (unlikely(!tmp)) {
> > +		memset((void *)pidns_info, 0, (size_t) size);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	memcpy((char *)tmp->name, name, len);
> > +	tmp->uptr = NULL;
> > +	tmp->aname = NULL;
> > +	tmp->refcnt = 1;
> > +
> ditto. save an empty line.
> > +	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> > +
> ditto. save an empty line.
> > +	if (ret) {
> > +		memset((void *)pidns_info, 0, (size_t) size);
> > +		return ret;
> > +	}
> > +
> > +	inode = d_backing_inode(kp.dentry);
> > +	pidns_info->dev = inode->i_sb->s_dev;
> > +
> > +	return 0;
> > +
> > +clear:
> > +	memset((void *)pidns_info, 0, (size_t) size);
> > +
> save an empty line.
> > +	return -EINVAL;
> > +}
> > +
> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> > +	.func	= bpf_get_current_pidns_info,
> make the "= " aligned with others?
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> > +	.arg2_type	= ARG_CONST_SIZE,
> > +};
> > +
> >   #ifdef CONFIG_CGROUPS
> >   BPF_CALL_0(bpf_get_current_cgroup_id)
> >   {
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index ca1255d14576..5e1dc22765a5 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >   #endif
> >   	case BPF_FUNC_send_signal:
> >   		return &bpf_send_signal_proto;
> > +	case BPF_FUNC_get_current_pidns_info:
> > +		return &bpf_get_current_pidns_info_proto;
> >   	default:
> >   		return NULL;
> >   	}
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 1d9be26b4edd..238453ff27d2 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> >   hostprogs-y += xdp_sample_pkts
> >   hostprogs-y += ibumad
> >   hostprogs-y += hbm
> > +hostprogs-y += trace_ns_info
> [...]

^ permalink raw reply related

* Re: [v3,1/4] tools: bpftool: add net attach command to attach XDP on interface
From: Jakub Kicinski @ 2019-08-08 17:49 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <CAEKGpzj1VKWuWioEmRkNXrgfDdT-KkWZWsrbY+p=yyK8sPctwg@mail.gmail.com>

On Thu, 8 Aug 2019 07:15:22 +0900, Daniel T. Lee wrote:
> > > +             return -EINVAL;
> > > +     }
> > > +
> > > +     NEXT_ARG();  
> >
> > nit: the new line should be before NEXT_ARG(), IOW NEXT_ARG() belongs
> > to the code which consumed the argument
> >  
> 
> I'm not sure I'm following.
> Are you saying that the newline here shouldn't be necessary?

I mean this is better:

	if (!is_prefix(*argv, "bla-bla"))
		return -EINVAL;
	NEXT_ARG();

	if (!is_prefix(*argv, "bla-bla"))
		return -EINVAL;
	NEXT_ARG();

Than this:

	if (!is_prefix(*argv, "bla-bla"))
		return -EINVAL;

	NEXT_ARG();
	if (!is_prefix(*argv, "bla-bla"))
		return -EINVAL;

	NEXT_ARG();

Because the NEXT_ARG() "belongs" to the code that "consumed" the option.

So instead of this:

     attach_type = parse_attach_type(*argv);
     if (attach_type == max_net_attach_type) {
             p_err("invalid net attach/detach type");  
             return -EINVAL;
     }

     NEXT_ARG();  
     progfd = prog_parse_fd(&argc, &argv);
     if (progfd < 0)
             return -EINVAL;

This seems more logical to me:

     attach_type = parse_attach_type(*argv);
     if (attach_type == max_net_attach_type) {
             p_err("invalid net attach/detach type");  
             return -EINVAL;
     }
     NEXT_ARG();  

     progfd = prog_parse_fd(&argc, &argv);
     if (progfd < 0)
             return -EINVAL;
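
The same grouping can be shown as a small self-contained parser. This is
only a sketch under stated assumptions: the NEXT_ARG() below is a
simplified stand-in for bpftool's macro (it does no usage handling),
and parse_attach_args() with its "xdp" keyword is a hypothetical example,
not code from the patch under review.

```c
#include <string.h>

/* Simplified stand-in: just advance to the next argument. */
#define NEXT_ARG() do { argc--; argv++; } while (0)

/*
 * Parse "<type> <name>". Each block checks one argument and then
 * consumes it, so NEXT_ARG() sits at the end of the block it belongs to.
 * Returns 0 on success, -1 on any parse error.
 */
static int parse_attach_args(int argc, char **argv,
			     const char **type, const char **name)
{
	if (argc < 1 || strcmp(*argv, "xdp") != 0)
		return -1;
	*type = *argv;
	NEXT_ARG();

	if (argc < 1)
		return -1;
	*name = *argv;
	NEXT_ARG();

	/* anything left over is treated as an error in this sketch */
	return argc == 0 ? 0 : -1;
}
```

With this layout each blank line separates one consumed argument from
the next, which is the reading order the nit asks for.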

^ permalink raw reply

* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Andrii Nakryiko @ 2019-08-08 17:53 UTC (permalink / raw)
  To: Greg KH
  Cc: Yonghong Song, Andrii Nakryiko, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Alexei Starovoitov, daniel@iogearbox.net,
	Kernel Team, Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa,
	Sam Ravnborg
In-Reply-To: <20190808060812.GA25150@kroah.com>

On Wed, Aug 7, 2019 at 11:08 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Aug 08, 2019 at 04:24:25AM +0000, Yonghong Song wrote:
> >
> >
> > On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> > > Make .BTF section allocated and expose its contents through sysfs.
>
> Was this original patch not on bpf@vger?  I can't find it in my
> archive.  Anyway...
>
> > > /sys/kernel/btf directory is created to contain all the BTFs present
> > > inside kernel. Currently there is only kernel's main BTF, represented as
> > > /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> > > each module will expose its BTF as /sys/kernel/btf/<module-name> file.
>
> Why are you using sysfs for this?  Who uses "BTF"s?  Are these debugging
> images that only people working on developing bpf programs are going to
> need, or are these things that you are going to need on a production
> system?

We need it on production systems. One immediate and direct use case is
BPF CO-RE (Compile Once - Run Everywhere), which aims to allow
pre-compiling BPF applications (even those that read internal kernel
structures) using any local kernel headers, and then distributing and
running them in binary form on all target production machines, without
depending on kernel headers or on having Clang on the target machine to
compile C to BPF IR. Libbpf does all those adjustments/relocations
based on the kernel's actual BTF. See [0] for a summary and slides, if
you're curious to learn more.

  [0] http://vger.kernel.org/bpfconf2019.html#session-2

>
> I ask as maybe debugfs is the best place for this if they are not needed
> on production systems.
>
>
> > >
> > > Current approach relies on a few pieces coming together:
> > > 1. pahole is used to take almost final vmlinux image (modulo .BTF and
> > >     kallsyms) and generate .BTF section by converting DWARF info into
> > >     BTF. This section is not allocated and not mapped to any segment,
> > >     though, so is not yet accessible from inside kernel at runtime.
> > > 2. objcopy dumps .BTF contents into a binary file and subsequently
> > >     converts the binary file into a linkable object file with automatically
> > >     generated symbols _binary__btf_kernel_bin_start and
> > >     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
> > >     of BTF raw data.
> > > 3. final vmlinux image is generated by linking this object file (and
> > >     kallsyms, if necessary). sysfs_btf.c then creates
> > >     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
> > >     it. This allows, e.g., libbpf and bpftool to access BTF info at a
> > >     well-known location, without resorting to searching for vmlinux image
> > >     on disk (location of which is not standardized and vmlinux image
> > >     might not be even available in some scenarios, e.g., inside qemu
> > >     during testing).
> > >
> > > Alternative approach using .incbin assembler directive to embed BTF
> > > contents directly was attempted but didn't work, because sysfs_btf.o is
> > > not re-compiled during link-vmlinux.sh stage. This is required, though,
> > > to update embedded BTF data (initially empty data is embedded, then
> > > pahole generates BTF info and we need to regenerate sysfs_btf.o with
> > > updated contents, but it's too late at that point).
> > >
> > > If BTF couldn't be generated due to missing or too old pahole,
> > > sysfs_btf.c handles that gracefully by detecting that
> > > _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> > > /sys/kernel/btf at all.
> > >
> > > v1->v2:
> > > - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> > >
> > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > Cc: Jiri Olsa <jolsa@kernel.org>
> > > Cc: Sam Ravnborg <sam@ravnborg.org>
> > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > ---
> > >   kernel/bpf/Makefile     |  3 +++
> > >   kernel/bpf/sysfs_btf.c  | 52 ++++++++++++++++++++++++++++++++++++++
> > >   scripts/link-vmlinux.sh | 55 +++++++++++++++++++++++++++--------------
> > >   3 files changed, 91 insertions(+), 19 deletions(-)
> > >   create mode 100644 kernel/bpf/sysfs_btf.c
>
> First rule, you can't create new sysfs files without a matching
> Documentation/ABI/ set of entries.  Please do that for the next version
> of this patch so we can properly check to see if what you are
> documenting lines up with the code.  Otherwise we just have to guess as
> to what the entries you are creating actually do.

Yep, sure, I wasn't aware, will add in v3.

>
> > >
> > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > index 29d781061cd5..e1d9adb212f9 100644
> > > --- a/kernel/bpf/Makefile
> > > +++ b/kernel/bpf/Makefile
> > > @@ -22,3 +22,6 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > >   ifeq ($(CONFIG_INET),y)
> > >   obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> > >   endif
> > > +ifeq ($(CONFIG_SYSFS),y)
> > > +obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
> > > +endif
> > > diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
> > > new file mode 100644
> > > index 000000000000..ac06ce1d62e8
> > > --- /dev/null
> > > +++ b/kernel/bpf/sysfs_btf.c
> > > @@ -0,0 +1,52 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Provide kernel BTF information for introspection and use by eBPF tools.
> > > + */
> > > +#include <linux/kernel.h>
> > > +#include <linux/module.h>
> > > +#include <linux/kobject.h>
> > > +#include <linux/init.h>
> > > +
> > > +/* See scripts/link-vmlinux.sh, gen_btf() func for details */
> > > +extern char __weak _binary__btf_kernel_bin_start[];
> > > +extern char __weak _binary__btf_kernel_bin_end[];
> > > +
> > > +static ssize_t
> > > +btf_kernel_read(struct file *file, struct kobject *kobj,
> > > +           struct bin_attribute *bin_attr,
> > > +           char *buf, loff_t off, size_t len)
> > > +{
> > > +   memcpy(buf, _binary__btf_kernel_bin_start + off, len);
> > > +   return len;
> > > +}
> > > +
> > > +static struct bin_attribute btf_kernel_attr __ro_after_init = {
> > > +   .attr = {
> > > +           .name = "kernel",
> > > +           .mode = 0444,
> > > +   },
> > > +   .read = btf_kernel_read,
> > > +};
>
> BIN_ATTR_RO()?

Ok, will use that.

>
> > > +
> > > +static struct bin_attribute *btf_attrs[] __ro_after_init = {
> > > +   &btf_kernel_attr,
> > > +   NULL,
> > > +};
> > > +
> > > +static struct attribute_group btf_group_attr __ro_after_init = {
> > > +   .name = "btf",
> > > +   .bin_attrs = btf_attrs,
> > > +};
> > > +
> > > +static int __init btf_kernel_init(void)
> > > +{
> > > +   if (!_binary__btf_kernel_bin_start)
> > > +           return 0;
> > > +
> > > +   btf_kernel_attr.size = _binary__btf_kernel_bin_end -
> > > +                          _binary__btf_kernel_bin_start;
> > > +
> > > +   return sysfs_create_group(kernel_kobj, &btf_group_attr);
>
> You are nesting directories here without a "real" kobject in the middle.
> Are you _sure_ you want to do that?  It's going to get really tricky
> later on based on your comments above about creating multiple files in
> that directory over time once "modules" are allowed.

My thinking was that when we have BTF for modules, I'll need to do
some code adjustments anyway, at which point it will be more clear how
we want to structure that. But I can add explicit kobject as static
variable right now, no problems. Later on we probably will just switch
it to be exported, so that modules can self-register/unregister their
BTFs autonomously.

>
> thanks,
>
> greg k-h
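
A tool that reads the blob exposed by this patch (at /sys/kernel/btf/kernel,
per the description above) would probably want to sanity-check it before
parsing. The following is a minimal sketch of such a check, not what
libbpf or bpftool actually do. BTF_MAGIC and the leading header fields
follow include/uapi/linux/btf.h; struct btf_header_prefix and
btf_blob_check() are hypothetical names, and the sketch assumes the blob
is in the reader's native byte order.

```c
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F	/* same value as in include/uapi/linux/btf.h */

/* Only the leading fields of struct btf_header; hypothetical local copy. */
struct btf_header_prefix {
	uint16_t magic;
	uint8_t version;
	uint8_t flags;
	uint32_t hdr_len;
};

/*
 * Return 0 if buf starts with a plausible BTF header, -1 otherwise.
 * Only the magic and the declared header length are checked.
 */
static int btf_blob_check(const void *buf, size_t len)
{
	struct btf_header_prefix hdr;

	if (len < sizeof(hdr))
		return -1;
	/* copy out instead of casting, so alignment of buf does not matter */
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != BTF_MAGIC)
		return -1;
	if (hdr.hdr_len < sizeof(hdr) || hdr.hdr_len > len)
		return -1;
	return 0;
}
```

A reader would call this on the first bytes read from the sysfs file
and fall back to other BTF sources when the check fails or the file is
absent, for example when pahole was not available at build time.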

^ permalink raw reply

* Re: [PATCH] pcan_usb_fd: zero out the common command buffer
From: David Miller @ 2019-08-08 18:03 UTC (permalink / raw)
  To: oneukum; +Cc: netdev, wg, mkl, linux-can
In-Reply-To: <20190808092825.23470-1-oneukum@suse.com>

From: Oliver Neukum <oneukum@suse.com>
Date: Thu,  8 Aug 2019 11:28:25 +0200

> Lest we leak kernel memory to a device we better zero out buffers.
> 
> Reported-by: syzbot+513e4d0985298538bf9b@syzkaller.appspotmail.com
> Signed-off-by: Oliver Neukum <oneukum@suse.com>

Please CC: the CAN subsystem maintainers, as this is clearly listed in the
MAINTAINERS file.

Thank you.

> ---
>  drivers/net/can/usb/peak_usb/pcan_usb_fd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_fd.c b/drivers/net/can/usb/peak_usb/pcan_usb_fd.c
> index 34761c3a6286..47cc1ff5b88e 100644
> --- a/drivers/net/can/usb/peak_usb/pcan_usb_fd.c
> +++ b/drivers/net/can/usb/peak_usb/pcan_usb_fd.c
> @@ -841,7 +841,7 @@ static int pcan_usb_fd_init(struct peak_usb_device *dev)
>  			goto err_out;
>  
>  		/* allocate command buffer once for all for the interface */
> -		pdev->cmd_buffer_addr = kmalloc(PCAN_UFD_CMD_BUFFER_SIZE,
> +		pdev->cmd_buffer_addr = kzalloc(PCAN_UFD_CMD_BUFFER_SIZE,
>  						GFP_KERNEL);
>  		if (!pdev->cmd_buffer_addr)
>  			goto err_out_1;
> -- 
> 2.16.4
> 

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Jonathan Lemon @ 2019-08-08 18:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Jiri Pirko, dsahern, netdev, davem, mlxsw,
	f.fainelli, vivien.didelot, mkubecek, stephen, daniel, brouer,
	eric.dumazet
In-Reply-To: <20190806190637.GE17072@lunn.ch>



On 6 Aug 2019, at 12:06, Andrew Lunn wrote:

> On Tue, Aug 06, 2019 at 11:54:49AM -0700, Jakub Kicinski wrote:
>> On Tue, 6 Aug 2019 20:38:41 +0200, Jiri Pirko wrote:
>>>>> So the proposal is to have some new device, say "kernelnet", that
>>>>> would implicitly create per-namespace devlink instance. This devlink
>>>>> instance would be used to setup resource limits. Like:
>>>>>
>>>>> devlink resource set kernelnet path /IPv4/fib size 96
>>>>> devlink -N ns1name resource set kernelnet path /IPv6/fib size 100
>>>>> devlink -N ns2name resource set kernelnet path /IPv4/fib-rules size 8
>>>>>
>>>>> To me it sounds a bit odd for kernel namespace to act as a device, but
>>>>> thinking about it more, it makes sense. Probably better than to define
>>>>> a new api. User would use the same tool to work with kernel and hw.
>>>>>
>>>>> Also we can implement other devlink functionality, like dpipe.
>>>>> User would then have visibility of network pipeline, tables,
>>>>> utilization, etc. It is related to the resources too.
>>>>>
>>>>> What do you think?
>>>>
>>>> I'm no expert here but seems counter intuitive that device tables would
>>>> be aware of namespaces in the first place. Are we not reinventing
>>>> cgroup controllers based on a device API? IMHO from a perspective of
>>>> someone unfamiliar with routing offload this seems backwards :)
>>>
>>> Can we use cgroup for fib and other limitations instead?
>>
>> Not sure the question is to me, I don't feel particularly qualified,
>> I've never worked with VDCs or written a switch driver.. But I'd see
>> cgroups as a natural fit, and if I read Andrew's reply right so does
>> he..
>
> Hi Jakub
>
> I think there needs to be a clearly reasoned argument why cgroups is
> the wrong answer to this problem. I myself don't know enough to give
> that answer, but I can pose the question.
>
>      Andrew

For the example above, the first question would be why is the restriction
based on the number of entries instead of their memory footprint?  The
resource being consumed is memory, so I'd think that should be what is
monitored.

Quickly scanning the cgroups documentation, it seems there is a device
controller, so this isn't just process based.  ISTR that Larry Brakmo was
working on a network bandwidth limiter, which is controlled by cgroups.
-- 
Jonathan




^ permalink raw reply

* Re: [PATCH] zd1211rw: remove false assertion from zd_mac_clear()
From: David Miller @ 2019-08-08 18:05 UTC (permalink / raw)
  To: oneukum; +Cc: netdev, dsd, kune, linux-wireless, kvalo
In-Reply-To: <20190808093203.23752-1-oneukum@suse.com>

From: Oliver Neukum <oneukum@suse.com>
Date: Thu,  8 Aug 2019 11:32:03 +0200

> The function is called before the lock which is asserted was ever used.
> Just remove it.
> 
> Reported-by: syzbot+74c65761783d66a9c97c@syzkaller.appspotmail.com
> Signed-off-by: Oliver Neukum <oneukum@suse.com>

Please CC: the appropriate driver maintainers and mailing list as this
is clearly specified in the MAINTAINERS file.

Thank you.

> ---
>  drivers/net/wireless/zydas/zd1211rw/zd_mac.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/zydas/zd1211rw/zd_mac.c b/drivers/net/wireless/zydas/zd1211rw/zd_mac.c
> index da7e63fca9f5..a9999d10ae81 100644
> --- a/drivers/net/wireless/zydas/zd1211rw/zd_mac.c
> +++ b/drivers/net/wireless/zydas/zd1211rw/zd_mac.c
> @@ -223,7 +223,6 @@ void zd_mac_clear(struct zd_mac *mac)
>  {
>  	flush_workqueue(zd_workqueue);
>  	zd_chip_clear(&mac->chip);
> -	lockdep_assert_held(&mac->lock);
>  	ZD_MEMCLEAR(mac, sizeof(struct zd_mac));
>  }
>  
> -- 
> 2.16.4
> 

^ permalink raw reply

* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Greg KH @ 2019-08-08 18:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Andrii Nakryiko, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Alexei Starovoitov, daniel@iogearbox.net,
	Kernel Team, Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa,
	Sam Ravnborg
In-Reply-To: <CAEf4BzaWtumTrc7h1t3w8hA1L8mVo2Cm0B+eLSe4eSghFAu3iw@mail.gmail.com>

On Thu, Aug 08, 2019 at 10:53:44AM -0700, Andrii Nakryiko wrote:
> On Wed, Aug 7, 2019 at 11:08 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Aug 08, 2019 at 04:24:25AM +0000, Yonghong Song wrote:
> > >
> > >
> > > On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> > > > Make .BTF section allocated and expose its contents through sysfs.
> >
> > Was this original patch not on bpf@vger?  I can't find it in my
> > archive.  Anyway...
> >
> > > > /sys/kernel/btf directory is created to contain all the BTFs present
> > > > inside kernel. Currently there is only kernel's main BTF, represented as
> > > > /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> > > > each module will expose its BTF as /sys/kernel/btf/<module-name> file.
> >
> > Why are you using sysfs for this?  Who uses "BTF"s?  Are these debugging
> > images that only people working on developing bpf programs are going to
> > need, or are these things that you are going to need on a production
> > system?
> 
> We need it on production systems. One immediate and direct use case is
> BPF CO-RE (Compile Once - Run Everywhere), which aims to allow
> pre-compiling BPF applications (even those that read internal kernel
> structures) using any local kernel headers, and then distributing and
> running them in binary form on all target production machines, without
> depending on kernel headers or on having Clang on the target machine to
> compile C to BPF IR. Libbpf does all those adjustments/relocations
> based on the kernel's actual BTF. See [0] for a summary and slides, if
> you're curious to learn more.
> 
>   [0] http://vger.kernel.org/bpfconf2019.html#session-2

Ok, then a binary sysfs file is fine, no objection from me.

> > I ask as maybe debugfs is the best place for this if they are not needed
> > on production systems.
> >
> >
> > > >
> > > > Current approach relies on a few pieces coming together:
> > > > 1. pahole is used to take almost final vmlinux image (modulo .BTF and
> > > >     kallsyms) and generate .BTF section by converting DWARF info into
> > > >     BTF. This section is not allocated and not mapped to any segment,
> > > >     though, so is not yet accessible from inside kernel at runtime.
> > > > > 2. objcopy dumps .BTF contents into binary file and subsequently
> > > > >     converts binary file into linkable object file with automatically
> > > >     generated symbols _binary__btf_kernel_bin_start and
> > > >     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
> > > >     of BTF raw data.
> > > > 3. final vmlinux image is generated by linking this object file (and
> > > >     kallsyms, if necessary). sysfs_btf.c then creates
> > > >     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
> > > > >     it. This allows, e.g., libbpf and bpftool to access BTF info at
> > > >     well-known location, without resorting to searching for vmlinux image
> > > >     on disk (location of which is not standardized and vmlinux image
> > > >     might not be even available in some scenarios, e.g., inside qemu
> > > >     during testing).
> > > >
> > > > Alternative approach using .incbin assembler directive to embed BTF
> > > > > contents directly was attempted but didn't work, because sysfs_btf.o is
> > > > not re-compiled during link-vmlinux.sh stage. This is required, though,
> > > > to update embedded BTF data (initially empty data is embedded, then
> > > > pahole generates BTF info and we need to regenerate sysfs_btf.o with
> > > > updated contents, but it's too late at that point).
> > > >
> > > > If BTF couldn't be generated due to missing or too old pahole,
> > > > sysfs_btf.c handles that gracefully by detecting that
> > > > _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> > > > /sys/kernel/btf at all.
> > > >
> > > > v1->v2:
> > > > - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> > > >
> > > > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > > > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > Cc: Jiri Olsa <jolsa@kernel.org>
> > > > Cc: Sam Ravnborg <sam@ravnborg.org>
> > > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > > ---
> > > >   kernel/bpf/Makefile     |  3 +++
> > > >   kernel/bpf/sysfs_btf.c  | 52 ++++++++++++++++++++++++++++++++++++++
> > > >   scripts/link-vmlinux.sh | 55 +++++++++++++++++++++++++++--------------
> > > >   3 files changed, 91 insertions(+), 19 deletions(-)
> > > >   create mode 100644 kernel/bpf/sysfs_btf.c
> >
> > First rule, you can't create new sysfs files without a matching
> > Documentation/ABI/ set of entries.  Please do that for the next version
> > of this patch so we can properly check to see if what you are
> > documenting lines up with the code.  Otherwise we just have to guess as
> > to what the entries you are creating actually do.
> 
> Yep, sure, I wasn't aware, will add in v3.

thanks.

> > > > +static int __init btf_kernel_init(void)
> > > > +{
> > > > +   if (!_binary__btf_kernel_bin_start)
> > > > +           return 0;
> > > > +
> > > > +   btf_kernel_attr.size = _binary__btf_kernel_bin_end -
> > > > +                          _binary__btf_kernel_bin_start;
> > > > +
> > > > +   return sysfs_create_group(kernel_kobj, &btf_group_attr);
> >
> > You are nesting directories here without a "real" kobject in the middle.
> > Are you _sure_ you want to do that?  It's going to get really tricky
> > later on based on your comments above about creating multiple files in
> > that directory over time once "modules" are allowed.
> 
> My thinking was that when we have BTF for modules, I'll need to do
> some code adjustments anyway, at which point it will be more clear how
> we want to structure that. But I can add explicit kobject as static
> variable right now, no problems. Later on we probably will just switch
> it to be exported, so that modules can self-register/unregister their
> BTFs autonomously.

A "real" kobject to start with here would probably be best.  Keeps
things simpler later as well.

thanks,

greg k-h
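
For readers following the thread, here is a minimal userspace sketch of how a consumer might sanity-check a blob read from the binary sysfs file discussed above (/sys/kernel/btf/kernel, as proposed in this v2 patch). The helper name demo_btf_blob_plausible and the demo_ struct are hypothetical and not part of libbpf or the patch. Only the magic value 0xeB9F and the leading header fields follow the BTF UAPI header, and the check assumes the blob is in the host's native byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BTF_MAGIC 0xeB9F	/* same value as BTF_MAGIC in the UAPI header */

/* Leading fields of the BTF header; the real header continues with
 * type/string section offsets, which this sketch does not inspect.
 */
struct demo_btf_header_prefix {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
};

/* Returns true if buf plausibly starts with a native-endian BTF header. */
static bool demo_btf_blob_plausible(const void *buf, size_t len)
{
	struct demo_btf_header_prefix hdr;

	if (len < sizeof(hdr))
		return false;

	/* memcpy avoids unaligned access on the caller's buffer. */
	memcpy(&hdr, buf, sizeof(hdr));

	return hdr.magic == DEMO_BTF_MAGIC && hdr.hdr_len <= len;
}
```

A caller would read the whole sysfs file into a buffer with ordinary file I/O and pass it to the helper before handing it to a real BTF parser. The check is only a cheap early rejection, not a validation of the type data.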

^ permalink raw reply

* Re: [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Willem de Bruijn @ 2019-08-08 18:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: John Fastabend, David Miller, Network Development, davejwatson,
	borisp, aviadye, Daniel Borkmann, Eric Dumazet,
	Alexei Starovoitov, oss-drivers
In-Reply-To: <20190808103148.164bec9f@cakuba.netronome.com>

On Thu, Aug 8, 2019 at 1:32 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Thu, 8 Aug 2019 11:59:18 -0400, Willem de Bruijn wrote:
> > > diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> > > index 7c0b2b778703..43922d86e510 100644
> > > --- a/net/tls/tls_device.c
> > > +++ b/net/tls/tls_device.c
> > > @@ -373,9 +373,9 @@ static int tls_push_data(struct sock *sk,
> > >         struct tls_context *tls_ctx = tls_get_ctx(sk);
> > >         struct tls_prot_info *prot = &tls_ctx->prot_info;
> > >         struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
> > > -       int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> > >         int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> > >         struct tls_record_info *record = ctx->open_record;
> > > +       int tls_push_record_flags;
> > >         struct page_frag *pfrag;
> > >         size_t orig_size = size;
> > >         u32 max_open_record_len;
> > > @@ -390,6 +390,9 @@ static int tls_push_data(struct sock *sk,
> > >         if (sk->sk_err)
> > >                 return -sk->sk_err;
> > >
> > > +       flags |= MSG_SENDPAGE_DECRYPTED;
> > > +       tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> > > +
> >
> > Without being too familiar with this code: can this plaintext flag be
> > set once, closer to the call to do_tcp_sendpages, in tls_push_sg?
> >
> > Instead of two locations with multiple non-trivial codepaths between
> > them and do_tcp_sendpages.
> >
> > Or are there paths where the flag is not set? Which I imagine would
> > imply already passing s/w encrypted ciphertext.
>
> tls_push_sg() is shared with sw path which doesn't have the device
> validation.
>
> Device TLS can reach tls_push_sg() via tls_push_partial_record() and
> tls_push_data(). tls_push_data() is addressed directly here,
> tls_push_partial_record() is again shared with SW path, so we have to
> address it by adding the flag in tls_device_write_space().
>
> The alternative is to add a conditional to tls_push_sg() which is
> a little less nice from performance and layering PoV but it is a lot
> simpler..
>
> Should I change?

Not at all. Thanks for the detailed explanation. That answered my last question.

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Heiner Kallweit @ 2019-08-08 18:17 UTC (permalink / raw)
  To: Holger Hoffstätte, Realtek linux nic maintainers,
	David Miller
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
In-Reply-To: <cfb9a1c7-57c8-db04-1081-ac1cb92bb447@applied-asynchrony.com>

On 08.08.2019 17:53, Holger Hoffstätte wrote:
> On 8/8/19 4:37 PM, Holger Hoffstätte wrote:
>>
>> Hello Heiner -
>>
>> On 7/28/19 11:25 AM, Heiner Kallweit wrote:
>>> There was a previous attempt to use xmit_more, but the change had to be
>>> reverted because under load sometimes a transmit timeout occurred [0].
>>> Maybe this was caused by a missing memory barrier, the new attempt
>>> keeps the memory barrier before the call to netif_stop_queue like it
>>> is used by the driver as of today. The new attempt also changes the
>>> order of some calls as suggested by Eric.
>>>
>>> [0] https://lkml.org/lkml/2019/2/10/39
>>>
>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>
>> I decided to take one for the team and merged this into my 5.2.x tree (just
>> fixing up the path) and it has been working fine for the last 2 weeks in two
>> machines..until today, when for the first time in forever some random NFS traffic
>> made this old friend come out from under the couch:
>>
>> [Aug 8 14:13] ------------[ cut here ]------------
>> [  +0.000006] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
>> [  +0.000021] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x21f/0x230
>> [  +0.000001] Modules linked in: lz4 lz4_compress lz4_decompress nfsd auth_rpcgss oid_registry lockd grace sunrpc sch_fq_codel btrfs xor zstd_compress raid6_pq zstd_decompress bfq jitterentropy_rng nct6775 hwmon_vid coretemp hwmon x86_pkg_temp_thermal aesni_intel aes_x86_64 i915 glue_helper crypto_simd cryptd i2c_i801 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper syscopyarea usbhid sysfillrect r8169 sysimgblt fb_sys_fops realtek drm libphy drm_panel_orientation_quirks i2c_core video backlight mq_deadline
>> [  +0.000026] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.2.7 #1
>> [  +0.000001] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>> [  +0.000004] RIP: 0010:dev_watchdog+0x21f/0x230
>> [  +0.000002] Code: 3b 00 75 ea eb ad 4c 89 ef c6 05 1c 45 bd 00 01 e8 66 35 fc ff 44 89 e1 4c 89 ee 48 c7 c7 e8 5e fc 81 48 89 c2 e8 90 df 92 ff <0f> 0b eb 8e 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 66 66 66 90
>> [  +0.000002] RSP: 0018:ffffc90000118e68 EFLAGS: 00010286
>> [  +0.000002] RAX: 0000000000000000 RBX: ffff8887f7837600 RCX: 0000000000000303
>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff827a488c
>> [  +0.000001] RBP: ffff8887f9fbc440 R08: 0000000000000303 R09: 0000000000000003
>> [  +0.000001] R10: 000000000001004c R11: 0000000000000001 R12: 0000000000000000
>> [  +0.000009] R13: ffff8887f9fbc000 R14: ffffffff8173aa20 R15: dead000000000200
>> [  +0.000001] FS:  0000000000000000(0000) GS:ffff8887ff580000(0000) knlGS:0000000000000000
>> [  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  +0.000001] CR2: 00007f8d1c04d000 CR3: 0000000002209001 CR4: 00000000000606e0
>> [  +0.000000] Call Trace:
>> [  +0.000002]  <IRQ>
>> [  +0.000005]  call_timer_fn+0x2b/0x120
>> [  +0.000002]  expire_timers+0xa4/0x100
>> [  +0.000001]  run_timer_softirq+0x8c/0x170
>> [  +0.000002]  ? __hrtimer_run_queues+0x13a/0x290
>> [  +0.000003]  ? sched_clock_cpu+0xe/0x130
>> [  +0.000003]  __do_softirq+0xeb/0x2de
>> [  +0.000003]  irq_exit+0x9d/0xe0
>> [  +0.000002]  smp_apic_timer_interrupt+0x60/0x110
>> [  +0.000003]  apic_timer_interrupt+0xf/0x20
>> [  +0.000001]  </IRQ>
>> [  +0.000003] RIP: 0010:cpuidle_enter_state+0xad/0x930
>> [  +0.000001] Code: c5 66 66 66 66 90 31 ff e8 90 99 9e ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 39 08 00 00 31 ff e8 e7 26 a2 ff fb 45 85 e4 <0f> 88 34 02 00 00 49 63 cc 4c 2b 2c 24 48 8d 04 49 48 c1 e0 05 8b
>> [  +0.000000] RSP: 0018:ffffc9000008be50 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>> [  +0.000001] RAX: ffff8887ff5a9180 RBX: ffffffff822b6c40 RCX: 000000000000001f
>> [  +0.000001] RDX: 0000000000000000 RSI: 0000000033087154 RDI: 0000000000000000
>> [  +0.000001] RBP: ffff8887ff5b1310 R08: 000030d021fae397 R09: ffff8887ff59c8c0
>> [  +0.000000] R10: ffff8887ff59c8c0 R11: 0000000000000006 R12: 0000000000000004
>> [  +0.000001] R13: 000030d021fae397 R14: 0000000000000004 R15: ffff8887fc281600
>> [  +0.000001]  cpuidle_enter+0x29/0x40
>> [  +0.000002]  do_idle+0x1e5/0x280
>> [  +0.000001]  cpu_startup_entry+0x19/0x20
>> [  +0.000002]  start_secondary+0x186/0x1c0
>> [  +0.000001]  secondary_startup_64+0xa4/0xb0
>> [  +0.000001] ---[ end trace 99493c768580f4fd ]---
>>
>> The device is:
>>
>> Aug  7 23:19:09 tux kernel: libphy: r8169: probed
>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c9, IRQ 36
>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>> Aug  7 23:19:12 tux kernel: RTL8211E Gigabit Ethernet r8169-400:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-400:00, irq=IGNORE)
>> Aug  7 23:19:13 tux kernel: r8169 0000:04:00.0 eth0: No native access to PCI extended config space, falling back to CSI
>>
>> and using fq_codel, of course.
>>
>> This cpuidle hiccup used to be completely gone without xmit_more and this was
>> the first (and so far only) time since merging it (regardless of load).
>> Also, while I'm using BMQ as CPU scheduler, that hasn't made a difference for
>> this particular problem in the past (with MuQSS/PDS) either; way back when I had
>> Eric's previous attempt(s) it also hiccupped with CFS.
>>
>> Revert or wait for more reports when -next is merged in 5.4?
> 
> Another question/data point: I've had the whole basket of offloads activated:
> 
>   ethtool --offload eth0 rx on tx on gro on gso on sg on tso on
> 
> and this caused zero problems without the xmit_more patch. However I just saw
> that net-next has a patch where TSO is disabled due to a known HW defect in
> RTL8168evl, which is of course what I have. Could this be the reason for the
> stall/hiccup when xmit_more has its fingers in the pie? I kind of know what
> xmit_more does, just not how it could interact with a possibly broken TSO that
> nevertheless seems to work fine otherwise..
> 

I was about to ask exactly that, whether you have TSO enabled. I don't know what
can trigger the HW issue, it was just confirmed by Realtek that this chip version
has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
linux-next version.

> thanks
> Holger
> 
Heiner

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: John Hubbard @ 2019-08-08 18:18 UTC (permalink / raw)
  To: Weiny, Ira, Michal Hocko
  Cc: Jan Kara, Matthew Wilcox, Andrew Morton, Christoph Hellwig,
	Williams, Dan J, Dave Chinner, Dave Hansen, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx@lists.freedesktop.org,
	ceph-devel@vger.kernel.org, devel@driverdev.osuosl.org,
	devel@lists.orangefs.org, dri-devel@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-block@vger.kernel.org,
	linux-crypto@vger.kernel.org, linux-fbdev@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-media@vger.kernel.org,
	linux-mm@kvack.org, linux-nfs@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-rpi-kernel@lists.infradead.org,
	linux-xfs@vger.kernel.org, netdev@vger.kernel.org,
	rds-devel@oss.oracle.com, sparclinux@vger.kernel.org,
	x86@kernel.org, xen-devel@lists.xenproject.org
In-Reply-To: <2807E5FD2F6FDA4886F6618EAC48510E79E79644@CRSMSX101.amr.corp.intel.com>

On 8/8/19 9:25 AM, Weiny, Ira wrote:
>>
>> On 8/7/19 7:36 PM, Ira Weiny wrote:
>>> On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote:
>>>> On Wed 07-08-19 10:37:26, Jan Kara wrote:
>>>>> On Fri 02-08-19 12:14:09, John Hubbard wrote:
>>>>>> On 8/2/19 7:52 AM, Jan Kara wrote:
>>>>>>> On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
>>>>>>>> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
>>>>>>>>> On Fri 02-08-19 11:12:44, Michal Hocko wrote:
>>>>>>>>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
>>   [...]
> Yep I can do this.  I did not realize that Andrew had accepted any of this work.  I'll check out his tree.  But I don't think he is going to accept this series through his tree.  So what is the ETA on that landing in Linus' tree?
> 

I'd expect it to go into 5.4, according to my understanding of how
the release cycles are arranged.


> To that point I'm still not sure who would take all this as I am now touching mm, procfs, rdma, ext4, and xfs.
> 
> I just thought I would chime in with my progress because I'm to a point where things are working and so I can submit the code but I'm not sure what I can/should depend on landing...  Also, now that 0day has run overnight it has found issues with this rebase so I need to clean those up...  Perhaps I will base on Andrew's tree prior to doing that...

I'm certainly not the right person to answer, but in spite of that, I'd think
Andrew's tree is a reasonable place for it. Sort of.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH v3 0/2] dt-bindings: net: meson-dwmac: convert to yaml
From: David Miller @ 2019-08-08 18:20 UTC (permalink / raw)
  To: narmstrong
  Cc: robh+dt, martin.blumenstingl, devicetree, netdev, linux-amlogic,
	linux-arm-kernel, linux-kernel
In-Reply-To: <20190808114101.29982-1-narmstrong@baylibre.com>

From: Neil Armstrong <narmstrong@baylibre.com>
Date: Thu,  8 Aug 2019 13:40:59 +0200

> This patchset converts the Amlogic Meson DWMAC glue bindings over to
> YAML schemas using the already converted dwmac bindings.
> 
> The first patch is needed because the Amlogic glue needs a supplementary
> reg cell to access the DWMAC glue registers.
> 
> Changes since v2:
> - Added review tags
> - Updated allwinner,sun7i-a20-gmac.yaml reg maxItems

Where is this targeted to be merged, an ARM tree?  Or one of my
networking trees?


^ permalink raw reply

* Re: [Linux-decnet-user] [PATCH] Documentation: decnet: remove reference to CONFIG_DECNET_ROUTE_FWMARK
From: David Miller @ 2019-08-08 18:21 UTC (permalink / raw)
  To: emserrat
  Cc: clabbe, corbet, linux-doc, netdev, linux-decnet-user,
	linux-kernel, tgraf
In-Reply-To: <DM5PR22MB03797234267E8B37EA3080BBC4D70@DM5PR22MB0379.namprd22.prod.outlook.com>

From: Eduardo Marcelo Serrat <emserrat@hotmail.com>
Date: Thu, 8 Aug 2019 11:44:14 +0000

> Sorry for using the list for this purpose but we are looking for
> senior engineers with knowledge in OpenVMS/ Tru64 Unix, Solaris,
> HP-UX and of course Linux and familiar with virtualization
> technologies, specially cross platform emulators. We need to fill
> support engineer roles. If anybody interested for positions in the
> US / Europe please send me an email.

Please do not ever use the vger.kernel.org mailing lists for this kind
of solicitation.

It is completely inappropriate.

^ permalink raw reply

* Re: [PATCH v2 00/15] net: phy: adin: add support for Analog Devices PHYs
From: David Miller @ 2019-08-08 18:24 UTC (permalink / raw)
  To: alexandru.ardelean
  Cc: netdev, devicetree, linux-kernel, robh+dt, mark.rutland,
	f.fainelli, hkallweit1, andrew
In-Reply-To: <20190808123026.17382-1-alexandru.ardelean@analog.com>

From: Alexandru Ardelean <alexandru.ardelean@analog.com>
Date: Thu, 8 Aug 2019 15:30:11 +0300

> This changeset adds support for Analog Devices Industrial Ethernet PHYs.
> Particularly the PHYs this driver adds support for:
>  * ADIN1200 - Robust, Industrial, Low Power 10/100 Ethernet PHY
>  * ADIN1300 - Robust, Industrial, Low Latency 10/100/1000 Gigabit
>    Ethernet PHY
> 
> The 2 chips are pin & register compatible with one another. The main
> difference being that ADIN1200 doesn't operate in gigabit mode.
> 
> The chips can be operated by the Generic PHY driver as well via the
> standard IEEE PHY registers (0x0000 - 0x000F) which are supported by the
> kernel as well. This assumes that configuration of the PHY has been done
> completely in HW, according to spec, i.e. no extra SW configuration
> required.
> 
> This changeset also implements the ability to configure the chips via SW
> registers.
> 
> Datasheets:
>   https://www.analog.com/media/en/technical-documentation/data-sheets/ADIN1300.pdf
>   https://www.analog.com/media/en/technical-documentation/data-sheets/ADIN1200.pdf
> 
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>

I think, at a minimum, the c22 vs. c45 issues need to be discussed more,
and even if no code changes occur, there are definitely some adjustments
and clarifications that need to occur on this issue in the commit
messages and/or documentation.

^ permalink raw reply

* Re: [PATCH 0/2] pull request for net: batman-adv 2019-08-08
From: David Miller @ 2019-08-08 18:26 UTC (permalink / raw)
  To: sw; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <20190808130208.2124-1-sw@simonwunderlich.de>

From: Simon Wunderlich <sw@simonwunderlich.de>
Date: Thu,  8 Aug 2019 15:02:06 +0200

> here are some bugfixes which we would like to have integrated into net.
> 
> Please pull or let me know of any problem!

Pulled.

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Martin Lau @ 2019-08-08 18:27 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190808152830.GC2820@mini-arch>

On Thu, Aug 08, 2019 at 08:28:30AM -0700, Stanislav Fomichev wrote:
> On 08/08, Martin Lau wrote:
> > On Wed, Aug 07, 2019 at 08:47:18AM -0700, Stanislav Fomichev wrote:
> > > Add new helper bpf_sk_storage_clone which optionally clones sk storage
> > > > and call it from sk_clone_lock. Reuse the gap in
> > > bpf_sk_storage_elem to store clone/non-clone flag.
> > > 
> > > Cc: Martin KaFai Lau <kafai@fb.com>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/net/bpf_sk_storage.h |  10 ++++
> > >  include/uapi/linux/bpf.h     |   1 +
> > >  net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
> > >  net/core/sock.c              |   9 ++--
> > >  4 files changed, 115 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> > > index b9dcb02e756b..8e4f831d2e52 100644
> > > --- a/include/net/bpf_sk_storage.h
> > > +++ b/include/net/bpf_sk_storage.h
> > > @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
> > >  extern const struct bpf_func_proto bpf_sk_storage_get_proto;
> > >  extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
> > >  
> > > +#ifdef CONFIG_BPF_SYSCALL
> > > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> > > +#else
> > > +static inline int bpf_sk_storage_clone(const struct sock *sk,
> > > +				       struct sock *newsk)
> > > +{
> > > +	return 0;
> > > +}
> > > +#endif
> > > +
> > >  #endif /* _BPF_SK_STORAGE_H */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 4393bd4b2419..00459ca4c8cf 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -2931,6 +2931,7 @@ enum bpf_func_id {
> > >  
> > >  /* BPF_FUNC_sk_storage_get flags */
> > >  #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
> > > +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
> > It is only used in bpf_sk_storage_get().
> > What if the elem is created from bpf_fd_sk_storage_update_elem()
> > i.e. from the syscall API ?
> > 
> > What may be the use case for a map to have both CLONE and non-CLONE
> > elements?  If it is not the case, would it be better to add
> > BPF_F_CLONE to bpf_attr->map_flags?
> I didn't think about putting it on the map itself since the API
> is on a per-element, but it does make sense. I can't come up
> with a use-case for a per-element selective clone/non-clone.
> Thanks, will move to the map itself.
> 
> > >  
> > >  /* Mode for BPF_FUNC_skb_adjust_room helper. */
> > >  enum bpf_adj_room_mode {
> > > diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> > > index 94c7f77ecb6b..b6dea67965bc 100644
> > > --- a/net/core/bpf_sk_storage.c
> > > +++ b/net/core/bpf_sk_storage.c
> > > @@ -12,6 +12,9 @@
> > >  
> > >  static atomic_t cache_idx;
> > >  
> > > +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
> > > +					 BPF_SK_STORAGE_GET_F_CLONE)
> > > +
> > >  struct bucket {
> > >  	struct hlist_head list;
> > >  	raw_spinlock_t lock;
> > > @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
> > >  	struct hlist_node snode;	/* Linked to bpf_sk_storage */
> > >  	struct bpf_sk_storage __rcu *sk_storage;
> > >  	struct rcu_head rcu;
> > > -	/* 8 bytes hole */
> > > +	u8 clone:1;
> > > +	/* 7 bytes hole */
> > >  	/* The data is stored in aother cacheline to minimize
> > >  	 * the number of cachelines access during a cache hit.
> > >  	 */
> > > @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
> > >  	return 0;
> > >  }
> > >  
> > > -/* Called by __sk_destruct() */
> > > +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
> > >  void bpf_sk_storage_free(struct sock *sk)
> > >  {
> > >  	struct bpf_sk_storage_elem *selem;
> > > @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
> > >  	return err;
> > >  }
> > >  
> > > +static struct bpf_sk_storage_elem *
> > > +bpf_sk_storage_clone_elem(struct sock *newsk,
> > > +			  struct bpf_sk_storage_map *smap,
> > > +			  struct bpf_sk_storage_elem *selem)
> > > +{
> > > +	struct bpf_sk_storage_elem *copy_selem;
> > > +
> > > +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> > > +	if (!copy_selem)
> > > +		return ERR_PTR(-ENOMEM);
> > nit.
> > may be just return NULL as selem_alloc() does.
> Sounds good.
> 
> > > +
> > > +	if (map_value_has_spin_lock(&smap->map))
> > > +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> > > +				      SDATA(selem)->data, true);
> > > +	else
> > > +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> > > +			       SDATA(selem)->data);
> > > +
> > > +	return copy_selem;
> > > +}
> > > +
> > > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > > +{
> > > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > > +	struct bpf_sk_storage *sk_storage;
> > > +	struct bpf_sk_storage_elem *selem;
> > > +	int ret;
> > > +
> > > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > +
> > > +	rcu_read_lock();
> > > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > +
> > > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > > +		goto out;
> > > +
> > > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > > +		struct bpf_sk_storage_map *smap;
> > > +		struct bpf_sk_storage_elem *copy_selem;
> > > +
> > > +		if (!selem->clone)
> > > +			continue;
> > > +
> > > +		smap = rcu_dereference(SDATA(selem)->smap);
> > > +		if (!smap)
> > smap should not be NULL.
> I see; you never set it back to NULL and we are guaranteed that the
> map is still around due to rcu. Removed.
> 
> > > +			continue;
> > > +
> > > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > > +		if (IS_ERR(copy_selem)) {
> > > +			ret = PTR_ERR(copy_selem);
> > > +			goto err;
> > > +		}
> > > +
> > > +		if (!new_sk_storage) {
> > > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > > +			if (ret) {
> > > +				kfree(copy_selem);
> > > +				atomic_sub(smap->elem_size,
> > > +					   &newsk->sk_omem_alloc);
> > > +				goto err;
> > > +			}
> > > +
> > > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > > +			continue;
> > > +		}
> > > +
> > > +		raw_spin_lock_bh(&new_sk_storage->lock);
> > > +		selem_link_map(smap, copy_selem);
> > Unlike the existing selem-update use-cases in bpf_sk_storage.c,
> > the smap->map.refcnt has not been held here.  Reading the smap
> > is fine.  However, adding a new selem to a deleting smap is an issue.
> > Hence, I think bpf_map_inc_not_zero() should be done first.
> In this case, I should probably do it after smap = rcu_deref()?
Right.

and bpf_map_put should be called when done.  Because of bpf_map_put,
it may be a good idea to add a comment to the first synchronize_rcu()
in bpf_sk_storage_map_free() since this new bpf_sk_storage_clone()
also depends on it now, which makes it different from other bpf maps.
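
The "increment only if not already zero" idea behind the bpf_map_inc_not_zero() suggestion can be sketched in userspace with C11 atomics. This is an illustration of the general pattern only, not the kernel implementation: struct demo_map, demo_map_inc_not_zero() and demo_map_put() are hypothetical names, and the real helpers deal with RCU, locking and freeing that are left out here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted map; not the kernel's struct bpf_map. */
struct demo_map {
	atomic_int refcnt;
};

/* Take a reference only if the count has not already dropped to zero,
 * i.e. only if the map is not already on its way to being freed.
 */
static bool demo_map_inc_not_zero(struct demo_map *map)
{
	int old = atomic_load(&map->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&map->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool demo_map_put(struct demo_map *map)
{
	return atomic_fetch_sub(&map->refcnt, 1) == 1;
}
```

In the clone loop, the corresponding sequence would be: try to take the reference right after obtaining smap, skip the element if that fails, link the copied element, then drop the reference. A plain unconditional increment would not be enough, because it could resurrect a map whose count had already reached zero.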

> 
> > > +		__selem_link_sk(new_sk_storage, copy_selem);
> > > +		raw_spin_unlock_bh(&new_sk_storage->lock);
> > > +	}
> > > +
> > > +out:
> > > +	rcu_read_unlock();
> > > +	return 0;
> > > +
> > > +err:
> > > +	rcu_read_unlock();
> > > +
> > > +	bpf_sk_storage_free(newsk);
> > > +	return ret;
> > > +}
> > > +
> > >  BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > >  	   void *, value, u64, flags)
> > >  {
> > >  	struct bpf_sk_storage_data *sdata;
> > >  
> > > -	if (flags > BPF_SK_STORAGE_GET_F_CREATE)
> > > +	if (flags & ~BPF_SK_STORAGE_GET_F_MASK)
> > > +		return (unsigned long)NULL;
> > > +
> > > +	if ((flags & BPF_SK_STORAGE_GET_F_CLONE) &&
> > > +	    !(flags & BPF_SK_STORAGE_GET_F_CREATE))
> > >  		return (unsigned long)NULL;
> > >  
> > >  	sdata = sk_storage_lookup(sk, map, true);
> > >  	if (sdata)
> > >  		return (unsigned long)sdata->data;
> > >  
> > > -	if (flags == BPF_SK_STORAGE_GET_F_CREATE &&
> > > +	if ((flags & BPF_SK_STORAGE_GET_F_CREATE) &&
> > >  	    /* Cannot add new elem to a going away sk.
> > >  	     * Otherwise, the new elem may become a leak
> > >  	     * (and also other memory issues during map
> > > @@ -762,6 +853,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > >  		/* sk must be a fullsock (guaranteed by verifier),
> > >  		 * so sock_gen_put() is unnecessary.
> > >  		 */
> > > +		if (!IS_ERR(sdata))
> > > +			SELEM(sdata)->clone =
> > > +				!!(flags & BPF_SK_STORAGE_GET_F_CLONE);
> > >  		sock_put(sk);
> > >  		return IS_ERR(sdata) ?
> > >  			(unsigned long)NULL : (unsigned long)sdata->data;
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..f5e801a9cea4 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> > >  			goto out;
> > >  		}
> > >  		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> > > -#ifdef CONFIG_BPF_SYSCALL
> > > -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > -#endif
> > > +
> > > +		if (bpf_sk_storage_clone(sk, newsk)) {
> > > +			sk_free_unlock_clone(newsk);
> > > +			newsk = NULL;
> > > +			goto out;
> > > +		}
> > >  
> > >  		newsk->sk_err	   = 0;
> > >  		newsk->sk_err_soft = 0;
> > > -- 
> > > 2.22.0.770.g0f2c4a37fd-goog
> > > 

^ permalink raw reply

* Re: [PATCH 0/4] pull request for net-next: batman-adv 2019-08-08
From: David Miller @ 2019-08-08 18:29 UTC (permalink / raw)
  To: sw; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <20190808130619.4481-1-sw@simonwunderlich.de>

From: Simon Wunderlich <sw@simonwunderlich.de>
Date: Thu,  8 Aug 2019 15:06:15 +0200

> here is a small feature/cleanup pull request of batman-adv to go into net-next.
> 
> Please pull or let me know of any problem!

Pulled, thanks.

That lockdep annotation in the 4th patch really helped with the review.

^ permalink raw reply

* Re: [PATCH net-next] taprio: remove unused variable 'entry_list_policy'
From: David Miller @ 2019-08-08 18:38 UTC (permalink / raw)
  To: yuehaibing
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, linux-kernel, netdev
In-Reply-To: <20190808142623.69188-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Thu, 8 Aug 2019 22:26:23 +0800

> net/sched/sch_taprio.c:680:32: warning:
>  entry_list_policy defined but not used [-Wunused-const-variable=]
> 
> It is not used since commit a3d43c0d56f1 ("taprio: Add
> support adding an admin schedule")
> 
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

This is probably unintentional and a bug; we should be using that
policy to validate that the sched list is indeed a nested
attribute.

I'm not applying this without at least a better, clearer commit
message explaining why we shouldn't be using this policy any more.


* RE: [PATCH v3 1/1] ixgbe: sync the first fragment unconditionally
From: Bowers, AndrewX @ 2019-08-08 18:42 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190808040312.21719-1-firo.yang@suse.com>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Firo Yang
> Sent: Wednesday, August 7, 2019 9:04 PM
> To: netdev@vger.kernel.org
> Cc: maciejromanfijalkowski@gmail.com; Firo Yang <firo.yang@suse.com>;
> linux-kernel@vger.kernel.org; intel-wired-lan@lists.osuosl.org;
> jian.w.wen@oracle.com; alexander.h.duyck@linux.intel.com;
> davem@davemloft.net
> Subject: [Intel-wired-lan] [PATCH v3 1/1] ixgbe: sync the first fragment
> unconditionally
> 
> In a Xen environment, if Xen-swiotlb is enabled, the ixgbe driver could
> allocate a page (a DMA memory buffer) for the first fragment which is not
> suitable for Xen-swiotlb to do DMA operations.
> Xen-swiotlb has to internally allocate another page for doing DMA
> operations. This mechanism requires syncing the data from the internal page
> to the page which ixgbe sends to the upper network stack. However, since
> commit f3213d932173 ("ixgbe: Update driver to make use of DMA attributes
> in Rx path"), the unmap operation is performed with
> DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.
> Since the sync isn't performed, the upper network stack could receive an
> incomplete network packet. By incomplete, it means the linear data on the
> first fragment (between skb->head and skb->end) is invalid. So we have to
> copy the data from the internal Xen-swiotlb page to the page which ixgbe
> sends to the upper network stack through the sync operation.
> 
> More details from Alexander Duyck:
> Specifically since we are mapping the frame with
> DMA_ATTR_SKIP_CPU_SYNC we have to unmap with that as well. As a result
> a sync is not performed on an unmap and must be done manually as we
> skipped it for the first frag. As such we need to always sync before possibly
> performing a page unmap operation.
> 
> Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA attributes in
> Rx path")
> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Firo Yang <firo.yang@suse.com>
> ---
> Changes from v2:
>  * Added details on the problem caused by skipping the sync.
>  * Added more explanation from Alexander Duyck.
> 
> Changes from v1:
>  * Improved the patch description.
>  * Added Reviewed-by: and Fixes: as suggested by Alexander Duyck.
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>




* Re: [PATCH net-next 5/5] r8152: change rx_frag_head_sz and rx_max_agg_num dynamically
From: Jakub Kicinski @ 2019-08-08 18:43 UTC (permalink / raw)
  To: Hayes Wang
  Cc: Maciej Fijalkowski, netdev@vger.kernel.org, nic_swsd,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB2F18D0F3F@RTITMBSVM03.realtek.com.tw>

On Thu, 8 Aug 2019 12:16:50 +0000, Hayes Wang wrote:
> Maciej Fijalkowski [mailto:maciejromanfijalkowski@gmail.com]
> > Sent: Thursday, August 08, 2019 7:50 PM  
> > > Excuse me again.
> > > I found that the kernel supports the ethtool copybreak tunable.
> > > However, I couldn't find an ethtool command to use it.
> > 
> > Ummm, there's the set_tunable op. Amazon's ena driver makes use of it,
> > from what I see. Look at ena_set_tunable() in
> > drivers/net/ethernet/amazon/ena/ena_ethtool.c.
> 
> The kernel can support it, and I have finished it.
> However, when I wanted to test it with ethtool, I couldn't find a suitable
> command. I couldn't find a related feature in the ethtool source code, either.

It's possible it's not implemented in the user space tool 🤔

Looks like it got posted here:

https://www.spinics.net/lists/netdev/msg299877.html

But perhaps never finished? 

It should be fairly straightforward to implement by looking at how
phy-tunables are handled.


* Re: [PATCH 1/1] bpf: introduce new helper udp_flow_src_port
From: Andrii Nakryiko @ 2019-08-08 18:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Y Song, Alexei Starovoitov, Farid Zakaria, Daniel Borkmann,
	netdev, bpf
In-Reply-To: <20190805171036.5a5bf790@cakuba.netronome.com>

On Mon, Aug 5, 2019 at 5:11 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Sat, 3 Aug 2019 23:52:16 -0700, Y Song wrote:
> > >  include/uapi/linux/bpf.h                      | 21 +++++++--
> > >  net/core/filter.c                             | 20 ++++++++
> > >  tools/include/uapi/linux/bpf.h                | 21 +++++++--
> > >  tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
> > >  .../bpf/prog_tests/udp_flow_src_port.c        | 28 +++++++++++
> > >  .../bpf/progs/test_udp_flow_src_port_kern.c   | 47 +++++++++++++++++++
> > >  6 files changed, 131 insertions(+), 8 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/udp_flow_src_port.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_udp_flow_src_port_kern.c
> >
> > First, to ease review, backporting and syncing with the libbpf repo,
> > could you in the future break the patch into three patches?
> >    1. kernel changes (net/core/filter.c, include/uapi/linux/bpf.h)
> >    2. tools/include/uapi/linux/bpf.h
> >    3. tools/testing/ changes
>
> A lot of people get caught off guard by this; could you explain why
> this is necessary?

We are using a script [0] to sync libbpf sources from the Linux repo to
GitHub. It does a lot of things to make this happen, given that the
GitHub repo structure is not a simple copy/move into a subdirectory.
Instead it does a bunch of cherry-picking and tree rewrites, so when a
patch touches both libbpf sources (including those tools/include/...
files) and sources that we don't sync (e.g., just include/...), the
script/git gets confused, which breaks the flow and requires more
manual work. That is why we are asking to split those changes. Hope
this helps to clarify.

  [0] https://github.com/libbpf/libbpf/blob/master/scripts/sync-kernel.sh

>
> git can deal with this scenario without missing a step, format-patch
> takes paths:
>
> $ git show --oneline -s
> 1002f3e955d7 (HEAD) bpf: introduce new helper udp_flow_src_port
>
> $ git format-patch HEAD~ -- tools/include/uapi/linux/bpf.h
> 0001-bpf-introduce-new-helper-udp_flow_src_port.patch
>
> $ grep -B1 changed 0001-bpf-introduce-new-helper-udp_flow_src_port.patch
>  tools/include/uapi/linux/bpf.h | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> $ cd ../libbpf
> $ git am -p2 ../linux/0001-bpf-introduce-new-helper-udp_flow_src_port.patch
> Applying: bpf: introduce new helper udp_flow_src_port
> error: patch failed: include/uapi/linux/bpf.h:2853
> error: include/uapi/linux/bpf.h: patch does not apply
> ...
>
> Well, the patch doesn't apply to libbpf right now, but git finds the
> right paths and all that.
>
> IMO it'd be good to not have this artificial process obstacle and all
> the "sync headers" commits in the tree.

It might be that the script can be written in a different way to
bypass this limitation, but someone has to dedicate time to write
and test it. Feel free to contribute.


* Re: [PATCH] liquidio: Use pcie_flr() instead of reimplementing it
From: Bjorn Helgaas @ 2019-08-08 18:48 UTC (permalink / raw)
  To: Denis Efremov
  Cc: Bjorn Helgaas, Derek Chickles, Satanand Burla, Felix Manlunas,
	netdev, linux-pci, linux-kernel
In-Reply-To: <20190808045753.5474-1-efremov@linux.com>

On Thu, Aug 08, 2019 at 07:57:53AM +0300, Denis Efremov wrote:
> octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR
> bit, which bypasses timing requirements imposed by the PCIe spec.
> This patch fixes the function to use the pcie_flr() interface instead.
> 
> Signed-off-by: Denis Efremov <efremov@linux.com>

Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks for doing this, Denis.  When possible it's better to use a PCI
core interface than to fiddle with PCI config space directly from a
driver.

> ---
>  drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> index 021d99cd1665..614d07be7181 100644
> --- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> +++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> @@ -260,9 +260,7 @@ static int octeon_mbox_process_cmd(struct octeon_mbox *mbox,
>  		dev_info(&oct->pci_dev->dev,
>  			 "got a request for FLR from VF that owns DPI ring %u\n",
>  			 mbox->q_no);
> -		pcie_capability_set_word(
> -			oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no],
> -			PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
> +		pcie_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no]);
>  		break;
>  
>  	case OCTEON_PF_CHANGED_VF_MACADDR:
> -- 
> 2.21.0
> 


* Re: [PATCH bpf 2/2] tools: bpftool: add error message on pin failure
From: Andrii Nakryiko @ 2019-08-08 18:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, bpf, oss-drivers,
	Andy Lutomirski, Quentin Monnet
In-Reply-To: <20190807001923.19483-3-jakub.kicinski@netronome.com>

On Tue, Aug 6, 2019 at 5:21 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> No error message is currently printed if the pin syscall
> itself fails. It got lost in the loadall refactoring.
>
> Fixes: 77380998d91d ("bpftool: add loadall command")
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---

Acked-by: Andrii Nakryiko <andriin@fb.com>

> CC: luto@kernel.org, sdf@google.com
>
>  tools/bpf/bpftool/common.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index c52a6ffb8949..6a71324be628 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -204,7 +204,11 @@ int do_pin_fd(int fd, const char *name)
>         if (err)
>                 return err;
>
> -       return bpf_obj_pin(fd, name);
> +       err = bpf_obj_pin(fd, name);
> +       if (err)
> +               p_err("can't pin the object (%s): %s", name, strerror(errno));
> +
> +       return err;
>  }
>
>  int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
> --
> 2.21.0
>
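
The pattern in the quoted hunk (store the return value, report the
failure with strerror(errno), then hand the same value back to the
caller) can be tried outside bpftool. Below is a standalone sketch
under stated assumptions: mkdir() stands in for bpf_obj_pin() only so
that the example builds without libbpf, fprintf() stands in for
p_err(), and pin_stub() and pin_and_report() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Stand-in for bpf_obj_pin(): any call that returns non-zero and sets
 * errno on failure would do. mkdir() is an assumption for illustration.
 */
static int pin_stub(const char *path)
{
	return mkdir(path, 0700);
}

/*
 * Call the pin operation and print a message if it fails, so the
 * error is never silently dropped. Returns the callee's result.
 */
static int pin_and_report(const char *path)
{
	int err, saved_errno;

	assert(path != NULL);

	err = pin_stub(path);
	if (err) {
		/* Save errno first: the reporting call may overwrite it. */
		saved_errno = errno;
		fprintf(stderr, "can't pin the object (%s): %s\n",
			path, strerror(saved_errno));
		errno = saved_errno;
	}

	return err;
}
```

Restoring errno after the message is printed is a choice made in this
sketch, not something the quoted patch does; it keeps the error code
available to callers that inspect it.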

