Netdev List
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
From: Paolo Bonzini @ 2018-04-05 15:31 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Si-Wei Liu, Jiri Pirko, Stephen Hemminger, Alexander Duyck,
	David Miller, Brandeburg, Jesse, Jakub Kicinski, Jason Wang,
	Samudrala, Sridhar, Netdev, virtualization, virtio-dev
In-Reply-To: <CADGSJ23jr7xrq8h9rFHLtXkbpsB+AK8wcUzOQnpeftbhiYdOvA@mail.gmail.com>

On 04/04/2018 10:02, Siwei Liu wrote:
>> pci_bus_num is almost always a bug if not done within
>> a context of a PCI host, bridge, etc.
>>
>> In particular this will not DTRT if run before guest assigns bus
>> numbers.
>>
> I was seeking means to reserve a specific pci bus slot from drivers,
> and update the driver when guest assigns the bus number but it seems
> there's no low-hanging fruits. Because of that reason the bus_num is
> only obtained until it's really needed (during get_config) and I
> assume at that point the pci bus assignment is already done. I know
> the current one is not perfect, but we need that information (PCI
> bus:slot.func number) to name the guest device correctly.

Can you use the -device "id", and look it up as

    devices = container_get(qdev_get_machine(), "/peripheral");
    return object_resolve_path_component(devices, id);

?

Thanks,

Paolo

^ permalink raw reply

* Kernel bug from adding bpf actions in tc
From: Lucas Bates @ 2018-04-05 15:23 UTC (permalink / raw)
  To: dcaratti; +Cc: Linux Kernel Network Developers

Hi Davide,

Our overnight tc test runs of net-next revealed a kernel bug on one of
the BPF tests you submitted, d959.  The add action completes
successfully, but the bug occurs on the verify when tdc does a get of
the action that was just added.  Here's the text of the dump:

[   61.973632] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000020
[   61.974366] PGD 8000000081881067 P4D 8000000081881067 PUD 83784067 PMD 0
[   61.974986] Oops: 0000 [#1] SMP PTI
[   61.975309] Modules linked in: kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
aes_x86_64 crypto_simd psmouse glue_helper cryptd serio_raw
[   61.976800] CPU: 28 PID: 1087 Comm: tc Not tainted 4.16.0+ #28
[   61.977329] RIP: 0010:__bpf_prog_put+0x5/0xe0
[   61.977731] RSP: 0018:ffff9647c4823788 EFLAGS: 00010202
[   61.978204] RAX: 0000000000000000 RBX: ffff9647c48237a0 RCX: 000000000000176c
[   61.978845] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[   61.979484] RBP: 0000000000000000 R08: 0000000000025be0 R09: ffffffffa4794077
[   61.980121] R10: ffffdc1bc2130b00 R11: 0000000000000000 R12: 0000000000000000
[   61.980763] R13: 0000000000000001 R14: ffff8889869969f0 R15: 00000000ffffffea
[   61.981398] FS:  00007faa72489700(0000) GS:ffff888988f00000(0000)
knlGS:0000000000000000
[   61.982114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   61.982627] CR2: 0000000000000020 CR3: 0000000085938004 CR4: 00000000003606a0
[   61.983263] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   61.983897] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   61.984531] Call Trace:
[   61.984766]  tcf_bpf_cfg_cleanup+0x2f/0x40
[   61.985139]  tcf_bpf_cleanup+0x3c/0x50
[   61.985479]  ? uncore_event_cpu_online+0x80/0x3c0
[   61.985922]  __tcf_idr_release+0x72/0x150
[   61.986297]  tcf_bpf_init+0x102/0x3e0
[   61.986637]  ? perf_trace_sched_process_exec+0xf4/0x140
[   61.987108]  tcf_action_init_1+0x36c/0x410
[   61.987482]  ? ___slab_alloc+0x218/0x4b0
[   61.987841]  tcf_action_init+0x106/0x190
[   61.988204]  tc_ctl_action+0x11a/0x220
[   61.988551]  rtnetlink_rcv_msg+0x243/0x2f0
[   61.988931]  ? _cond_resched+0x16/0x40
[   61.989277]  ? __kmalloc_node_track_caller+0x1e6/0x2a0
[   61.989746]  ? rtnl_calcit.isra.29+0xe0/0xe0
[   61.990137]  netlink_rcv_skb+0xde/0x110
[   61.990494]  netlink_unicast+0x16d/0x220
[   61.990858]  netlink_sendmsg+0x293/0x370
[   61.991224]  sock_sendmsg+0x36/0x40
[   61.991559]  ___sys_sendmsg+0x2cb/0x2e0
[   61.991913]  ? pagecache_get_page+0x27/0x220
[   61.992302]  ? filemap_fault+0xa2/0x650
[   61.992651]  ? page_add_file_rmap+0x108/0x200
[   61.993057]  ? alloc_set_pte+0x2aa/0x530
[   61.993419]  ? finish_fault+0x4e/0x70
[   61.993758]  ? __handle_mm_fault+0xbfc/0x1110
[   61.994160]  ? __sys_sendmsg+0x53/0x80
[   61.994506]  __sys_sendmsg+0x53/0x80
[   61.994849]  do_syscall_64+0x6e/0x120
[   61.995189]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   61.995662] RIP: 0033:0x7faa71885bd0
[   61.995991] RSP: 002b:00007ffccf65bf28 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[   61.996693] RAX: ffffffffffffffda RBX: 00007ffccf65c050 RCX: 00007faa71885bd0
[   61.997350] RDX: 0000000000000000 RSI: 00007ffccf65bfa0 RDI: 0000000000000003
[   61.998001] RBP: 000000005ac513e0 R08: 0000000000000001 R09: 0000000000000000
[   61.998649] R10: 00000000000005e7 R11: 0000000000000246 R12: 0000000000000000
[   61.999287] R13: 00007ffccf6600b0 R14: 0000000000000001 R15: 0000000000674600
[   61.999942] Code: c6 72 00 48 8b 43 20 48 c7 c7 d0 61 9b a5 c7 40
10 00 00 00 00 5b e9 1b d9 74 00 90 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <48> 8b 47 20 f0 ff 08 74 01 c3 41 54 41 89 f4 55 48 89 fd
53 66
[   62.001660] RIP: __bpf_prog_put+0x5/0xe0 RSP: ffff9647c4823788
[   62.002183] CR2: 0000000000000020
[   62.002488] ---[ end trace 19b56d1a66dd8e2a ]---



I'm sending this your way because you were the last one to touch this
part of the code.  Have you seen this in your own testing?  (This
can't be replicated by hand, only by running tdc).

- Lucas

^ permalink raw reply

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Andrew Lunn @ 2018-04-05 15:23 UTC (permalink / raw)
  To: Laurentiu Tudor
  Cc: Stuart Yoder, Arnd Bergmann, Ioana Ciornei, gregkh,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <5AC63610.4000504@nxp.com>

On Thu, Apr 05, 2018 at 02:43:29PM +0000, Laurentiu Tudor wrote:
> Hi Andrew,
> 
> On 04/05/2018 03:48 PM, Andrew Lunn wrote:
> >>> Hi Laurentiu
> >>>
> >>> So i can use switchdev without it? I can modprobe the switchdev
> >>> driver, all the physical interfaces will appear, and i can use ip addr
> >>> add etc. I do not need to use a user space tool at all in order to use
> >>> the network functionality?
> >>
> >> Absolutely!
> >
> > Great.
> >
> > Then the easiest way forwards is to simply drop the IOCTL code for the
> > moment. Get the basic support for the hardware into the kernel
> > first. Then come back later to look at dynamic behaviour which needs
> > some form of configuration.
> 
> Hmm, not sure I understand. We already have a fully functional ethernet 
> driver [1] and a switch driver [2] ...

In staging, the tree of crap. You want to get it out of there and into
the main tree. But that effort is being side lined by the discussion
around this IOCTL call.  The best way forward is to to accept Greg is
not going to take this patchset at the moment, and move on. As you
said, it is not needed for the Ethernet and switchdev driver.

What needs to happen before the Ethernet driver can be reviewed for
moving out of staging?

       Andrew

^ permalink raw reply

* [PATCH 9/9] perf tools: The buildid usage in example eBPF program
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

The bpf-samples/bpf-stdout-example.c demonstrates how to put the
buildid data into eBPF program.

Link: http://lkml.kernel.org/n/tip-dq97ddil7h3qbvphbbo8p08c@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/bpf-samples/bpf-stdout-example.c | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 tools/perf/bpf-samples/bpf-stdout-example.c

diff --git a/tools/perf/bpf-samples/bpf-stdout-example.c b/tools/perf/bpf-samples/bpf-stdout-example.c
new file mode 100644
index 000000000000..60783442bd5c
--- /dev/null
+++ b/tools/perf/bpf-samples/bpf-stdout-example.c
@@ -0,0 +1,42 @@
+#include <uapi/linux/bpf.h>
+#include <linux/buildid.h>
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+char _license[] SEC("license") = "GPL";
+int _version    SEC("version") = LINUX_VERSION_CODE;
+char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
+
+static unsigned long long (*bpf_get_smp_processor_id)(void) =
+	(void *) BPF_FUNC_get_smp_processor_id;
+static int (*bpf_perf_event_output)(void *ctx, void *map,
+                                    unsigned long long flags, void *data,
+                                    int size) =
+        (void *) BPF_FUNC_perf_event_output;
+
+struct bpf_map_def {
+        unsigned int type;
+        unsigned int key_size;
+        unsigned int value_size;
+        unsigned int max_entries;
+        unsigned int map_flags;
+        unsigned int inner_map_idx;
+        unsigned int numa_node;
+};
+
+struct bpf_map_def SEC("maps") __bpf_stdout__ = {
+	.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(u32),
+	.max_entries	= __NR_CPUS__,
+};
+
+SEC("probe=sys_read")
+int func(void *ctx)
+{
+	char output_str[] = "Raise a BPF event!";
+
+	bpf_perf_event_output(ctx, &__bpf_stdout__, bpf_get_smp_processor_id(),
+			      &output_str, sizeof(output_str));
+	return 0;
+}
-- 
2.13.6

^ permalink raw reply related

* [PATCH 8/9] libbpf: Add support to attach buildid to program load
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Adding support to retrieve buildid from elf's "buildid"
section and passing it through to the load_program
function to kernel bpf syscall.

Fixing perf use of the bpf_load_program function and
linking in the vsprintf.o into bpftool to have the
scnprintf function in.

Link: http://lkml.kernel.org/n/tip-2pafwtzbyosmf9ftuf0udn54@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/bpf/bpftool/Makefile |  5 ++++-
 tools/lib/bpf/bpf.c        |  6 ++++--
 tools/lib/bpf/bpf.h        |  5 +++--
 tools/lib/bpf/libbpf.c     | 46 ++++++++++++++++++++++++++++++++++++++++------
 tools/perf/tests/bpf.c     |  9 ++++++++-
 5 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 4e69782c4a79..9ac11ea5de1c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -75,11 +75,14 @@ include $(wildcard $(OUTPUT)*.d)
 all: $(OUTPUT)bpftool
 
 SRCS = $(wildcard *.c)
-OBJS = $(patsubst %.c,$(OUTPUT)%.o,$(SRCS)) $(OUTPUT)disasm.o
+OBJS = $(patsubst %.c,$(OUTPUT)%.o,$(SRCS)) $(OUTPUT)disasm.o $(OUTPUT)vsprintf.o
 
 $(OUTPUT)disasm.o: $(srctree)/kernel/bpf/disasm.c
 	$(QUIET_CC)$(COMPILE.c) -MMD -o $@ $<
 
+$(OUTPUT)vsprintf.o: $(srctree)/tools/lib/vsprintf.c
+	$(QUIET_CC)$(COMPILE.c) -MMD -o $@ $<
+
 $(OUTPUT)bpftool: $(OBJS) $(LIBBPF)
 	$(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index acbb3f8b3bec..8e384db5bbbd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -168,6 +168,7 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 	attr.log_size = 0;
 	attr.log_level = 0;
 	attr.kern_version = load_attr->kern_version;
+	attr.kern_buildid = ptr_to_u64(load_attr->buildid);
 	memcpy(attr.prog_name, load_attr->name,
 	       min(name_len, BPF_OBJ_NAME_LEN - 1));
 
@@ -185,8 +186,8 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 
 int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		     size_t insns_cnt, const char *license,
-		     __u32 kern_version, char *log_buf,
-		     size_t log_buf_sz)
+		     __u32 kern_version, const char *buildid,
+		     char *log_buf, size_t log_buf_sz)
 {
 	struct bpf_load_program_attr load_attr;
 
@@ -198,6 +199,7 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 	load_attr.insns_cnt = insns_cnt;
 	load_attr.license = license;
 	load_attr.kern_version = kern_version;
+	load_attr.buildid = buildid;
 
 	return bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
 }
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 39f6a0d64a3b..b5ffb178ebdd 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -49,6 +49,7 @@ struct bpf_load_program_attr {
 	size_t insns_cnt;
 	const char *license;
 	__u32 kern_version;
+	const char *buildid;
 };
 
 /* Recommend log buffer size */
@@ -57,8 +58,8 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 			   char *log_buf, size_t log_buf_sz);
 int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		     size_t insns_cnt, const char *license,
-		     __u32 kern_version, char *log_buf,
-		     size_t log_buf_sz);
+		     __u32 kern_version, const char *buildid,
+		     char *log_buf, size_t log_buf_sz);
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		       size_t insns_cnt, int strict_alignment,
 		       const char *license, __u32 kern_version,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5922443063f0..421f2c2e0ebe 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -220,6 +220,7 @@ static LIST_HEAD(bpf_objects_list);
 
 struct bpf_object {
 	char license[64];
+	char buildid[64];
 	u32 kern_version;
 
 	struct bpf_program *programs;
@@ -599,6 +600,32 @@ bpf_object__init_kversion(struct bpf_object *obj,
 	return 0;
 }
 
+static void buildid_scnprint(char *buf, int n, char *buildid, int size)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < size; i++) {
+		ret += scnprintf(buf + ret, n - ret, "%x",
+				 (unsigned char) buildid[i]);
+	}
+}
+
+static int
+bpf_object__init_buildid(struct bpf_object *obj,
+			 void *data, size_t size)
+{
+	char buf[64];
+
+	if (size > sizeof(obj->buildid))
+		return -LIBBPF_ERRNO__FORMAT;
+
+	memcpy(&obj->buildid, data, size);
+
+	buildid_scnprint(buf, 64, obj->buildid, size);
+	pr_debug("kernel buildid of %s is: %s\n", obj->path, buf);
+	return 0;
+}
+
 static int compare_bpf_map(const void *_a, const void *_b)
 {
 	const struct bpf_map *a = _a;
@@ -817,6 +844,10 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			err = bpf_object__init_kversion(obj,
 							data->d_buf,
 							data->d_size);
+		else if (strcmp(name, "buildid") == 0)
+			err = bpf_object__init_buildid(obj,
+						       data->d_buf,
+						       data->d_size);
 		else if (strcmp(name, "maps") == 0)
 			obj->efile.maps_shndx = idx;
 		else if (sh.sh_type == SHT_SYMTAB) {
@@ -1166,7 +1197,7 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
 static int
 load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
 	     const char *name, struct bpf_insn *insns, int insns_cnt,
-	     char *license, u32 kern_version, int *pfd)
+	     char *license, u32 kern_version, const char *buildid, int *pfd)
 {
 	struct bpf_load_program_attr load_attr;
 	char *log_buf;
@@ -1180,6 +1211,7 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
 	load_attr.insns_cnt = insns_cnt;
 	load_attr.license = license;
 	load_attr.kern_version = kern_version;
+	load_attr.buildid = buildid;
 
 	if (!load_attr.insns || !load_attr.insns_cnt)
 		return -EINVAL;
@@ -1189,7 +1221,6 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
 		pr_warning("Alloc log buffer for bpf loader error, continue without log\n");
 
 	ret = bpf_load_program_xattr(&load_attr, log_buf, BPF_LOG_BUF_SIZE);
-
 	if (ret >= 0) {
 		*pfd = ret;
 		ret = 0;
@@ -1234,7 +1265,8 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
 
 static int
 bpf_program__load(struct bpf_program *prog,
-		  char *license, u32 kern_version)
+		  char *license, u32 kern_version,
+		  const char *buildid)
 {
 	int err = 0, fd, i;
 
@@ -1261,7 +1293,7 @@ bpf_program__load(struct bpf_program *prog,
 		}
 		err = load_program(prog->type, prog->expected_attach_type,
 				   prog->name, prog->insns, prog->insns_cnt,
-				   license, kern_version, &fd);
+				   license, kern_version, buildid, &fd);
 		if (!err)
 			prog->instances.fds[0] = fd;
 		goto out;
@@ -1292,7 +1324,8 @@ bpf_program__load(struct bpf_program *prog,
 		err = load_program(prog->type, prog->expected_attach_type,
 				   prog->name, result.new_insn_ptr,
 				   result.new_insn_cnt,
-				   license, kern_version, &fd);
+				   license, kern_version,
+				   buildid, &fd);
 
 		if (err) {
 			pr_warning("Loading the %dth instance of program '%s' failed\n",
@@ -1324,7 +1357,8 @@ bpf_object__load_progs(struct bpf_object *obj)
 			continue;
 		err = bpf_program__load(&obj->programs[i],
 					obj->license,
-					obj->kern_version);
+					obj->kern_version,
+					obj->buildid);
 		if (err)
 			return err;
 	}
diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
index 79b54f8ddebf..2a738b7b743a 100644
--- a/tools/perf/tests/bpf.c
+++ b/tools/perf/tests/bpf.c
@@ -298,6 +298,7 @@ static int check_env(void)
 	int err;
 	unsigned int kver_int;
 	char license[] = "GPL";
+	char buildid[64];
 
 	struct bpf_insn insns[] = {
 		BPF_MOV64_IMM(BPF_REG_0, 1),
@@ -310,9 +311,15 @@ static int check_env(void)
 		return err;
 	}
 
+	err = fetch_kernel_buildid(buildid, sizeof(buildid));
+	if (err) {
+		pr_debug("Unable to get kernel buildid\n");
+		return err;
+	}
+
 	err = bpf_load_program(BPF_PROG_TYPE_KPROBE, insns,
 			       sizeof(insns) / sizeof(insns[0]),
-			       license, kver_int, NULL, 0);
+			       license, kver_int, buildid, NULL, 0);
 	if (err < 0) {
 		pr_err("Missing basic BPF support, skip this test: %s\n",
 		       strerror(errno));
-- 
2.13.6

^ permalink raw reply related

* [PATCH 7/9] libbpf: Synchronize uapi bpf.h header
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Synchronize include/uapi/linux/bpf.h with tools version.

Link: http://lkml.kernel.org/n/tip-gaja0nnet6oku657642nxnaf@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465023a2..17d8d330e6c7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -308,6 +308,8 @@ union bpf_attr {
 		 * (context accesses, allowed helpers, etc).
 		 */
 		__u32		expected_attach_type;
+		/* Checked when prog_type=kprobe and CONFIG_BPF_BUILDID_CHECK. */
+		__aligned_u64	kern_buildid;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -864,6 +866,7 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
 #define BPF_F_DONT_FRAGMENT		(1ULL << 2)
+#define BPF_F_SEQ_NUMBER		(1ULL << 3)
 
 /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
  * BPF_FUNC_perf_event_read_value flags.
-- 
2.13.6

^ permalink raw reply related

* [PATCH 6/9] bpf: Add CONFIG_BPF_BUILDID_CHECK option
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Adding CONFIG_BPF_BUILDID_CHECK option that forces kernel
to check on provided build id when loading eBPF program.
If the build id does not match the one in kernel the program
fails to load.

Adding new field into struct bpf_attr. The kern_buildid
points to the user memory that contains the buildid.

Kernel expose the notes section via sysfs, but there's
currently no other use for kernel's buildid, so I needed
to add new __init buildid_init function to parse it out.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/uapi/linux/bpf.h |  2 ++
 init/Kconfig             |  9 ++++++
 kernel/bpf/syscall.c     | 84 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..17d8d330e6c7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -308,6 +308,8 @@ union bpf_attr {
 		 * (context accesses, allowed helpers, etc).
 		 */
 		__u32		expected_attach_type;
+		/* Checked when prog_type=kprobe and CONFIG_BPF_BUILDID_CHECK. */
+		__aligned_u64	kern_buildid;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/init/Kconfig b/init/Kconfig
index 572df24dda9b..6d32220de7e0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1406,6 +1406,15 @@ config BPF_JIT_ALWAYS_ON
 	  Enables BPF JIT and removes BPF interpreter to avoid
 	  speculative execution of BPF instructions by the interpreter
 
+config BPF_BUILDID_CHECK
+	bool "Check buildid for kprobe and tracepoint programs"
+	depends on BPF_SYSCALL
+	select BUILDID_H
+	default n
+	help
+	  Enables BPF program load code to check on kernel Build ID
+	  for kprobe programs.
+
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
 	select ANON_INODES
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0244973ee544..d0b3bc0bd9e6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1239,7 +1239,85 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define	BPF_PROG_LOAD_LAST_FIELD expected_attach_type
+#define	BPF_PROG_LOAD_LAST_FIELD kern_buildid
+
+#ifdef CONFIG_BPF_BUILDID_CHECK
+
+static struct {
+	const char *ptr;
+	int         size;
+} buildid;
+
+#define BUILDID_MAX 40
+
+static int __check_buildid(union bpf_attr *attr)
+{
+	char id[buildid.size];
+
+	/* copy buildid from user space */
+	if (strncpy_from_user(id, u64_to_user_ptr(attr->kern_buildid),
+			      buildid.size) < 0)
+		return -EFAULT;
+
+	return memcmp(id, buildid.ptr, buildid.size);
+}
+
+static int check_buildid(union bpf_attr *attr)
+{
+	return buildid.ptr ? __check_buildid(attr) : 0;
+}
+
+#define NT_ALIGN(n)	(((n) + 3) & ~3)
+#define NT_GNU_BUILD_ID	3
+
+extern const void __start_notes __weak;
+extern const void __stop_notes __weak;
+
+static int __init buildid_init(void)
+{
+	const void *ptr = &__start_notes;
+
+	while (ptr < &__stop_notes) {
+		const Elf64_Nhdr *nhdr = ptr;
+		size_t namesz = NT_ALIGN(nhdr->n_namesz),
+		       descsz = NT_ALIGN(nhdr->n_descsz);
+		const char *name;
+
+		ptr += sizeof(*nhdr);
+		name = ptr;
+		ptr += namesz;
+
+		if (nhdr->n_type != NT_GNU_BUILD_ID ||
+		    nhdr->n_namesz != sizeof("GNU") ||
+		    memcmp(name, "GNU", sizeof("GNU"))) {
+			ptr += descsz;
+			continue;
+		}
+
+		buildid.ptr = ptr;
+		buildid.size = descsz;
+		break;
+	}
+
+	/* Sanity checks on the parsed buildid. */
+	if (!buildid.ptr) {
+		pr_warn("bpf: GNU buildid not found, switching off the check\n");
+	} else if (buildid.size > 64) {
+		pr_warn("bpf: GNU buildid too long, switching off the check\n");
+		buildid.ptr = NULL;
+	}
+
+	return 0;
+}
+
+subsys_initcall(buildid_init);
+
+#else
+static int check_buildid(union bpf_attr *attr __maybe_unused)
+{
+	return 0;
+}
+#endif /* CONFIG_BPF_BUILDID_CHECK */
 
 static int bpf_prog_load(union bpf_attr *attr)
 {
@@ -1271,6 +1349,10 @@ static int bpf_prog_load(union bpf_attr *attr)
 	    attr->kern_version != LINUX_VERSION_CODE)
 		return -EINVAL;
 
+	if (type == BPF_PROG_TYPE_KPROBE &&
+	    check_buildid(attr))
+		return -EINVAL;
+
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
 	    !capable(CAP_SYS_ADMIN))
-- 
2.13.6

^ permalink raw reply related

* [PATCH 5/9] bpf: Add CONFIG_BUILDID_H option
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Adding CONFIG_BUILDID_H option that forces build to generate
file with GNU build id value:

  include/linux/buildid.h

It contains following macros:

  #define LINUX_BUILDID_DATA "\x6c\x41\x0f\xea\xa9\x5d ...
  #define LINUX_BUILDID_SIZE 20

Those macros will be used in following patches to identify
kernel in more precise way when loading eBPF program that
can touch kernel internal structures.

There's new build output for the check and update
of the buildid.h:

    $ make
    ...
    LD      vmlinux.o
    MODPOST vmlinux.o
    KSYM    .tmp_kallsyms1.o
    KSYM    .tmp_kallsyms2.o
    LD      vmlinux
    SORTEX  vmlinux
    SYSMAP  System.map
    CHK     include/generated/uapi/linux/buildid.h
    UPD     include/generated/uapi/linux/buildid.h
    ...

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 Makefile                  | 12 ++++++++++++
 init/Kconfig              |  3 +++
 scripts/Makefile          |  1 +
 scripts/extract-buildid.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 58 insertions(+)
 create mode 100644 scripts/extract-buildid.c

diff --git a/Makefile b/Makefile
index a65a3919c6ad..92b04d8f08bc 100644
--- a/Makefile
+++ b/Makefile
@@ -1023,6 +1023,15 @@ endif
 include/generated/autoksyms.h: FORCE
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/adjust_autoksyms.sh true
 
+ifdef CONFIG_BUILDID_H
+buildid_h := include/linux/buildid.h
+
+define filechk_buildid.h
+	buildid=`readelf -n $@ | grep 'Build ID' | sed -e 's/^.*Build ID: \(.*\)$$/\1/'`; \
+	scripts/extract-buildid $$buildid
+endef
+endif
+
 ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
 
 # Final link of vmlinux with optional arch pass after final link
@@ -1032,6 +1041,9 @@ cmd_link-vmlinux =                                                 \
 
 vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
 	+$(call if_changed,link-vmlinux)
+ifdef CONFIG_BUILDID_H
+	+$(call filechk2,buildid.h,$(buildid_h))
+endif
 
 # Build samples along the rest of the kernel
 ifdef CONFIG_SAMPLES
diff --git a/init/Kconfig b/init/Kconfig
index 2852692d7c9c..572df24dda9b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1386,6 +1386,9 @@ config KALLSYMS_BASE_RELATIVE
 
 # end of the "standard kernel features (expert users)" menu
 
+config BUILDID_H
+	bool
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
diff --git a/scripts/Makefile b/scripts/Makefile
index 25ab143cbe14..fa34eaed6c29 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -19,6 +19,7 @@ hostprogs-$(CONFIG_ASN1)	 += asn1_compiler
 hostprogs-$(CONFIG_MODULE_SIG)	 += sign-file
 hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
 hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_BUILDID_H)	 += extract-buildid
 
 HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
 HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/extract-buildid.c b/scripts/extract-buildid.c
new file mode 100644
index 000000000000..a116723da3ad
--- /dev/null
+++ b/scripts/extract-buildid.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Formats buildid into following macros:
+ *
+ * #define LINUX_BUILDID_DATA "\x6c\x41\x0f\xea\xa9\x5d\x46 ...
+ * #define LINUX_BUILDID_SIZE 20
+ *
+ */
+#include <string.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+	char *id;
+	int len, i;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: %s buildid\n", argv[0]);
+		return -1;
+	}
+
+	id  = argv[1];
+	len = strlen(id);
+
+	printf("#ifndef _LINUX_BUILDID_H\n");
+	printf("#define _LINUX_BUILDID_H\n");
+	printf("\n");
+
+	printf("#define LINUX_BUILDID_DATA \"");
+
+	for (i = 0; i < len; i += 2)
+		printf("\\x%c%c", id[i], id[i + 1]);
+
+	printf("\"\n");
+
+	printf("#define LINUX_BUILDID_SIZE %u\n", len / 2);
+
+	printf("\n");
+	printf("#endif /* _LINUX_BUILDID_H */\n");
+	return 0;
+}
-- 
2.13.6

^ permalink raw reply related

* [PATCH 4/9] kbuild: Add filechk2 function
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Adding filechk2 function  that has the same semantics
as filechk, but it takes the target file from the 2nd
argument instead of from the '$@' as in the filechk
function.

This function is needed when we can't have separate
target for the file, like in the following patch.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 scripts/Kbuild.include | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 065324a8046f..9775ce2771d4 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -67,6 +67,30 @@ define filechk
 	fi
 endef
 
+# filechk2 is used to check if the content of a generated file is updated.
+# It follows the same logic as the filechk except instead of the $@ target
+# variable, the checked file is passed in 2nd argument.
+#
+# Sample usage:
+# define filechk_sample
+#	echo $KERNELRELEASE
+# endef
+# version.h : Makefile
+#	$(call filechk2,sample,version.h)
+# endef
+define filechk2
+	$(Q)set -e;					\
+	$(kecho) '  CHK     $(2)';			\
+	mkdir -p $(dir $(2));				\
+	$(filechk_$(1)) < $< > $(2).tmp;		\
+	if [ -r $(2) ] && cmp -s $(2) $(2).tmp; then	\
+		rm -f $(2).tmp;				\
+	else						\
+		$(kecho) '  UPD     $(2)';		\
+		mv -f $(2).tmp $(2);			\
+	fi
+endef
+
 ######
 # gcc support functions
 # See documentation in Documentation/kbuild/makefiles.txt
-- 
2.13.6

^ permalink raw reply related

* [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

There's no need to pass LD* arguments to link-vmlinux.sh,
because they are passed as variables. The only argument
the link-vmlinux.sh supports is the 'clean' argument.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index d3300e46f925..a65a3919c6ad 100644
--- a/Makefile
+++ b/Makefile
@@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
 
 # Final link of vmlinux with optional arch pass after final link
 cmd_link-vmlinux =                                                 \
-	$(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;    \
+	$(CONFIG_SHELL) $< ;                                       \
 	$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
 
 vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
-- 
2.13.6

^ permalink raw reply related

* [PATCH 2/9] perf tools: Add fetch_kernel_buildid function
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

Adding fetch_kernel_buildid helper function to
retrieve build id from running kernel. It will
be used in following patches.

Link: http://lkml.kernel.org/n/tip-at98orsncas8v2ito61u3qod@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/util/util.c | 18 ++++++++++++++++++
 tools/perf/util/util.h |  2 ++
 2 files changed, 20 insertions(+)

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 41fc61d941ef..26f93866bc02 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -495,6 +495,24 @@ int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
 	return -1;
 }
 
+int fetch_kernel_buildid(char *buildid, int size)
+{
+	char path[PATH_MAX], *buf;
+	size_t len;
+	int err;
+
+	scnprintf(path, PATH_MAX, "%s/kernel/notes",
+		  sysfs__mountpoint());
+
+	if (filename__read_str(path, &buf, &len))
+		return -EINVAL;
+
+	err = parse_notes_buildid(buf, len, buildid, size, false);
+
+	free(buf);
+	return err;
+}
+
 const char *perf_tip(const char *dirpath)
 {
 	struct strlist *tips;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 27106548396b..1ccaa957e3ab 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -48,6 +48,8 @@ extern int cacheline_size;
 int fetch_kernel_version(unsigned int *puint,
 			 char *str, size_t str_sz);
 
+int fetch_kernel_buildid(char *buildid, int size);
+
 int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
 			size_t size, bool need_swap);
 
-- 
2.13.6

^ permalink raw reply related

* [PATCH 1/9] perf tools: Make read_build_id function public
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

And renaming it into parse_notes_buildid to be more
precise and usable in following patches.

Link: http://lkml.kernel.org/n/tip-v1mz76rkdxfnbfz2v05fumn6@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/util/symbol-minimal.c | 50 ++--------------------------------------
 tools/perf/util/util.c           | 48 ++++++++++++++++++++++++++++++++++++++
 tools/perf/util/util.h           |  4 ++++
 3 files changed, 54 insertions(+), 48 deletions(-)

diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index ff48d0d49584..bd281d3dc508 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -24,53 +24,6 @@ static bool check_need_swap(int file_endian)
 	return host_endian != file_endian;
 }
 
-#define NOTE_ALIGN(sz) (((sz) + 3) & ~3)
-
-#define NT_GNU_BUILD_ID	3
-
-static int read_build_id(void *note_data, size_t note_len, void *bf,
-			 size_t size, bool need_swap)
-{
-	struct {
-		u32 n_namesz;
-		u32 n_descsz;
-		u32 n_type;
-	} *nhdr;
-	void *ptr;
-
-	ptr = note_data;
-	while (ptr < (note_data + note_len)) {
-		const char *name;
-		size_t namesz, descsz;
-
-		nhdr = ptr;
-		if (need_swap) {
-			nhdr->n_namesz = bswap_32(nhdr->n_namesz);
-			nhdr->n_descsz = bswap_32(nhdr->n_descsz);
-			nhdr->n_type = bswap_32(nhdr->n_type);
-		}
-
-		namesz = NOTE_ALIGN(nhdr->n_namesz);
-		descsz = NOTE_ALIGN(nhdr->n_descsz);
-
-		ptr += sizeof(*nhdr);
-		name = ptr;
-		ptr += namesz;
-		if (nhdr->n_type == NT_GNU_BUILD_ID &&
-		    nhdr->n_namesz == sizeof("GNU")) {
-			if (memcmp(name, "GNU", sizeof("GNU")) == 0) {
-				size_t sz = min(size, descsz);
-				memcpy(bf, ptr, sz);
-				memset(bf + sz, 0, size - sz);
-				return 0;
-			}
-		}
-		ptr += descsz;
-	}
-
-	return -1;
-}
-
 int filename__read_debuglink(const char *filename __maybe_unused,
 			     char *debuglink __maybe_unused,
 			     size_t size __maybe_unused)
@@ -153,7 +106,8 @@ int filename__read_build_id(const char *filename, void *bf, size_t size)
 			if (fread(buf, buf_size, 1, fp) != 1)
 				goto out_free;
 
-			ret = read_build_id(buf, buf_size, bf, size, need_swap);
+			ret = parse_notes_buildid(buf, buf_size, bf, size,
+						  need_swap);
 			if (ret == 0)
 				ret = size;
 			break;
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 1019bbc5dbd8..41fc61d941ef 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -447,6 +447,54 @@ fetch_kernel_version(unsigned int *puint, char *str,
 	return 0;
 }
 
+#define NOTE_ALIGN(sz) (((sz) + 3) & ~3)
+
+#define NT_GNU_BUILD_ID	3
+
+int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
+			size_t size, bool need_swap)
+{
+	struct {
+		u32 n_namesz;
+		u32 n_descsz;
+		u32 n_type;
+	} *nhdr;
+	void *ptr;
+
+	ptr = note_data;
+	while (ptr < (note_data + note_len)) {
+		const char *name;
+		size_t namesz, descsz;
+
+		nhdr = ptr;
+		if (need_swap) {
+			nhdr->n_namesz = bswap_32(nhdr->n_namesz);
+			nhdr->n_descsz = bswap_32(nhdr->n_descsz);
+			nhdr->n_type = bswap_32(nhdr->n_type);
+		}
+
+		namesz = NOTE_ALIGN(nhdr->n_namesz);
+		descsz = NOTE_ALIGN(nhdr->n_descsz);
+
+		ptr += sizeof(*nhdr);
+		name = ptr;
+		ptr += namesz;
+		if (nhdr->n_type == NT_GNU_BUILD_ID &&
+		    nhdr->n_namesz == sizeof("GNU")) {
+			if (memcmp(name, "GNU", sizeof("GNU")) == 0) {
+				size_t sz = min(size, descsz);
+
+				memcpy(bf, ptr, sz);
+				memset(bf + sz, 0, size - sz);
+				return 0;
+			}
+		}
+		ptr += descsz;
+	}
+
+	return -1;
+}
+
 const char *perf_tip(const char *dirpath)
 {
 	struct strlist *tips;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 9496365da3d7..27106548396b 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -47,6 +47,10 @@ extern int cacheline_size;
 
 int fetch_kernel_version(unsigned int *puint,
 			 char *str, size_t str_sz);
+
+int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
+			size_t size, bool need_swap);
+
 #define KVER_VERSION(x)		(((x) >> 16) & 0xff)
 #define KVER_PATCHLEVEL(x)	(((x) >> 8) & 0xff)
 #define KVER_SUBLEVEL(x)	((x) & 0xff)
-- 
2.13.6

^ permalink raw reply related

* [RFC 0/9] bpf: Add buildid check support
From: Jiri Olsa @ 2018-04-05 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: lkml, netdev, linux-kbuild, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Masahiro Yamada, Michal Marek,
	Jiri Kosina

hi,
eBPF programs loaded for kprobes are allowed to read kernel
internal structures. We check the provided kernel version
to ensure that the program is loaded for the proper kernel. 

The problem is that the version check is not enough, because
it only follows the version setup from kernel's Makefile.
However, the internal kernel structures change based on the
.config data, so in practise we have different kernels with
same version.

The eBPF kprobe program thus then get loaded for different
kernel than it's been built for, get wrong data (silently)
and provide misleading output.

This patchset implements additional check in eBPF loading code
on provided build ID (from kernel's elf image, .notes section
GNU build ID) to ensure we load the eBPF program on correct
kernel.

Also available in here (based on bpf-next/master):
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/checksum

This patchset consists of several changes:

- adding CONFIG_BUILDID_H option that instructs the build
  to generate uapi header file with build ID data, that
  will be included by eBPF program

- adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
  field to allow build ID checking when loading the eBPF
  program

- changing libbpf to read and pass build ID to the kernel

- several small side fixes

- example perf eBPF code in bpf-samples/bpf-stdout-example.c
  to show the build ID support/usage.

    # perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | grep buildid
    libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
    libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 6e25edeb408513184e2753bebad25d42314501a0

  The buildid is provided the same way we provide kernel
  version, in a special "buildid" section:

    # cat ./bpf-samples/bpf-stdout-example.c
    ...
    #include <linux/buildid.h>

    char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
    ...

  where LINUX_BUILDID_DATA is defined in the generated buildid.h.

please note it's an RFC ;-) any comments and suggestions are welcome

thanks,
jirka


---
Jiri Olsa (9):
      perf tools: Make read_build_id function public
      perf tools: Add fetch_kernel_buildid function
      kbuild: Do not pass arguments to link-vmlinux.sh
      kbuild: Add filechk2 function
      bpf: Add CONFIG_BUILDID_H option
      bpf: Add CONFIG_BPF_BUILDID_CHECK option
      libbpf: Synchronize uapi bpf.h header
      libbpf: Add support to attach buildid to program load
      perf tools: The buildid usage in example eBPF program

 Makefile                                    | 14 +++++++++++++-
 include/uapi/linux/bpf.h                    |  2 ++
 init/Kconfig                                | 12 ++++++++++++
 kernel/bpf/syscall.c                        | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 scripts/Kbuild.include                      | 24 ++++++++++++++++++++++++
 scripts/Makefile                            |  1 +
 scripts/extract-buildid.c                   | 42 ++++++++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/Makefile                  |  5 ++++-
 tools/include/uapi/linux/bpf.h              |  3 +++
 tools/lib/bpf/bpf.c                         |  6 ++++--
 tools/lib/bpf/bpf.h                         |  5 +++--
 tools/lib/bpf/libbpf.c                      | 46 ++++++++++++++++++++++++++++++++++++++++------
 tools/perf/bpf-samples/bpf-stdout-example.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/tests/bpf.c                      |  9 ++++++++-
 tools/perf/util/symbol-minimal.c            | 50 ++------------------------------------------------
 tools/perf/util/util.c                      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/util.h                      |  6 ++++++
 17 files changed, 355 insertions(+), 62 deletions(-)
 create mode 100644 scripts/extract-buildid.c
 create mode 100644 tools/perf/bpf-samples/bpf-stdout-example.c

^ permalink raw reply

* Re: [PATCH] net: thunderx: rework mac addresses list to u64 array
From: Christoph Hellwig @ 2018-04-05 15:07 UTC (permalink / raw)
  To: Vadim Lomovtsev
  Cc: sgoutham, sunil.kovvuri, robert.richter, linux-arm-kernel, netdev,
	linux-kernel, davem, dnelson, gustavo, Vadim Lomovtsev
In-Reply-To: <20180405145756.12633-1-Vadim.Lomovtsev@caviumnetworks.com>

>  struct xcast_addr_list {
> -	struct list_head list;
>  	int              count;
> +	u64              mc[0];

Please use the standard C99 syntax here:

	u64              mc[];

> +				mc_list = kmalloc(sizeof(*mc_list) +
> +						  sizeof(u64) * netdev_mc_count(netdev),
> +						  GFP_ATOMIC);

kmalloc_array(), please.

^ permalink raw reply

* Re: [PATCH] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-05 15:00 UTC (permalink / raw)
  To: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel, davem
  Cc: dnelson, gustavo, Vadim Lomovtsev
In-Reply-To: <20180405145756.12633-1-Vadim.Lomovtsev@caviumnetworks.com>

[correct Roberts' address]

On Thu, Apr 05, 2018 at 07:57:56AM -0700, Vadim Lomovtsev wrote:
> From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> 
> It is too expensive to pass u64 values via linked list, instead
> allocate array for them by overall number of mac addresses from netdev.
> 
> This eventually removes multiple kmalloc() calls, aviod memory
> fragmentation and allow to put single null check on kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h        |  7 +-----
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +++++++++---------------
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..da052159f014 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> -	struct list_head list;
> -	u64              addr;
> -};
> -
>  struct xcast_addr_list {
> -	struct list_head list;
>  	int              count;
> +	u64              mc[0];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..a26d8bc92e01 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  						  work.work);
>  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>  	union nic_mbx mbx = {};
> -	struct xcast_addr *xaddr, *next;
> +	u8 idx = 0;
>  
>  	if (!vf_work)
>  		return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  	/* check if we have any specific MACs to be added to PF DMAC filter */
>  	if (vf_work->mc) {
>  		/* now go through kernel list of MACs and add them one by one */
> -		list_for_each_entry_safe(xaddr, next,
> -					 &vf_work->mc->list, list) {
> +		for (idx = 0; idx < vf_work->mc->count; idx++) {
>  			mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> -			mbx.xcast.data.mac = xaddr->addr;
> +			mbx.xcast.data.mac = vf_work->mc->mc[idx];
>  			nicvf_send_msg_to_pf(nic, &mbx);
> -
> -			/* after receiving ACK from PF release memory */
> -			list_del(&xaddr->list);
> -			kfree(xaddr);
> -			vf_work->mc->count--;
>  		}
>  		kfree(vf_work->mc);
>  	}
> @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
>  			mode |= BGX_XCAST_MCAST_FILTER;
>  			/* here we need to copy mc addrs */
>  			if (netdev_mc_count(netdev)) {
> -				struct xcast_addr *xaddr;
> -
> -				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> -				INIT_LIST_HEAD(&mc_list->list);
> +				mc_list = kmalloc(sizeof(*mc_list) +
> +						  sizeof(u64) * netdev_mc_count(netdev),
> +						  GFP_ATOMIC);
> +				if (unlikely(!mc_list))
> +					return;
> +				mc_list->count = 0;
>  				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> -					xaddr = kmalloc(sizeof(*xaddr),
> -							GFP_ATOMIC);
> -					xaddr->addr =
> +					mc_list->mc[mc_list->count] =
>  						ether_addr_to_u64(ha->addr);
> -					list_add_tail(&xaddr->list,
> -						      &mc_list->list);
>  					mc_list->count++;
>  				}
>  			}
> -- 
> 2.14.3
> 

^ permalink raw reply

* [PATCH] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-05 14:57 UTC (permalink / raw)
  To: sgoutham, sunil.kovvuri, robert.richter, linux-arm-kernel, netdev,
	linux-kernel, davem
  Cc: dnelson, gustavo, Vadim Lomovtsev

From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>

It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.

This eventually removes multiple kmalloc() calls, aviod memory
fragmentation and allow to put single null check on kmalloc
return value in order to prevent a potential null pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nic.h        |  7 +-----
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +++++++++---------------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..da052159f014 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
 
 struct cavium_ptp;
 
-struct xcast_addr {
-	struct list_head list;
-	u64              addr;
-};
-
 struct xcast_addr_list {
-	struct list_head list;
 	int              count;
+	u64              mc[0];
 };
 
 struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..a26d8bc92e01 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
 						  work.work);
 	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
 	union nic_mbx mbx = {};
-	struct xcast_addr *xaddr, *next;
+	u8 idx = 0;
 
 	if (!vf_work)
 		return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
 	/* check if we have any specific MACs to be added to PF DMAC filter */
 	if (vf_work->mc) {
 		/* now go through kernel list of MACs and add them one by one */
-		list_for_each_entry_safe(xaddr, next,
-					 &vf_work->mc->list, list) {
+		for (idx = 0; idx < vf_work->mc->count; idx++) {
 			mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-			mbx.xcast.data.mac = xaddr->addr;
+			mbx.xcast.data.mac = vf_work->mc->mc[idx];
 			nicvf_send_msg_to_pf(nic, &mbx);
-
-			/* after receiving ACK from PF release memory */
-			list_del(&xaddr->list);
-			kfree(xaddr);
-			vf_work->mc->count--;
 		}
 		kfree(vf_work->mc);
 	}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
 			mode |= BGX_XCAST_MCAST_FILTER;
 			/* here we need to copy mc addrs */
 			if (netdev_mc_count(netdev)) {
-				struct xcast_addr *xaddr;
-
-				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-				INIT_LIST_HEAD(&mc_list->list);
+				mc_list = kmalloc(sizeof(*mc_list) +
+						  sizeof(u64) * netdev_mc_count(netdev),
+						  GFP_ATOMIC);
+				if (unlikely(!mc_list))
+					return;
+				mc_list->count = 0;
 				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-					xaddr = kmalloc(sizeof(*xaddr),
-							GFP_ATOMIC);
-					xaddr->addr =
+					mc_list->mc[mc_list->count] =
 						ether_addr_to_u64(ha->addr);
-					list_add_tail(&xaddr->list,
-						      &mc_list->list);
 					mc_list->count++;
 				}
 			}
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH] net: phy: marvell: Enable interrupt function on LED2 pin
From: Andrew Lunn @ 2018-04-05 14:43 UTC (permalink / raw)
  To: Bhadram Varka
  Cc: Esben Haabendal, Esben Haabendal, Rasmus Villemoes,
	Florian Fainelli, open list, netdev@vger.kernel.org
In-Reply-To: <4d9e0665abf7408988fc4ce20c26a08f@bgmail102.nvidia.com>

> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index 0e0978d8a0eb..f03a510f1247 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -457,6 +457,21 @@ static int marvell_of_reg_init(struct phy_device *phydev)  }  #endif /* CONFIG_OF_MDIO */
>  
> +static int m88e1318_config_intr(struct phy_device *phydev) {
> +	int err;
> +
> +	err = marvell_config_intr(phydev);
> +	if (err)
> +		return err;
> +
> +	/* Setup LED[2] as interrupt pin (active low) */
> +	return phy_modify(phydev, MII_88E1318S_PHY_LED_TCR,
> +			  MII_88E1318S_PHY_LED_TCR_FORCE_INT,
> +			  MII_88E1318S_PHY_LED_TCR_INTn_ENABLE |
> +			  MII_88E1318S_PHY_LED_TCR_INT_ACTIVE_LOW);
> 
> Can we move this part of the code to m88e1121_config_init() ?
> 
> Every time whether we disable or enable the interrupts this part of code will execute.

Yes, doing this once would be better. But please allow the LED pin to
be used as an LED when not using interrupts. phy_interrupt_is_valid()
should be involved somehow.

       Andrew

^ permalink raw reply

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Laurentiu Tudor @ 2018-04-05 14:43 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Stuart Yoder, Arnd Bergmann, Ioana Ciornei, gregkh,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <20180405124810.GE12178@lunn.ch>

Hi Andrew,

On 04/05/2018 03:48 PM, Andrew Lunn wrote:
>>> Hi Laurentiu
>>>
>>> So i can use switchdev without it? I can modprobe the switchdev
>>> driver, all the physical interfaces will appear, and i can use ip addr
>>> add etc. I do not need to use a user space tool at all in order to use
>>> the network functionality?
>>
>> Absolutely!
>
> Great.
>
> Then the easiest way forwards is to simply drop the IOCTL code for the
> moment. Get the basic support for the hardware into the kernel
> first. Then come back later to look at dynamic behaviour which needs
> some form of configuration.

Hmm, not sure I understand. We already have a fully functional ethernet 
driver [1] and a switch driver [2] ...

>> In normal use cases the system designer, depending on the requirements,
>> configures the various devices that it desires through a firmware
>> configuration (think something like a device tree). The devices
>> configured are presented on the mc-bus and probed normally by the
>> kernel. The standard networking linux tools can be used as expected.
>
> So what you should probably do is start a discussion on what this
> device tree binding looks like. But you need to be careful even
> here. Device tree describes the hardware, not how you configure the
> hardware. So maybe DT does not actually fit.

It's not an actual device tree, but a configuration file that happens to 
reuse the DTS format. I guess my analogy with a device tree was not the 
best.
Detailed documentation on the syntax can be found here [3], chapter 22.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/fsl-dpaa2/ethernet
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/fsl-dpaa2/ethsw
[3] https://www.nxp.com/docs/en/user-guide/DPAA2_UM.pdf

---
Best Regards, Laurentiu

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-05 14:41 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christian Brauner, ebiederm, davem, gregkh, netdev, linux-kernel,
	avagin, serge
In-Reply-To: <941de2b9-332f-75fc-f8ac-4059a9b5426f@virtuozzo.com>

On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> On 05.04.2018 17:07, Christian Brauner wrote:
> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >>>
> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >>> Over time the set of uevents that get sent into all network namespaces has
> >>> shrunk. We have now reached the point where hotplug events for all devices
> >>> that carry a namespace tag are filtered according to that namespace.
> >>>
> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >>> does not match the namespace tag of the netlink socket. One example are
> >>> network devices. Uevents for network devices only show up in the network
> >>> namespaces these devices are moved to or created in.
> >>>
> >>> However, any uevent for a kobject that does not have a namespace tag
> >>> associated with it will not be filtered and we will *try* to broadcast it
> >>> into all network namespaces.
> >>>
> >>> The original patchset was written in 2010 before user namespaces were a
> >>> thing. With the introduction of user namespaces sending out uevents became
> >>> partially isolated as they were filtered by user namespaces:
> >>>
> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >>>
> >>> if (!net_eq(sock_net(sk), p->net)) {
> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >>>                 return;
> >>>
> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >>>                 return;
> >>>
> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >>>                              CAP_NET_BROADCAST))
> >>>         j       return;
> >>> }
> >>>
> >>> The file_ns_capable() check will check whether the caller had
> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >>> namespace of interest. This check is fine in general but seems insufficient
> >>> to me when paired with uevents. The reason is that devices always belong to
> >>> the initial user namespace so uevents for kobjects that do not carry a
> >>> namespace tag should never be sent into another user namespace. This has
> >>> been the intention all along. But there's one case where this breaks,
> >>> namely if a new user namespace is created by root on the host and an
> >>> identity mapping is established between root on the host and root in the
> >>> new user namespace. Here's a reproducer:
> >>>
> >>>  sudo unshare -U --map-root
> >>>  udevadm monitor -k
> >>>  # Now change to initial user namespace and e.g. do
> >>>  modprobe kvm
> >>>  # or
> >>>  rmmod kvm
> >>>
> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >>> host. This seems very anecdotal given that in the general case user
> >>> namespaces do not see any uevents and also can't really do anything useful
> >>> with them.
> >>>
> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >>> namespace of the network namespace of the netlink socket) userspace process
> >>> make a decision what uevents should be sent.
> >>>
> >>> This makes me think that we should simply ensure that uevents for kobjects
> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >>> in kobj_bcast_filter(). Specifically:
> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >>>   event will always be filtered.
> >>> - If the network namespace the uevent socket belongs to was created in the
> >>>   initial user namespace but was opened from a non-initial user namespace
> >>>   the event will be filtered as well.
> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >>> always only sent to the initial user namespace. The regression potential
> >>> for this is near to non-existent since user namespaces can't really do
> >>> anything with interesting devices.
> >>>
> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >>> ---
> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >>> --- a/lib/kobject_uevent.c
> >>> +++ b/lib/kobject_uevent.c
> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >>>  		return sock_ns != ns;
> >>>  	}
> >>>  
> >>> -	return 0;
> >>> +	/*
> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >>> +	 * namespace below.
> >>> +	 */
> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >>> +		return 1;
> >>> +
> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >>>  }
> >>>  #endif
> >>
> >> So, this prohibits to listen events of all devices except network-related
> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> > 
> > No, this is not correct: As it is right now *without my patch* no
> > non-initial user namespace is receiving *any uevents* but those
> > specifically namespaced such as those for network devices. This patch
> > doesn't change that at all. The commit message outlines this in detail
> > how this comes about.
> > There is only one case where this currently breaks and this is as I
> > outlined explicitly in my commit message when you create a new user
> > namespace and map container(0) -> host(0). This patch fixes this.
> 
> Could you please point the place, where non-initial user namespaces are filtered?
> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> Now it will return 1 sometimes.

Oh sure, it's in the commit message though. The callchain is
lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
net/netlink/af_netlink.c:do_one_broadcast():

This codepiece will check whether the openened socket holds
CAP_NET_BROADCAST in the user namespace of the target network namespace
which it won't because we don't have device namespaces and all devices
belong to the initial set of namespaces.

        if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
                             CAP_NET_BROADCAST))
        j       return;

The only exception are network devices but they will pass the preceeding
if (!net_eq(sock_net(sk), p->net)) for any netlink uevent socket opened
in the same network namespace.

> 
> >> net-devices-only related interface and it's used for all devices in system.
> >> People may want to delegate block devices to nested user_ns, for example.
> > 
> > That's fine but that's why I added uevent injection in a previous patch
> > series: I repeat no non-initial user namespace will by default receive
> > uevents.
> 
> Thanks,
> Kirill

^ permalink raw reply

* Re: [PATCH net] arp: fix arp_filter on l3slave devices
From: David Ahern @ 2018-04-05 14:40 UTC (permalink / raw)
  To: Miguel Fadon Perlines, netdev, David Miller
In-Reply-To: <1522916738-192046-1-git-send-email-mfadon@teldat.com>

On 4/5/18 2:25 AM, Miguel Fadon Perlines wrote:
> arp_filter performs an ip_route_output search for arp source address and
> checks if output device is the same where the arp request was received,
> if it is not, the arp request is not answered.
> 
> This route lookup is always done on main route table so l3slave devices
> never find the proper route and arp is not answered.
> 
> Passing l3mdev_master_ifindex_rcu(dev) return value as oif fixes the
> lookup for l3slave devices while maintaining same behavior for non
> l3slave devices as this function returns 0 in that case.
> 
> Signed-off-by: Miguel Fadon Perlines <mfadon@teldat.com>
> ---
>  net/ipv4/arp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Acked-by: David Ahern <dsa@cumulusnetworks.com>

DaveM: this is a day 1 bug for VRF. Best guess at a Fixes tag would be:

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")

It would be good to get this into stable releases 4.9 and up. Thanks,

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Kirill Tkhai @ 2018-04-05 14:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: ebiederm, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180405140709.GA1697@gmail.com>

On 05.04.2018 17:07, Christian Brauner wrote:
> On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> On 04.04.2018 22:48, Christian Brauner wrote:
>>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>>>
>>> enabled sending hotplug events into all network namespaces back in 2010.
>>> Over time the set of uevents that get sent into all network namespaces has
>>> shrunk. We have now reached the point where hotplug events for all devices
>>> that carry a namespace tag are filtered according to that namespace.
>>>
>>> Specifically, they are filtered whenever the namespace tag of the kobject
>>> does not match the namespace tag of the netlink socket. One example are
>>> network devices. Uevents for network devices only show up in the network
>>> namespaces these devices are moved to or created in.
>>>
>>> However, any uevent for a kobject that does not have a namespace tag
>>> associated with it will not be filtered and we will *try* to broadcast it
>>> into all network namespaces.
>>>
>>> The original patchset was written in 2010 before user namespaces were a
>>> thing. With the introduction of user namespaces sending out uevents became
>>> partially isolated as they were filtered by user namespaces:
>>>
>>> net/netlink/af_netlink.c:do_one_broadcast()
>>>
>>> if (!net_eq(sock_net(sk), p->net)) {
>>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>>                 return;
>>>
>>>         if (!peernet_has_id(sock_net(sk), p->net))
>>>                 return;
>>>
>>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>>                              CAP_NET_BROADCAST))
>>>         j       return;
>>> }
>>>
>>> The file_ns_capable() check will check whether the caller had
>>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>>> namespace of interest. This check is fine in general but seems insufficient
>>> to me when paired with uevents. The reason is that devices always belong to
>>> the initial user namespace so uevents for kobjects that do not carry a
>>> namespace tag should never be sent into another user namespace. This has
>>> been the intention all along. But there's one case where this breaks,
>>> namely if a new user namespace is created by root on the host and an
>>> identity mapping is established between root on the host and root in the
>>> new user namespace. Here's a reproducer:
>>>
>>>  sudo unshare -U --map-root
>>>  udevadm monitor -k
>>>  # Now change to initial user namespace and e.g. do
>>>  modprobe kvm
>>>  # or
>>>  rmmod kvm
>>>
>>> will allow the non-initial user namespace to retrieve all uevents from the
>>> host. This seems very anecdotal given that in the general case user
>>> namespaces do not see any uevents and also can't really do anything useful
>>> with them.
>>>
>>> Additionally, it is now possible to send uevents from userspace. As such we
>>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>>> namespace of the network namespace of the netlink socket) userspace process
>>> make a decision what uevents should be sent.
>>>
>>> This makes me think that we should simply ensure that uevents for kobjects
>>> that do not carry a namespace tag are *always* filtered by user namespace
>>> in kobj_bcast_filter(). Specifically:
>>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>>   event will always be filtered.
>>> - If the network namespace the uevent socket belongs to was created in the
>>>   initial user namespace but was opened from a non-initial user namespace
>>>   the event will be filtered as well.
>>> Put another way, uevents for kobjects not carrying a namespace tag are now
>>> always only sent to the initial user namespace. The regression potential
>>> for this is near to non-existent since user namespaces can't really do
>>> anything with interesting devices.
>>>
>>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>>> ---
>>>  lib/kobject_uevent.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 15ea216a67ce..cb98cddb6e3b 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>>>  		return sock_ns != ns;
>>>  	}
>>>  
>>> -	return 0;
>>> +	/*
>>> +	 * The kobject does not carry a namespace tag so filter by user
>>> +	 * namespace below.
>>> +	 */
>>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>>> +		return 1;
>>> +
>>> +	/* Check if socket was opened from non-initial user namespace. */
>>> +	return sk_user_ns(dsk) != &init_user_ns;
>>>  }
>>>  #endif
>>
>> So, this prohibits to listen events of all devices except network-related
>> in containers? If it's so, I don't think it's a good solution. Uevents is not
> 
> No, this is not correct: As it is right now *without my patch* no
> non-initial user namespace is receiving *any uevents* but those
> specifically namespaced such as those for network devices. This patch
> doesn't change that at all. The commit message outlines this in detail
> how this comes about.
> There is only one case where this currently breaks and this is as I
> outlined explicitly in my commit message when you create a new user
> namespace and map container(0) -> host(0). This patch fixes this.

Could you please point the place, where non-initial user namespaces are filtered?
I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
Now it will return 1 sometimes.

>> net-devices-only related interface and it's used for all devices in system.
>> People may want to delegate block devices to nested user_ns, for example.
> 
> That's fine but that's why I added uevent injection in a previous patch
> series: I repeat no non-initial user namespace will by default receive
> uevents.

Thanks,
Kirill

^ permalink raw reply

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: gregkh @ 2018-04-05 14:19 UTC (permalink / raw)
  To: Laurentiu Tudor
  Cc: Andrew Lunn, Stuart Yoder, Arnd Bergmann, Ioana Ciornei,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <5AC62E2A.90900@nxp.com>

On Thu, Apr 05, 2018 at 02:09:47PM +0000, Laurentiu Tudor wrote:
> Hi Greg,
> 
> On 04/05/2018 03:30 PM, gregkh wrote:
> > On Thu, Apr 05, 2018 at 10:30:01AM +0000, Laurentiu Tudor wrote:
> >> Hello,
> >>
> >> My 2c below.
> >>
> >> On 04/04/2018 03:42 PM, Andrew Lunn wrote:
> >>>> I hear you.  It is more complicated this way...having all these individual
> >>>> objects vs just a single "bundle" of them that represents a NIC.  But, that's
> >>>> the way the DPAA2 hardware is, and we're implementing kernel support for
> >>>> the hardware as it is.
> >>>
> >>> Hi Stuart
> >>>
> >>> I see we are not making any progress here.
> >>>
> >>> So what i suggest is you post the kernel code and configuration tool
> >>> concept to netdev for a full review. You want reviews from David
> >>> Miller, Jiri Pirko, Jakub Kicinski, David Ahern, etc.
> >>>
> >>
> >> I think that the discussion steered too much towards networking related
> >> topics, while this ioctl doesn't have much to do with networking.
> >> It's just an ioctl for our mc-bus bus driver that is used to manage the
> >> devices on this bus through userspace tools.
> >> In addition, I'd drop any mention of our reference user space app
> >> (restool) to emphasize that this ioctl is not added just for a
> >> particular user space app. I think Stuart also mentioned this.
> >
> > I'm not going to take a "generic device configuration ioctl" patch
> > unless it is documented to all exactly what it does, and why it is
> > there.
> 
> The ioctl() is just a simple pass-through interface to the firmware.

Ah, so a new syscall?  :)

> It passes commands to the firmware and returns the response back to the 
> userspace. Thus the ABI used by the firmware applies for this ioctl() 
> and it is documented in detail here:
> 
> https://www.nxp.com/docs/en/user-guide/DPAA2_UM.pdf

Let's wait on this until people all agree that it's ok to expose this
directly.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Laurentiu Tudor @ 2018-04-05 14:09 UTC (permalink / raw)
  To: gregkh
  Cc: Andrew Lunn, Stuart Yoder, Arnd Bergmann, Ioana Ciornei,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <20180405123018.GA17751@kroah.com>

Hi Greg,

On 04/05/2018 03:30 PM, gregkh wrote:
> On Thu, Apr 05, 2018 at 10:30:01AM +0000, Laurentiu Tudor wrote:
>> Hello,
>>
>> My 2c below.
>>
>> On 04/04/2018 03:42 PM, Andrew Lunn wrote:
>>>> I hear you.  It is more complicated this way...having all these individual
>>>> objects vs just a single "bundle" of them that represents a NIC.  But, that's
>>>> the way the DPAA2 hardware is, and we're implementing kernel support for
>>>> the hardware as it is.
>>>
>>> Hi Stuart
>>>
>>> I see we are not making any progress here.
>>>
>>> So what i suggest is you post the kernel code and configuration tool
>>> concept to netdev for a full review. You want reviews from David
>>> Miller, Jiri Pirko, Jakub Kicinski, David Ahern, etc.
>>>
>>
>> I think that the discussion steered too much towards networking related
>> topics, while this ioctl doesn't have much to do with networking.
>> It's just an ioctl for our mc-bus bus driver that is used to manage the
>> devices on this bus through userspace tools.
>> In addition, I'd drop any mention of our reference user space app
>> (restool) to emphasize that this ioctl is not added just for a
>> particular user space app. I think Stuart also mentioned this.
>
> I'm not going to take a "generic device configuration ioctl" patch
> unless it is documented to all exactly what it does, and why it is
> there.

The ioctl() is just a simple pass-through interface to the firmware.
It passes commands to the firmware and returns the response back to the 
userspace. Thus the ABI used by the firmware applies for this ioctl() 
and it is documented in detail here:

https://www.nxp.com/docs/en/user-guide/DPAA2_UM.pdf

---
Best Regards, Laurentiu

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-05 14:07 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: ebiederm, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <442e89b8-e947-6eeb-1bcb-fa28f22a25f0@virtuozzo.com>

On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> On 04.04.2018 22:48, Christian Brauner wrote:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> > 
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk. We have now reached the point where hotplug events for all devices
> > that carry a namespace tag are filtered according to that namespace.
> > 
> > Specifically, they are filtered whenever the namespace tag of the kobject
> > does not match the namespace tag of the netlink socket. One example are
> > network devices. Uevents for network devices only show up in the network
> > namespaces these devices are moved to or created in.
> > 
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will *try* to broadcast it
> > into all network namespaces.
> > 
> > The original patchset was written in 2010 before user namespaces were a
> > thing. With the introduction of user namespaces sending out uevents became
> > partially isolated as they were filtered by user namespaces:
> > 
> > net/netlink/af_netlink.c:do_one_broadcast()
> > 
> > if (!net_eq(sock_net(sk), p->net)) {
> >         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >                 return;
> > 
> >         if (!peernet_has_id(sock_net(sk), p->net))
> >                 return;
> > 
> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >                              CAP_NET_BROADCAST))
> >         j       return;
> > }
> > 
> > The file_ns_capable() check will check whether the caller had
> > CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> > namespace of interest. This check is fine in general but seems insufficient
> > to me when paired with uevents. The reason is that devices always belong to
> > the initial user namespace so uevents for kobjects that do not carry a
> > namespace tag should never be sent into another user namespace. This has
> > been the intention all along. But there's one case where this breaks,
> > namely if a new user namespace is created by root on the host and an
> > identity mapping is established between root on the host and root in the
> > new user namespace. Here's a reproducer:
> > 
> >  sudo unshare -U --map-root
> >  udevadm monitor -k
> >  # Now change to initial user namespace and e.g. do
> >  modprobe kvm
> >  # or
> >  rmmod kvm
> > 
> > will allow the non-initial user namespace to retrieve all uevents from the
> > host. This seems very anecdotal given that in the general case user
> > namespaces do not see any uevents and also can't really do anything useful
> > with them.
> > 
> > Additionally, it is now possible to send uevents from userspace. As such we
> > can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> > namespace of the network namespace of the netlink socket) userspace process
> > make a decision what uevents should be sent.
> > 
> > This makes me think that we should simply ensure that uevents for kobjects
> > that do not carry a namespace tag are *always* filtered by user namespace
> > in kobj_bcast_filter(). Specifically:
> > - If the owning user namespace of the uevent socket is not init_user_ns the
> >   event will always be filtered.
> > - If the network namespace the uevent socket belongs to was created in the
> >   initial user namespace but was opened from a non-initial user namespace
> >   the event will be filtered as well.
> > Put another way, uevents for kobjects not carrying a namespace tag are now
> > always only sent to the initial user namespace. The regression potential
> > for this is near to non-existent since user namespaces can't really do
> > anything with interesting devices.
> > 
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  lib/kobject_uevent.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..cb98cddb6e3b 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  		return sock_ns != ns;
> >  	}
> >  
> > -	return 0;
> > +	/*
> > +	 * The kobject does not carry a namespace tag so filter by user
> > +	 * namespace below.
> > +	 */
> > +	if (sock_net(dsk)->user_ns != &init_user_ns)
> > +		return 1;
> > +
> > +	/* Check if socket was opened from non-initial user namespace. */
> > +	return sk_user_ns(dsk) != &init_user_ns;
> >  }
> >  #endif
> 
> So, this prohibits to listen events of all devices except network-related
> in containers? If it's so, I don't think it's a good solution. Uevents is not

No, this is not correct: As it is right now *without my patch* no
non-initial user namespace is receiving *any uevents* but those
specifically namespaced such as those for network devices. This patch
doesn't change that at all. The commit message outlines this in detail
how this comes about.
There is only one case where this currently breaks and this is as I
outlined explicitly in my commit message when you create a new user
namespace and map container(0) -> host(0). This patch fixes this.

> net-devices-only related interface and it's used for all devices in system.
> People may want to delegate block devices to nested user_ns, for example.

That's fine but that's why I added uevent injection in a previous patch
series: I repeat no non-initial user namespace will by default receive
uevents.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH 00/12] Ethernet: Add and use ether_<type>_addr globals
From: Felix Fietkau @ 2018-04-05 14:05 UTC (permalink / raw)
  To: Joe Perches, netdev, linux-wireless, b43-dev, bridge,
	netfilter-devel, coreteam
  Cc: linux-kernel, brcm80211-dev-list.pdl, brcm80211-dev-list
In-Reply-To: <1522936292.11185.15.camel@perches.com>

On 2018-04-05 15:51, Joe Perches wrote:
>>  You have to factor in
>> not just the .text size, but the fact that referencing an exported
>> symbol needs a .reloc entry as well, which also eats up some space (at
>> least when the code is being built as module).
> 
> Thanks, the modules I built got smaller.
Please post some numbers to show this. By the way, on other
architectures the numbers will probably be different, especially on
ARM/MIPS.
>> In my opinion, your series probably causes more bloat in common
>> configurations instead of reducing it.
>> 
>> You're also touching several places that could easily use
>> eth_broadcast_addr and eth_zero_addr. I think making those changes would
>> be more productive than what you did in this series.
> 
> Doubtful. AFAIK: possible unaligned addresses.
Those two are just memset calls, alignment does not matter.

- Felix

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox