Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf 0/2] bpf: fix bug in x64 JIT
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team

Fix bug in computation of the jump offset to 1st instruction in BPF JIT
and add 2 tests.

Alexei Starovoitov (2):
  bpf: fix x64 JIT code generation for jmp to 1st insn
  selftests/bpf: tests for jmp to 1st insn

 arch/x86/net/bpf_jit_comp.c                   |  7 +++--
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++++++++++++++++++
 2 files changed, 32 insertions(+), 3 deletions(-)

-- 
2.20.0


^ permalink raw reply

* [PATCH bpf 2/2] selftests/bpf: tests for jmp to 1st insn
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190731013827.2445262-1-ast@kernel.org>

Add 2 tests that check JIT code generation to jumps to 1st insn.
1st test is similar to syzbot reproducer.
The backwards branch is never taken at runtime.
2nd test has branch to 1st insn that executes.
The test is written as two bpf functions, since it's not possible
to construct valid single bpf program that jumps to 1st insn.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/bpf/verifier/loops1.c b/tools/testing/selftests/bpf/verifier/loops1.c
index 5e980a5ab69d..1fc4e61e9f9f 100644
--- a/tools/testing/selftests/bpf/verifier/loops1.c
+++ b/tools/testing/selftests/bpf/verifier/loops1.c
@@ -159,3 +159,31 @@
 	.errstr = "loop detected",
 	.prog_type = BPF_PROG_TYPE_TRACEPOINT,
 },
+{
+	"not-taken loop with back jump to 1st insn",
+	.insns = {
+	BPF_MOV64_IMM(BPF_REG_0, 123),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 4, -2),
+	BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_XDP,
+	.retval = 123,
+},
+{
+	"taken loop with back jump to 1st insn",
+	.insns = {
+	BPF_MOV64_IMM(BPF_REG_1, 10),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_1),
+	BPF_ALU64_IMM(BPF_SUB, BPF_REG_1, 1),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -3),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+	BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_XDP,
+	.retval = 55,
+},
-- 
2.20.0


^ permalink raw reply related

* [PATCH bpf 1/2] bpf: fix x64 JIT code generation for jmp to 1st insn
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190731013827.2445262-1-ast@kernel.org>

Introduction of bounded loops exposed old bug in x64 JIT.
JIT maintains the array of offsets to the end of all instructions to
compute jmp offsets.
addrs[0] - offset of the end of the 1st insn (that includes prologue).
addrs[1] - offset of the end of the 2nd insn.
JIT didn't keep the offset of the beginning of the 1st insn,
since classic BPF didn't have backward jumps and valid extended BPF
couldn't have a branch to 1st insn, because it didn't allow loops.
With bounded loops it's possible to construct a valid program that
jumps backwards to the 1st insn.
Fix JIT by computing:
addrs[0] - offset of the end of prologue == start of the 1st insn.
addrs[1] - offset of the end of 1st insn.

Reported-by: syzbot+35101610ff3e83119b1b@syzkaller.appspotmail.com
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index eaaed5bfc4a4..a56c95805732 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -390,8 +390,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog));
+	addrs[0] = prog - temp;
 
-	for (i = 0; i < insn_cnt; i++, insn++) {
+	for (i = 1; i <= insn_cnt; i++, insn++) {
 		const s32 imm32 = insn->imm;
 		u32 dst_reg = insn->dst_reg;
 		u32 src_reg = insn->src_reg;
@@ -1105,7 +1106,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		extra_pass = true;
 		goto skip_init_addrs;
 	}
-	addrs = kmalloc_array(prog->len, sizeof(*addrs), GFP_KERNEL);
+	addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
 	if (!addrs) {
 		prog = orig_prog;
 		goto out_addrs;
@@ -1115,7 +1116,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	 * Before first pass, make a rough estimation of addrs[]
 	 * each BPF instruction is translated to less than 64 bytes
 	 */
-	for (proglen = 0, i = 0; i < prog->len; i++) {
+	for (proglen = 0, i = 0; i <= prog->len; i++) {
 		proglen += 64;
 		addrs[i] = proglen;
 	}
-- 
2.20.0


^ permalink raw reply related

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-07-31  1:36 UTC (permalink / raw)
  To: Tao Ren
  Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
	David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
	Andrew Jeffery, openbmc@lists.ozlabs.org
In-Reply-To: <e8f85ef3-1216-8efb-a54d-84426234fe82@fb.com>

> The INTF_SEL pins report correct mode (RGMII-Fiber) on my machine,
> but there are 2 "sub-modes" (1000Base-X and 100Base-FX) and I
> couldn't find a proper/safe way to auto-detect which "sub-mode" is
> active. The datasheet just describes instructions to enable a
> specific mode, but it doesn't say 1000Base-X/100Base-FX mode will be
> auto-selected. And that's why I came up with the patch to specify
> 1000Base-X mode.

Fibre does not perform any sort of auto-negotiation. I assume you have
an SFP connected? When using PHYLINK, the sfp driver will get the
supported baud rate from SFP EEPROM to determine what speed could be
used. However, there is currently no mainline support for having a
chain MAC-PHY-SFP. For that you need Russells out of tree patches.

      Andrew

^ permalink raw reply

* [PATCH v2 net] hv_sock: Fix hang when a connection is closed
From: Dexuan Cui @ 2019-07-31  1:25 UTC (permalink / raw)
  To: Sunil Muthuswamy, David Miller, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com

There is a race condition for an established connection that is being closed
by the guest: the refcnt is 4 at the end of hvs_release() (Note: here the
'remove_sock' is false):

1 for the initial value;
1 for the sk being in the bound list;
1 for the sk being in the connected list;
1 for the delayed close_work.

After hvs_release() finishes, __vsock_release() -> sock_put(sk) *may*
decrease the refcnt to 3.

Concurrently, hvs_close_connection() runs in another thread:
  calls vsock_remove_sock() to decrease the refcnt by 2;
  call sock_put() to decrease the refcnt to 0, and free the sk;
  next, the "release_sock(sk)" may hang due to use-after-free.

In the above, after hvs_release() finishes, if hvs_close_connection() runs
faster than "__vsock_release() -> sock_put(sk)", then there is not any issue,
because at the beginning of hvs_close_connection(), the refcnt is still 4.

The issue can be resolved if an extra reference is taken when the
connection is established.

Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org

---

Changes in v2: 
  Changed the location of the sock_hold() call. 
  Updated the changelog accordingly.

  Thanks Sunil for the suggestion!

With the proper kernel debugging options enabled, first a warning can
appear:

kworker/1:0/4467 is freeing memory ..., with a lock still held there!
stack backtrace:
Workqueue: events vmbus_onmessage_work [hv_vmbus]
Call Trace:
 dump_stack+0x67/0x90
 debug_check_no_locks_freed.cold.52+0x78/0x7d
 slab_free_freelist_hook+0x85/0x140
 kmem_cache_free+0xa5/0x380
 __sk_destruct+0x150/0x260
 hvs_close_connection+0x24/0x30 [hv_sock]
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

and then the following release_sock(sk) can hang:

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
...
irq event stamp: 62890
CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: G        W         5.2.0+ #39
Workqueue: events vmbus_onmessage_work [hv_vmbus]
RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
...
Call Trace:
 do_raw_spin_lock+0xab/0xb0
 release_sock+0x19/0xb0
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

 net/vmw_vsock/hyperv_transport.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index f2084e3f7aa4..9d864ebeb7b3 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -312,6 +312,11 @@ static void hvs_close_connection(struct vmbus_channel *chan)
 	lock_sock(sk);
 	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
+
+	/* Release the refcnt for the channel that's opened in
+	 * hvs_open_connection().
+	 */
+	sock_put(sk);
 }

 static void hvs_open_connection(struct vmbus_channel *chan)
@@ -407,6 +412,9 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 	}

 	set_per_channel_state(chan, conn_from_host ? new : sk);
+
+	/* This reference will be dropped by hvs_close_connection(). */
+	sock_hold(conn_from_host ? new : sk);
 	vmbus_set_chn_rescind_callback(chan, hvs_close_connection);

 	/* Set the pending send size to max packet size to always get
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-31  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, David Miller, netdev
In-Reply-To: <20190731002338.d4lp2grsmm3aaav3@ast-mbp>

On Tue, 30 Jul 2019 17:23:39 -0700, Alexei Starovoitov wrote:
> On Tue, Jul 30, 2019 at 05:07:25PM -0700, Jakub Kicinski wrote:
> > Nothing meaning you disagree it's duplicated effort and unnecessary 
> > LoC the community has to maintain, review, test..?  
> 
> I don't see duplicated effort.

Could you walk me through a scenario where, say, a new xdp attachment
flag is added? How can there be support in both bpftool and iproute2 
for specifying it without modifying both?

How can we say we want to make iproute2 use libbpf instead of it's own
library to avoid duplicated effort on the back end, and at the same time
claim having two tools do the same thing is no duplication 🤯

> > Duplicating the same features in bpftool will only diminish the
> > incentive for moving iproute2 to libbpf.   
> 
> not at all. why do you think so?

Because iproute2's BPF has fallen behind so the simplest thing is to
just contribute to bpftool. But iproute2 is the tool set for Linux
networking, we can't let it bit rot :(

Admittedly that's just me reading the tea leaves.

> > And for folks who deal
> > with a wide variety of customers, often novices maintaining two
> > ways of doing the same thing is a hassle :(  
> 
> It's not the same thing.
> I'm talking about my experience dealing with 'wide variety of bpf customers'.
> They only have a fraction of their time to learn one tool.
> Making every bpf customer learn ten tools is not an option.

IMHO vaguely competent users of Linux networking will know ip link.
If they are not vaguely competent, they should not attach XDP programs
to interfaces by hand...

> > Let's just put the two commands next to each other:
> > 
> >        ip link set xdp $PROG dev $DEV
> > 
> > bpftool net attach xdp $PROG dev $DEV
> > 
> > Are they that different?  
> 
> yes.
> they're different tools with they own upgrade/rollout cycle

I think this came up when bpftool net was added and I never really
understood it.

> > > If bpftool was taught to do equivalent of 'ip link' that would be
> > > very different story and I would be opposed to that.  
> > 
> > Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
> > of a judgement call.  
> 
> bpftool must be able to introspect every aspect of bpf programming.
> That includes detaching and attaching anywhere.
> Anyone doing 'bpftool p s' should be able to switch off particular
> prog id without learning ten different other tools.

I think the fact that we already have an implementation in iproute2,
which is at the risk of bit rot is more important to me that the
hypothetical scenario where everyone knows to just use bpftool (for
XDP, for TC it's still iproute2 unless there's someone crazy enough 
to reimplement the TC functionality :))

I'm not sure we can settle our differences over email :)
I have tremendous respect for all the maintainers I CCed here, 
if nobody steps up to agree with me I'll concede the bpftool net
battle entirely :)

^ permalink raw reply

* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Shannon Nelson @ 2019-07-31  1:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem
In-Reply-To: <20190730163947.3e730f25@hermes.lan>

On 7/30/19 4:39 PM, Stephen Hemminger wrote:
> On Mon, 22 Jul 2019 14:40:19 -0700
> Shannon Nelson <snelson@pensando.io> wrote:
>
>> +
>> +static void ionic_lif_set_netdev_info(struct lif *lif)
>> +{
>> +	struct ionic_admin_ctx ctx = {
>> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
>> +		.cmd.lif_setattr = {
>> +			.opcode = CMD_OPCODE_LIF_SETATTR,
>> +			.index = cpu_to_le16(lif->index),
>> +			.attr = IONIC_LIF_ATTR_NAME,
>> +		},
>> +	};
>> +
>> +	strlcpy(ctx.cmd.lif_setattr.name, lif->netdev->name,
>> +		sizeof(ctx.cmd.lif_setattr.name));
>> +
>> +	dev_info(lif->ionic->dev, "NETDEV_CHANGENAME %s %s\n",
>> +		 lif->name, ctx.cmd.lif_setattr.name);
>> +
> There is already a kernel message for this. Repeating the same thing in the
> driver is redundant.

True, except for the lif name information.  But since that really is 
debugging information, and I was getting tired of seeing those redundant 
messages, I was planning to remove them soon anyway. Thanks for the 
extra nudge.

sln


^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-07-31  1:00 UTC (permalink / raw)
  To: Song Liu
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <4AB53FC1-5390-4BC7-83B4-7DDBAFD78ABC@fb.com>

On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> > tools/lib/bpf/libbpf.h |   1 +
> > 2 files changed, 909 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c

[...]

Please trim irrelevant parts. It doesn't matter with desktop Gmail,
but pretty much everywhere else is very hard to work with.

> > +
> > +     for (i = 1; i < spec->raw_len; i++) {
> > +             t = skip_mods_and_typedefs(btf, id, &id);
> > +             if (!t)
> > +                     return -EINVAL;
> > +
> > +             access_idx = spec->raw_spec[i];
> > +
> > +             if (btf_is_composite(t)) {
> > +                     const struct btf_member *m = (void *)(t + 1);
>
> Why (void *) instead of (const struct btf_member *)? There are a few more
> in the rest of the patch.
>

I just picked the most succinct and non-repetitive form. It's
immediately apparent which type it's implicitly converted to, so I
felt there is no need to repeat it. Also, just (void *) is much
shorter. :)


> > +                     __u32 offset;
> > +
> > +                     if (access_idx >= BTF_INFO_VLEN(t->info))
> > +                             return -EINVAL;
> > +

[...]

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31  0:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, andrii.nakryiko@gmail.com,
	Kernel Team
In-Reply-To: <20190730195408.670063-3-andriin@fb.com>



> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> tools/lib/bpf/libbpf.h |   1 +
> 2 files changed, 909 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ead915aec349..75da90928257 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/vfs.h>
> +#include <sys/utsname.h>
> #include <tools/libc_compat.h>
> #include <libelf.h>
> #include <gelf.h>
> @@ -47,6 +48,7 @@
> #include "btf.h"
> #include "str_error.h"
> #include "libbpf_internal.h"
> +#include "hashmap.h"
> 
> #ifndef EM_BPF
> #define EM_BPF 247
> @@ -1015,17 +1017,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
> 	return 0;
> }
> 
> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> -						     __u32 id)
> +static const struct btf_type *
> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> {
> 	const struct btf_type *t = btf__type_by_id(btf, id);
> 
> +	if (res_id)
> +		*res_id = id;
> +
> 	while (true) {
> 		switch (BTF_INFO_KIND(t->info)) {
> 		case BTF_KIND_VOLATILE:
> 		case BTF_KIND_CONST:
> 		case BTF_KIND_RESTRICT:
> 		case BTF_KIND_TYPEDEF:
> +			if (res_id)
> +				*res_id = t->type;
> 			t = btf__type_by_id(btf, t->type);
> 			break;
> 		default:
> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> 			      const struct btf_type *def,
> 			      const struct btf_member *m, __u32 *res) {
> -	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
> +	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> 	const char *name = btf__name_by_offset(btf, m->name_off);
> 	const struct btf_array *arr_info;
> 	const struct btf_type *arr_t;
> @@ -1110,7 +1117,7 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
> 		return -EOPNOTSUPP;
> 	}
> 
> -	def = skip_mods_and_typedefs(obj->btf, var->type);
> +	def = skip_mods_and_typedefs(obj->btf, var->type, NULL);
> 	if (BTF_INFO_KIND(def->info) != BTF_KIND_STRUCT) {
> 		pr_warning("map '%s': unexpected def kind %u.\n",
> 			   map_name, BTF_INFO_KIND(var->info));
> @@ -2292,6 +2299,893 @@ bpf_program_reloc_btf_ext(struct bpf_program *prog, struct bpf_object *obj,
> 	return 0;
> }
> 
> +#define BPF_CORE_SPEC_MAX_LEN 64
> +
> +/* represents BPF CO-RE field or array element accessor */
> +struct bpf_core_accessor {
> +	__u32 type_id;		/* struct/union type or array element type */
> +	__u32 idx;		/* field index or array index */
> +	const char *name;	/* field name or NULL for array accessor */
> +};
> +
> +struct bpf_core_spec {
> +	const struct btf *btf;
> +	/* high-level spec: named fields and array indices only */
> +	struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* high-level spec length */
> +	int len;
> +	/* raw, low-level spec: 1-to-1 with accessor spec string */
> +	int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* raw spec length */
> +	int raw_len;
> +	/* field byte offset represented by spec */
> +	__u32 offset;
> +};
> +
> +static bool str_is_empty(const char *s)
> +{
> +	return !s || !s[0];
> +}
> +
> +static int btf_kind(const struct btf_type *t)
> +{
> +	return BTF_INFO_KIND(t->info);
> +}
> +
> +static bool btf_is_composite(const struct btf_type *t)
> +{
> +	int kind = btf_kind(t);
> +
> +	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> +}
> +
> +static bool btf_is_array(const struct btf_type *t)
> +{
> +	return btf_kind(t) == BTF_KIND_ARRAY;
> +}
> +
> +/* 
> + * Turn bpf_offset_reloc into a low- and high-level spec representation,
> + * validating correctness along the way, as well as calculating resulting
> + * field offset (in bytes), specified by accessor string. Low-level spec
> + * captures every single level of nestedness, including traversing anonymous
> + * struct/union members. High-level one only captures semantically meaningful
> + * "turning points": named fields and array indicies.
> + * E.g., for this case:
> + *
> + *   struct sample {
> + *       int __unimportant;
> + *       struct {
> + *           int __1;
> + *           int __2;
> + *           int a[7];
> + *       };
> + *   };
> + *
> + *   struct sample *s = ...;
> + *
> + *   int x = &s->a[3]; // access string = '0:1:2:3'
> + *
> + * Low-level spec has 1:1 mapping with each element of access string (it's
> + * just a parsed access string representation): [0, 1, 2, 3].
> + *
> + * High-level spec will capture only 3 points:
> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> + *   - field 'a' access (corresponds to '2' in low-level spec);
> + *   - array element #3 access (corresponds to '3' in low-level spec).
> + *
> + */
> +static int bpf_core_spec_parse(const struct btf *btf,
> +			       __u32 type_id,
> +			       const char *spec_str,
> +			       struct bpf_core_spec *spec)
> +{
> +	int access_idx, parsed_len, i;
> +	const struct btf_type *t;
> +	const char *name;
> +	__u32 id;
> +	__s64 sz;
> +
> +	if (str_is_empty(spec_str) || *spec_str == ':')
> +		return -EINVAL;
> +
> +	memset(spec, 0, sizeof(*spec));
> +	spec->btf = btf;
> +
> +	/* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> +	while (*spec_str) {
> +		if (*spec_str == ':')
> +			++spec_str;
> +		if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> +			return -EINVAL;
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +		spec_str += parsed_len;
> +		spec->raw_spec[spec->raw_len++] = access_idx;
> +	}
> +
> +	if (spec->raw_len == 0)
> +		return -EINVAL;
> +
> +	/* first spec value is always reloc type array index */
> +	t = skip_mods_and_typedefs(btf, type_id, &id);
> +	if (!t)
> +		return -EINVAL;
> +
> +	access_idx = spec->raw_spec[0];
> +	spec->spec[0].type_id = id;
> +	spec->spec[0].idx = access_idx;
> +	spec->len++;
> +
> +	sz = btf__resolve_size(btf, id);
> +	if (sz < 0)
> +		return sz;
> +	spec->offset = access_idx * sz;
> +
> +	for (i = 1; i < spec->raw_len; i++) {
> +		t = skip_mods_and_typedefs(btf, id, &id);
> +		if (!t)
> +			return -EINVAL;
> +
> +		access_idx = spec->raw_spec[i];
> +
> +		if (btf_is_composite(t)) {
> +			const struct btf_member *m = (void *)(t + 1);

Why (void *) instead of (const struct btf_member *)? There are a few more
in the rest of the patch. 

> +			__u32 offset;
> +
> +			if (access_idx >= BTF_INFO_VLEN(t->info))
> +				return -EINVAL;
> +
> +			m = &m[access_idx];
> +
> +			if (BTF_INFO_KFLAG(t->info)) {
> +				if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +					return -EINVAL;
> +				offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +			} else {
> +				offset = m->offset;
> +			}
> +			if (m->offset % 8)
> +				return -EINVAL;
> +			spec->offset += offset / 8;
> +
> +			if (m->name_off) {
> +				name = btf__name_by_offset(btf, m->name_off);
> +				if (str_is_empty(name))
> +					return -EINVAL;
> +
> +				spec->spec[spec->len].type_id = id;
> +				spec->spec[spec->len].idx = access_idx;
> +				spec->spec[spec->len].name = name;
> +				spec->len++;
> +			}
> +
> +			id = m->type;
> +		} else if (btf_is_array(t)) {
> +			const struct btf_array *a = (void *)(t + 1);
> +
> +			t = skip_mods_and_typedefs(btf, a->type, &id);
> +			if (!t || access_idx >= a->nelems)
> +				return -EINVAL;
> +
> +			spec->spec[spec->len].type_id = id;
> +			spec->spec[spec->len].idx = access_idx;
> +			spec->len++;
> +
> +			sz = btf__resolve_size(btf, id);
> +			if (sz < 0)
> +				return sz;
> +			spec->offset += access_idx * sz;
> +		} else {
> +			pr_warning("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n",
> +				   type_id, spec_str, i, id, btf_kind(t));
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/* Given 'some_struct_name___with_flavor' return the length of a name prefix
> + * before last triple underscore. Struct name part after last triple
> + * underscore is ignored by BPF CO-RE relocation during relocation matching.
> + */
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> +	size_t n = strlen(name);
> +	int i = n - 3;
> +
> +	while (i > 0) {
> +		if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> +			return i;
> +		i--;
> +	}
> +	return n;
> +}
> +
> +/* dynamically sized list of type IDs */
> +struct ids_vec {
> +	__u32 *data;
> +	int len;
> +};
> +
> +static void bpf_core_free_cands(struct ids_vec *cand_ids)
> +{
> +	free(cand_ids->data);
> +	free(cand_ids);
> +}
> +
> +static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf,
> +					   __u32 local_type_id,
> +					   const struct btf *targ_btf)
> +{
> +	size_t local_essent_len, targ_essent_len;
> +	const char *local_name, *targ_name;
> +	const struct btf_type *t;
> +	struct ids_vec *cand_ids;
> +	__u32 *new_ids;
> +	int i, err, n;
> +
> +	t = btf__type_by_id(local_btf, local_type_id);
> +	if (!t)
> +		return ERR_PTR(-EINVAL);
> +
> +	local_name = btf__name_by_offset(local_btf, t->name_off);
> +	if (str_is_empty(local_name))
> +		return ERR_PTR(-EINVAL);
> +	local_essent_len = bpf_core_essential_name_len(local_name);
> +
> +	cand_ids = calloc(1, sizeof(*cand_ids));
> +	if (!cand_ids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	n = btf__get_nr_types(targ_btf);
> +	for (i = 1; i <= n; i++) {
> +		t = btf__type_by_id(targ_btf, i);
> +		targ_name = btf__name_by_offset(targ_btf, t->name_off);
> +		if (str_is_empty(targ_name))
> +			continue;
> +
> +		targ_essent_len = bpf_core_essential_name_len(targ_name);
> +		if (targ_essent_len != local_essent_len)
> +			continue;
> +
> +		if (strncmp(local_name, targ_name, local_essent_len) == 0) {
> +			pr_debug("[%d] (%s): found candidate [%d] (%s)\n",
> +				 local_type_id, local_name, i, targ_name);
> +			new_ids = realloc(cand_ids->data, cand_ids->len + 1);
> +			if (!new_ids) {
> +				err = -ENOMEM;
> +				goto err_out;
> +			}
> +			cand_ids->data = new_ids;
> +			cand_ids->data[cand_ids->len++] = i;
> +		}
> +	}
> +	return cand_ids;
> +err_out:
> +	bpf_core_free_cands(cand_ids);
> +	return ERR_PTR(err);
> +}
> +
> +/* Check two types for compatibility, skipping const/volatile/restrict and
> + * typedefs, to ensure we are relocating offset to the compatible entities:
> + *   - any two STRUCTs/UNIONs are compatible and can be mixed;
> + *   - any two FWDs are compatible;
> + *   - any two PTRs are always compatible;
> + *   - for ENUMs, check sizes, names are ignored;
> + *   - for INT, size and bitness should match, signedness is ignored;
> + *   - for ARRAY, dimensionality is ignored, element types are checked for
> + *     compatibility recursively;
> + *   - everything else shouldn't be ever a target of relocation.
> + * These rules are not set in stone and probably will be adjusted as we get
> + * more experience with using BPF CO-RE relocations.
> + */
> +static int bpf_core_fields_are_compat(const struct btf *local_btf,
> +				      __u32 local_id,
> +				      const struct btf *targ_btf,
> +				      __u32 targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	__u16 kind;
> +
> +recur:
> +	local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id);
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!local_type || !targ_type)
> +		return -EINVAL;
> +
> +	if (btf_is_composite(local_type) && btf_is_composite(targ_type))
> +		return 1;
> +	if (BTF_INFO_KIND(local_type->info) != BTF_INFO_KIND(targ_type->info))
> +		return 0;
> +
> +	kind = BTF_INFO_KIND(local_type->info);
> +	switch (kind) {
> +	case BTF_KIND_FWD:
> +	case BTF_KIND_PTR:
> +		return 1;
> +	case BTF_KIND_ENUM:
> +		return local_type->size == targ_type->size;
> +	case BTF_KIND_INT: {
> +		__u32 loc_int = *(__u32 *)(local_type + 1);
> +		__u32 targ_int = *(__u32 *)(targ_type + 1);
> +
> +		return BTF_INT_OFFSET(loc_int) == 0 &&
> +		       BTF_INT_OFFSET(targ_int) == 0 &&
> +		       local_type->size == targ_type->size &&
> +		       BTF_INT_BITS(loc_int) == BTF_INT_BITS(targ_int);
> +	}
> +	case BTF_KIND_ARRAY: {
> +		const struct btf_array *loc_a, *targ_a;
> +
> +		loc_a = (void *)(local_type + 1);
> +		targ_a = (void *)(targ_type + 1);
> +		local_id = loc_a->type;
> +		targ_id = targ_a->type;
> +		goto recur;
> +	}
> +	default:
> +		pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> +			   kind, local_id, targ_id);
> +		return 0;
> +	}
> +}
> +
> +/* 
> + * Given single high-level named field accessor in local type, find
> + * corresponding high-level accessor for a target type. Along the way,
> + * maintain low-level spec for target as well. Also keep updating target
> + * offset.
> + *
> + * Searching is performed through recursive exhaustive enumeration of all
> + * fields of a struct/union. If there are any anonymous (embedded)
> + * structs/unions, they are recursively searched as well. If field with
> + * desired name is found, check compatibility between local and target types,
> + * before returning result.
> + *
> + * 1 is returned, if field is found.
> + * 0 is returned if no compatible field is found.
> + * <0 is returned on error.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> +				 const struct bpf_core_accessor *local_acc,
> +				 const struct btf *targ_btf,
> +				 __u32 targ_id,
> +				 struct bpf_core_spec *spec,
> +				 __u32 *next_targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	const struct btf_member *local_member, *m;
> +	const char *local_name, *targ_name;
> +	__u32 local_id;
> +	int i, n, found;
> +
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!targ_type)
> +		return -EINVAL;
> +	if (!btf_is_composite(targ_type))
> +		return 0;
> +
> +	local_id = local_acc->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	local_member = (void *)(local_type + 1);
> +	local_member += local_acc->idx;
> +	local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> +	n = BTF_INFO_VLEN(targ_type->info);
> +	m = (void *)(targ_type + 1);
> +	for (i = 0; i < n; i++, m++) {
> +		__u32 offset;
> +
> +		/* bitfield relocations not supported */
> +		if (BTF_INFO_KFLAG(targ_type->info)) {
> +			if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +				continue;
> +			offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +		} else {
> +			offset = m->offset;
> +		}
> +		if (offset % 8)
> +			continue;
> +
> +		/* too deep struct/union/array nesting */
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +
> +		/* speculate this member will be the good one */
> +		spec->offset += offset / 8;
> +		spec->raw_spec[spec->raw_len++] = i;
> +
> +		targ_name = btf__name_by_offset(targ_btf, m->name_off);
> +		if (str_is_empty(targ_name)) {
> +			/* embedded struct/union, we need to go deeper */
> +			found = bpf_core_match_member(local_btf, local_acc,
> +						      targ_btf, m->type,
> +						      spec, next_targ_id);
> +			if (found) /* either found or error */
> +				return found;
> +		} else if (strcmp(local_name, targ_name) == 0) {
> +			/* matching named field */
> +			struct bpf_core_accessor *targ_acc;
> +
> +			targ_acc = &spec->spec[spec->len++];
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = i;
> +			targ_acc->name = targ_name;
> +
> +			*next_targ_id = m->type;
> +			found = bpf_core_fields_are_compat(local_btf,
> +							   local_member->type,
> +							   targ_btf, m->type);
> +			if (!found)
> +				spec->len--; /* pop accessor */
> +			return found;
> +		}
> +		/* member turned out not to be what we looked for */
> +		spec->offset -= offset / 8;
> +		spec->raw_len--;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Try to match local spec to a target type and, if successful, produce full
> + * target spec (high-level, low-level + offset).
> + */
> +static int bpf_core_spec_match(struct bpf_core_spec *local_spec,
> +			       const struct btf *targ_btf, __u32 targ_id,
> +			       struct bpf_core_spec *targ_spec)
> +{
> +	const struct btf_type *targ_type;
> +	const struct bpf_core_accessor *local_acc;
> +	struct bpf_core_accessor *targ_acc;
> +	int i, sz, matched;
> +
> +	memset(targ_spec, 0, sizeof(*targ_spec));
> +	targ_spec->btf = targ_btf;
> +
> +	local_acc = &local_spec->spec[0];
> +	targ_acc = &targ_spec->spec[0];
> +
> +	for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> +		targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> +						   &targ_id);
> +		if (!targ_type)
> +			return -EINVAL;
> +
> +		if (local_acc->name) {
> +			matched = bpf_core_match_member(local_spec->btf,
> +							local_acc,
> +							targ_btf, targ_id,
> +							targ_spec, &targ_id);
> +			if (matched <= 0)
> +				return matched;
> +		} else {
> +			/* for i=0, targ_id is already treated as array element
> +			 * type (because it's the original struct), for others
> +			 * we should find array element type first
> +			 */
> +			if (i > 0) {
> +				const struct btf_array *a;
> +
> +				if (!btf_is_array(targ_type))
> +					return 0;
> +
> +				a = (void *)(targ_type + 1);
> +				if (local_acc->idx >= a->nelems)
> +					return 0;
> +				if (!skip_mods_and_typedefs(targ_btf, a->type,
> +							    &targ_id))
> +					return -EINVAL;
> +			}
> +
> +			/* too deep struct/union/array nesting */
> +			if (targ_spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +				return -E2BIG;
> +
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = local_acc->idx;
> +			targ_acc->name = NULL;
> +			targ_spec->len++;
> +			targ_spec->raw_spec[targ_spec->raw_len] = targ_acc->idx;
> +			targ_spec->raw_len++;
> +
> +			sz = btf__resolve_size(targ_btf, targ_id);
> +			if (sz < 0)
> +				return sz;
> +			targ_spec->offset += local_acc->idx * sz;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * Patch relocatable BPF instruction.
> + * Expected insn->imm value is provided for validation, as well as the new
> + * relocated value.
> + *
> + * Currently three kinds of BPF instructions are supported:
> + * 1. rX = <imm> (assignment with immediate operand);
> + * 2. rX += <imm> (arithmetic operations with immediate operand);
> + * 3. *(rX) = <imm> (indirect memory assignment with immediate operand).
> + *
> + * If actual insn->imm value is wrong, bail out.
> + */
> +static int bpf_core_reloc_insn(struct bpf_program *prog, int insn_off,
> +			       __u32 orig_off, __u32 new_off)
> +{
> +	struct bpf_insn *insn;
> +	int insn_idx;
> +	__u8 class;
> +
> +	if (insn_off % sizeof(struct bpf_insn))
> +		return -EINVAL;
> +	insn_idx = insn_off / sizeof(struct bpf_insn);
> +
> +	insn = &prog->insns[insn_idx];
> +	class = BPF_CLASS(insn->code);
> +
> +	if (class == BPF_ALU || class == BPF_ALU64) {
> +		if (BPF_SRC(insn->code) != BPF_K)
> +			return -EINVAL;
> +		if (insn->imm != orig_off)
> +			return -EINVAL;
> +		insn->imm = new_off;
> +		pr_debug("prog '%s': patched insn #%d (ALU/ALU64) imm %d -> %d\n",
> +			 bpf_program__title(prog, false),
> +			 insn_idx, orig_off, new_off);
> +	} else {
> +		pr_warning("prog '%s': trying to relocate unrecognized insn #%d, code:%x, src:%x, dst:%x, off:%x, imm:%x\n",
> +			   bpf_program__title(prog, false),
> +			   insn_idx, insn->code, insn->src_reg, insn->dst_reg,
> +			   insn->off, insn->imm);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> + * data out of it to use for target BTF.
> + */
> +static struct btf *bpf_core_find_kernel_btf(void)
> +{
> +	const char *locations[] = {
> +		"/lib/modules/%1$s/vmlinux-%1$s",
> +		"/usr/lib/modules/%1$s/kernel/vmlinux",
> +	};
> +	char path[PATH_MAX + 1];
> +	struct utsname buf;
> +	struct btf *btf;
> +	int i, err;
> +
> +	err = uname(&buf);
> +	if (err) {
> +		pr_warning("failed to uname(): %d\n", err);
> +		return ERR_PTR(err);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(locations); i++) {
> +		snprintf(path, PATH_MAX, locations[i], buf.release);
> +		pr_debug("attempting to load kernel BTF from '%s'\n", path);
> +
> +		if (access(path, R_OK))
> +			continue;
> +
> +		btf = btf__parse_elf(path, NULL);
> +		if (IS_ERR(btf))
> +			continue;
> +
> +		pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> +		return btf;
> +	}
> +
> +	pr_warning("failed to find valid kernel BTF\n");
> +	return ERR_PTR(-ESRCH);
> +}
> +
> +/* Output spec definition in the format:
> + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> + */
> +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> +{
> +	const struct btf_type *t;
> +	const char *s;
> +	__u32 type_id;
> +	int i;
> +
> +	type_id = spec->spec[0].type_id;
> +	t = btf__type_by_id(spec->btf, type_id);
> +	s = btf__name_by_offset(spec->btf, t->name_off);
> +	libbpf_print(level, "[%u] (%s) + ", type_id, s);
> +
> +	for (i = 0; i < spec->raw_len; i++)
> +		libbpf_print(level, "%d%s", spec->raw_spec[i],
> +			     i == spec->raw_len - 1 ? " => " : ":");
> +
> +	libbpf_print(level, "%u @ &x", spec->offset);
> +
> +	for (i = 0; i < spec->len; i++) {
> +		if (spec->spec[i].name)
> +			libbpf_print(level, ".%s", spec->spec[i].name);
> +		else
> +			libbpf_print(level, "[%u]", spec->spec[i].idx);
> +	}
> +
> +}
> +
> +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> +{
> +	return (size_t)key;
> +}
> +
> +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> +{
> +	return k1 == k2;
> +}
> +
> +static void *u32_to_ptr(__u32 x)
> +{
> +	return (void *)(uintptr_t)x;
> +}
> +
> +/* 
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + *    Candidate type is a type with the same "essential" name, ignoring
> + *    everything after last triple underscore (___). E.g., `sample`,
> + *    `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + *    for each other. Names with triple underscore are referred to as
> + *    "flavors" and are useful, among other things, to allow to
> + *    specify/support incompatible variations of the same kernel struct, which
> + *    might differ between different kernel versions and/or build
> + *    configurations.
> + *
> + *    N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> + *    converter, when deduplicated BTF of a kernel still contains more than
> + *    one different types with the same name. In that case, ___2, ___3, etc
> + *    are appended starting from second name conflict. But start flavors are
> + *    also useful to be defined "locally", in BPF program, to extract same
> + *    data from incompatible changes between different kernel
> + *    versions/configurations. For instance, to handle field renames between
> + *    kernel versions, one can use two flavors of the struct name with the
> + *    same common name and use conditional relocations to extract that field,
> + *    depending on target kernel version.
> + * 2. For each candidate type, try to match local specification to this
> + *    candidate target type. Matching involves finding corresponding
> + *    high-level spec accessors, meaning that all named fields should match,
> + *    as well as all array accesses should be within the actual bounds. Also,
> + *    types should be compatible (see bpf_core_fields_are_compat for details).
> + * 3. It is supported and expected that there might be multiple flavors
> + *    matching the spec. As long as all the specs resolve to the same set of
> + *    offsets across all candidates, there is not error. If there is any
> + *    ambiguity, CO-RE relocation will fail. This is necessary to accomodate
> + *    imprefection of BTF deduplication, which can cause slight duplication of
> + *    the same BTF type, if some directly or indirectly referenced (by
> + *    pointer) type gets resolved to different actual types in different
> + *    object files. If such situation occurs, deduplicated BTF will end up
> + *    with two (or more) structurally identical types, which differ only in
> + *    types they refer to through pointer. This should be OK in most cases and
> + *    is not an error.
> + * 4. Candidate types search is performed by linearly scanning through all
> + *    types in target BTF. It is anticipated that this is overall more
> + *    efficient memory-wise and not significantly worse (if not better)
> + *    CPU-wise compared to prebuilding a map from all local type names to
> + *    a list of candidate type names. It's also sped up by caching resolved
> + *    list of matching candidates per each local "root" type ID, that has at
> + *    least one bpf_offset_reloc associated with it. This list is shared
> + *    between multiple relocations for the same type ID and is updated as some
> + *    of the candidates are pruned due to structural incompatibility.
> + */
> +static int bpf_core_reloc_offset(struct bpf_program *prog,
> +				 const struct bpf_offset_reloc *relo,
> +				 int relo_idx,
> +				 const struct btf *local_btf,
> +				 const struct btf *targ_btf,
> +				 struct hashmap *cand_cache)
> +{
> +	const char *prog_name = bpf_program__title(prog, false);
> +	struct bpf_core_spec local_spec, cand_spec, targ_spec;
> +	const void *type_key = u32_to_ptr(relo->type_id);
> +	const struct btf_type *local_type, *cand_type;
> +	const char *local_name, *cand_name;
> +	struct ids_vec *cand_ids;
> +	__u32 local_id, cand_id;
> +	const char *spec_str;
> +	int i, j, err;
> +
> +	local_id = relo->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	if (!local_type)
> +		return -EINVAL;
> +
> +	local_name = btf__name_by_offset(local_btf, local_type->name_off);
> +	if (str_is_empty(local_name))
> +		return -EINVAL;
> +
> +	spec_str = btf__name_by_offset(local_btf, relo->access_str_off);
> +	if (str_is_empty(spec_str))
> +		return -EINVAL;
> +
> +	err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: parsing [%d] (%s) + %s failed: %d\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str,
> +			   err);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("prog '%s': relo #%d: spec is ", prog_name, relo_idx);
> +	bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +	libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +	if (!hashmap__find(cand_cache, type_key, (void **)&cand_ids)) {
> +		cand_ids = bpf_core_find_cands(local_btf, local_id, targ_btf);
> +		if (IS_ERR(cand_ids)) {
> +			pr_warning("prog '%s': relo #%d: target candidate search failed for [%d] (%s): %ld",
> +				   prog_name, relo_idx, local_id, local_name,
> +				   PTR_ERR(cand_ids));
> +			return PTR_ERR(cand_ids);
> +		}
> +		err = hashmap__set(cand_cache, type_key, cand_ids, NULL, NULL);
> +		if (err) {
> +			bpf_core_free_cands(cand_ids);
> +			return err;
> +		}
> +	}
> +
> +	for (i = 0, j = 0; i < cand_ids->len; i++) {
> +		cand_id = cand_ids->data[i];
> +		cand_type = btf__type_by_id(targ_btf, cand_id);
> +		cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> +		err = bpf_core_spec_match(&local_spec, targ_btf,
> +					  cand_id, &cand_spec);
> +		if (err < 0) {
> +			pr_warning("prog '%s': relo #%d: failed to match spec ",
> +				   prog_name, relo_idx);
> +			bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " to candidate #%d [%d] (%s): %d\n",
> +				     i, cand_id, cand_name, err);
> +			return err;
> +		}
> +		if (err == 0) {
> +			pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> +				 prog_name, relo_idx, i, cand_id, cand_name);
> +			bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +			libbpf_print(LIBBPF_DEBUG, "\n");
> +			continue;
> +		}
> +
> +		pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> +			 prog_name, relo_idx, i);
> +		bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec);
> +		libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +		if (j == 0) {
> +			targ_spec = cand_spec;
> +		} else if (cand_spec.offset != targ_spec.offset) {
> +			/* if there are many candidates, they should all
> +			 * resolve to the same offset
> +			 */
> +			pr_warning("prog '%s': relo #%d: candidate #%d spec ",
> +				   prog_name, relo_idx, i);
> +			bpf_core_dump_spec(LIBBPF_WARN, &cand_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " conflicts with target spec ");
> +			bpf_core_dump_spec(LIBBPF_WARN, &targ_spec);
> +			libbpf_print(LIBBPF_WARN, "\n");
> +			return -EINVAL;
> +		}
> +
> +		cand_ids->data[j++] = cand_spec.spec[0].type_id;
> +	}
> +
> +	cand_ids->len = j;
> +	if (cand_ids->len == 0) {
> +		pr_warning("prog '%s': relo #%d: no matching targets found for [%d] (%s) + %s\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str);
> +		return -ESRCH;
> +	}
> +
> +	err = bpf_core_reloc_insn(prog, relo->insn_off,
> +				  local_spec.offset, targ_spec.offset);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: failed to patch insn at offset %d: %d\n",
> +			   prog_name, relo_idx, relo->insn_off, err);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +bpf_core_reloc_offsets(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	const struct btf_ext_info_sec *sec;
> +	const struct bpf_offset_reloc *rec;
> +	const struct btf_ext_info *seg;
> +	struct hashmap_entry *entry;
> +	struct hashmap *cand_cache = NULL;
> +	struct bpf_program *prog;
> +	struct btf *targ_btf;
> +	const char *sec_name;
> +	int i, err = 0;
> +
> +	if (targ_btf_path)
> +		targ_btf = btf__parse_elf(targ_btf_path, NULL);
> +	else
> +		targ_btf = bpf_core_find_kernel_btf();
> +	if (IS_ERR(targ_btf)) {
> +		pr_warning("failed to get target BTF: %ld\n",
> +			   PTR_ERR(targ_btf));
> +		return PTR_ERR(targ_btf);
> +	}
> +
> +	cand_cache = hashmap__new(bpf_core_hash_fn, bpf_core_equal_fn, NULL);
> +	if (IS_ERR(cand_cache)) {
> +		err = PTR_ERR(cand_cache);
> +		goto out;
> +	}
> +
> +	seg = &obj->btf_ext->offset_reloc_info;
> +	for_each_btf_ext_sec(seg, sec) {
> +		sec_name = btf__name_by_offset(obj->btf, sec->sec_name_off);
> +		if (str_is_empty(sec_name)) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +		prog = bpf_object__find_program_by_title(obj, sec_name);
> +		if (!prog) {
> +			pr_warning("failed to find program '%s' for CO-RE offset relocation\n",
> +				   sec_name);
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		pr_debug("prog '%s': performing %d CO-RE offset relocs\n",
> +			 sec_name, sec->num_info);
> +
> +		for_each_btf_ext_rec(seg, sec, i, rec) {
> +			err = bpf_core_reloc_offset(prog, rec, i, obj->btf,
> +						    targ_btf, cand_cache);
> +			if (err) {
> +				pr_warning("prog '%s': relo #%d: failed to relocate: %d\n",
> +					   sec_name, i, err);
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	btf__free(targ_btf);
> +	if (!IS_ERR_OR_NULL(cand_cache)) {
> +		hashmap__for_each_entry(cand_cache, entry, i) {
> +			bpf_core_free_cands(entry->value);
> +		}
> +		hashmap__free(cand_cache);
> +	}
> +	return err;
> +}
> +
> +static int
> +bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	int err = 0;
> +
> +	if (obj->btf_ext->offset_reloc_info.len)
> +		err = bpf_core_reloc_offsets(obj, targ_btf_path);
> +
> +	return err;
> +}
> +
> static int
> bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj,
> 			struct reloc_desc *relo)
> @@ -2399,14 +3293,21 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
> 	return 0;
> }
> 
> -
> static int
> -bpf_object__relocate(struct bpf_object *obj)
> +bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path)
> {
> 	struct bpf_program *prog;
> 	size_t i;
> 	int err;
> 
> +	if (obj->btf_ext) {
> +		err = bpf_object__relocate_core(obj, targ_btf_path);
> +		if (err) {
> +			pr_warning("failed to perform CO-RE relocations: %d\n",
> +				   err);
> +			return err;
> +		}
> +	}
> 	for (i = 0; i < obj->nr_programs; i++) {
> 		prog = &obj->programs[i];
> 
> @@ -2807,7 +3708,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
> 	obj->loaded = true;
> 
> 	CHECK_ERR(bpf_object__create_maps(obj), err, out);
> -	CHECK_ERR(bpf_object__relocate(obj), err, out);
> +	CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
> 	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
> 
> 	return 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 8a9d462a6f6d..e8f70977d137 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -92,6 +92,7 @@ LIBBPF_API void bpf_object__close(struct bpf_object *object);
> struct bpf_object_load_attr {
> 	struct bpf_object *obj;
> 	int log_level;
> +	const char *target_btf_path;
> };
> 
> /* Load/unload object into/from kernel */
> -- 
> 2.17.1
> 


^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Alexei Starovoitov @ 2019-07-31  0:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, netdev
In-Reply-To: <20190730170725.279761e7@cakuba.netronome.com>

On Tue, Jul 30, 2019 at 05:07:25PM -0700, Jakub Kicinski wrote:
> On Tue, 30 Jul 2019 16:17:56 -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> > > On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:  
> > > > Currently, bpftool net only supports dumping progs loaded on the
> > > > interface. To load XDP prog on interface, user must use other tool
> > > > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > > > (un)load XDP prog on interface.  
> > > 
> > > I don't understand why using another tool is a bad thing :(
> > > What happened to the Unix philosophy?
> > > 
> > > I remain opposed to duplicating iproute2's functionality under 
> > > bpftool net :( The way to attach bpf programs in the networking
> > > subsystem is through the iproute2 commends - ip and tc.. 
> > > 
> > > It seems easy enough to add a feature to bpftool but from 
> > > a perspective of someone adding a new feature to the kernel, 
> > > and wanting to update user space components it's quite painful :(
> > > 
> > > So could you describe to me in more detail why this is a good idea?
> > > Perhaps others can chime in?  
> > 
> > I don't think it has anything to do with 'unix philosophy'.
> > Here the proposal to teach bpftool to attach xdp progs.
> > I see nothing wrong with that.
> 
> Nothing meaning you disagree it's duplicated effort and unnecessary 
> LoC the community has to maintain, review, test..?

I don't see duplicated effort.

> > Another reason is iproute2 is still far away from adopting libbpf.
> > So all the latest goodness like BTF, introspection, etc will not
> > be available to iproute2 users for some time.
> 
> Duplicating the same features in bpftool will only diminish the
> incentive for moving iproute2 to libbpf. 

not at all. why do you think so?

> And for folks who deal
> with a wide variety of customers, often novices maintaining two
> ways of doing the same thing is a hassle :(

It's not the same thing.
I'm talking about my experience dealing with 'wide variety of bpf customers'.
They only have a fraction of their time to learn one tool.
Making every bpf customer learn ten tools is not an option.

> > Even when iproute2 is ready it would be convenient for folks like me
> > (who need to debug stuff in production) to remember cmd line of
> > bpftool only to introspect the server. Debugging often includes
> > detaching/attaching progs. Not only doing 'bpftool p s'.
> 
> Let's just put the two commands next to each other:
> 
>        ip link set xdp $PROG dev $DEV
> 
> bpftool net attach xdp $PROG dev $DEV
> 
> Are they that different?

yes.
they're different tools with they own upgrade/rollout cycle

> 
> > If bpftool was taught to do equivalent of 'ip link' that would be
> > very different story and I would be opposed to that.
> 
> Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
> of a judgement call.

bpftool must be able to introspect every aspect of bpf programming.
That includes detaching and attaching anywhere.
Anyone doing 'bpftool p s' should be able to switch off particular
prog id without learning ten different other tools.
iproute2 is a small bit of it. There is cgroup and tracing too.
bpftool should be one tool to do everything directly related to bpf.


^ permalink raw reply

* Re: [PATCH nf] netfilter: nf_tables: map basechain priority to hardware priority
From: Jakub Kicinski @ 2019-07-31  0:21 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, wenxu, jiri, marcelo.leitner, saeedm, gerlitz.or,
	paulb, netdev
In-Reply-To: <20190730105417.14538-1-pablo@netfilter.org>

On Tue, 30 Jul 2019 12:54:17 +0200, Pablo Neira Ayuso wrote:
> This patch maps basechain netfilter priorities from -8192 to 8191 to
> hardware priority 0xC000 + 1. tcf_auto_prio() uses 0xC000 if the user
> specifies no priority, then it subtract 1 for each new tcf_proto object.
> This patch uses the hardware priority range from 0xC000 to 0xFFFF for
> netfilter.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> This follows a rather conservative approach, I could just expose the
> 2^16 hardware priority range, but we may need to split this priority
> range among the ethtool_rx, tc and netfilter subsystems to start with
> and it should be possible to extend the priority range later on.
> 
> By netfilter priority, I'm refering to the basechain priority:
> 
> 	add chain x y { type filter hook ingress device eth0 priority 0; }
>                                                              ^^^^^^^^^^^
> 
> This is no transparently mapped to hardware, this patch shifts it to
> make it fit into the 0xC000 + 1 .. 0xFFFF hardware priority range.

Mmm.. so the ordering of tables is intended to be decided by priority
and not block type (nft, tc, ethtool)?  I was always expecting we 
would just follow the software order when it comes to inter-subsystem
decisions.  So ethtool first, then XDP, then TC, then nft, then
bridging etc. TC vs NFT based on:

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
...
	if (static_branch_unlikely(&ingress_needed_key)) {
		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
		if (!skb)
			goto out;

		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
			goto out;
	}

Are they solid use cases for choosing the ordering arbitrarily?

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31  0:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzZpcP1aBwrz8DbToQ=nVUukPwiG-PBCFGZNb2wXg_msnA@mail.gmail.com>



> On Jul 30, 2019, at 4:55 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Jul 30, 2019 at 4:44 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>> 
>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>> Every instruction that needs to be relocated has corresponding
>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>> to match recorded "local" relocation spec against potentially many
>>> compatible "target" types, creating corresponding spec. Details of the
>>> algorithm are noted in corresponding comments in the code.
>>> 
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
>>> tools/lib/bpf/libbpf.h |   1 +
>>> 2 files changed, 909 insertions(+), 7 deletions(-)
>>> 
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index ead915aec349..75da90928257 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -38,6 +38,7 @@
> 
> [...]
> 
>>> 
>>> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> -                                                  __u32 id)
>>> +static const struct btf_type *
>>> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
>>> {
>>>      const struct btf_type *t = btf__type_by_id(btf, id);
>>> 
>>> +     if (res_id)
>>> +             *res_id = id;
>>> +
>>>      while (true) {
>>>              switch (BTF_INFO_KIND(t->info)) {
>>>              case BTF_KIND_VOLATILE:
>>>              case BTF_KIND_CONST:
>>>              case BTF_KIND_RESTRICT:
>>>              case BTF_KIND_TYPEDEF:
>>> +                     if (res_id)
>>> +                             *res_id = t->type;
>>>                      t = btf__type_by_id(btf, t->type);
>> 
>> So btf->types[*res_id] == retval, right? Then with retval and btf, we can
>> calculate *res_id without this change?
> 
> Unless I'm missing something very clever here, no. btf->types is array
> of pointers (it's an index into a variable-sized types). This function
> returns `struct btf_type *`, which is one of the **values** stored in
> that array. You are claiming that by having value of one of array
> elements you can easily find element's index? If it was possible to do
> in O(1), we wouldn't have so many algorithms and data structures for
> search and indexing. You can do that only with linear search, not some
> clever pointer arithmetic or at least binary search. So I'm not sure
> what you are proposing here...

oops.. Clearly, I made some silly mistake. Sorry for the noise. 

Song

> 
> The way BTF is defined, struct btf_type doesn't know its own type ID,
> which is often inconvenient and requires to keep track of that ID, if
> it's necessary, but that's how it is.
> 
> But then again, what are we trying to achieve here? Eliminate
> returning id and pointer? I could always return id and easily look up
> pointer, but having both is super convenient and makes code simpler
> and shorter, so I'd like to keep it.
> 
>> 
>>>                      break;
>>>              default:
>>> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> static bool get_map_field_int(const char *map_name, const struct btf *btf,
>>>                            const struct btf_type *def,
>>>                            const struct btf_member *m, __u32 *res) {
> 
> [...]


^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-31  0:15 UTC (permalink / raw)
  To: Andrew Lunn, Vladimir Oltean
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190730131719.GA28552@lunn.ch>

On 7/30/19 6:17 AM, Andrew Lunn wrote:
>> Again, I don't think Linux has generic support for overwriting (or
>> even describing) the operating mode of a PHY, although maybe that's a
>> direction we would want to push the discussion towards. RGMII to
>> copper, RGMII to fiber, SGMII to copper, copper to fiber (media
>> converter), even RGMII to SGMII (RTL8211FS supports this) - lots of
>> modes, and this is only for gigabit PHYs...
> 
> This is something Russell King has PHYLINK patches for, which have not
> yet been merged. There are some boards which use a PHY as a media
> converter, placed between the MAC and an SFP.

Thank you Andrew. Let me see if the operating mode can be auto-detected on BCM54616S chip, and I will update back soon.

Thanks,

Tao

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Tao Ren @ 2019-07-31  0:12 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, Arun Parameswaran,
	Justin Chen, Vladimir Oltean, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <bdfe07d3-66b4-061a-a149-aa2aef94b9b7@gmail.com>

On 7/29/19 11:00 PM, Heiner Kallweit wrote:
> On 30.07.2019 07:05, Tao Ren wrote:
>> On 7/29/19 8:35 PM, Andrew Lunn wrote:
>>> On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
>>>> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
>>>> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
>>>> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
>>>> mode (different sets of MII Control/Status registers being used), let's
>>>> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.
>>>
>>> Hi Tao
>>>
>>> What exactly does it get wrong?
>>>
>>>      Thanks
>>> 	Andrew
>>
>> Hi Andrew,
>>
>> BCM54616S is set to RGMII-Fiber (1000Base-X) mode on my platform, and none of the features (1000BaseT/100BaseT/10BaseT) can be detected by genphy_read_abilities(), because the PHY only reports 1000BaseX_Full|Half ability in this mode.
>>
> Are you going to use the PHY in copper or fibre mode?
> In case you use fibre mode, why do you need the copper modes set as supported?
> Or does the PHY just start in fibre mode and you want to switch it to copper mode?

Hi Heiner,

The phy starts in fiber mode and that's the mode I want.
My observation is: phydev->link is always 0 (Link status bit is never set in MII_BMSR) by using dynamic ability detection on my machine. I checked phydev->supported and it's set to "AutoNeg | TP | MII | Pause | Asym_Pause" by dynamic ability detection. Is it normal/expected? Or maybe the fix should go to different places? Thank you for your help.


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-31  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, netdev
In-Reply-To: <20190730231754.efh3fj4mnsbv445l@ast-mbp>

On Tue, 30 Jul 2019 16:17:56 -0700, Alexei Starovoitov wrote:
> On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> > On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:  
> > > Currently, bpftool net only supports dumping progs loaded on the
> > > interface. To load XDP prog on interface, user must use other tool
> > > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > > (un)load XDP prog on interface.  
> > 
> > I don't understand why using another tool is a bad thing :(
> > What happened to the Unix philosophy?
> > 
> > I remain opposed to duplicating iproute2's functionality under 
> > bpftool net :( The way to attach bpf programs in the networking
> > subsystem is through the iproute2 commends - ip and tc.. 
> > 
> > It seems easy enough to add a feature to bpftool but from 
> > a perspective of someone adding a new feature to the kernel, 
> > and wanting to update user space components it's quite painful :(
> > 
> > So could you describe to me in more detail why this is a good idea?
> > Perhaps others can chime in?  
> 
> I don't think it has anything to do with 'unix philosophy'.
> Here the proposal to teach bpftool to attach xdp progs.
> I see nothing wrong with that.

Nothing meaning you disagree it's duplicated effort and unnecessary 
LoC the community has to maintain, review, test..?

> Another reason is iproute2 is still far away from adopting libbpf.
> So all the latest goodness like BTF, introspection, etc will not
> be available to iproute2 users for some time.

Duplicating the same features in bpftool will only diminish the
incentive for moving iproute2 to libbpf. And for folks who deal
with a wide variety of customers, often novices maintaining two
ways of doing the same thing is a hassle :(

> Even when iproute2 is ready it would be convenient for folks like me
> (who need to debug stuff in production) to remember cmd line of
> bpftool only to introspect the server. Debugging often includes
> detaching/attaching progs. Not only doing 'bpftool p s'.

Let's just put the two commands next to each other:

       ip link set xdp $PROG dev $DEV

bpftool net attach xdp $PROG dev $DEV

Are they that different?

> If bpftool was taught to do equivalent of 'ip link' that would be
> very different story and I would be opposed to that.

Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
of a judgement call.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-07-30 23:55 UTC (permalink / raw)
  To: Song Liu
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <9C1CFF6F-F661-46F4-B6EB-B42D7F4204F0@fb.com>

On Tue, Jul 30, 2019 at 4:44 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> > tools/lib/bpf/libbpf.h |   1 +
> > 2 files changed, 909 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index ead915aec349..75da90928257 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -38,6 +38,7 @@

[...]

> >
> > -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> > -                                                  __u32 id)
> > +static const struct btf_type *
> > +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> > {
> >       const struct btf_type *t = btf__type_by_id(btf, id);
> >
> > +     if (res_id)
> > +             *res_id = id;
> > +
> >       while (true) {
> >               switch (BTF_INFO_KIND(t->info)) {
> >               case BTF_KIND_VOLATILE:
> >               case BTF_KIND_CONST:
> >               case BTF_KIND_RESTRICT:
> >               case BTF_KIND_TYPEDEF:
> > +                     if (res_id)
> > +                             *res_id = t->type;
> >                       t = btf__type_by_id(btf, t->type);
>
> So btf->types[*res_id] == retval, right? Then with retval and btf, we can
> calculate *res_id without this change?

Unless I'm missing something very clever here, no. btf->types is array
of pointers (it's an index into a variable-sized types). This function
returns `struct btf_type *`, which is one of the **values** stored in
that array. You are claiming that by having value of one of array
elements you can easily find element's index? If it was possible to do
in O(1), we wouldn't have so many algorithms and data structures for
search and indexing. You can do that only with linear search, not some
clever pointer arithmetic or at least binary search. So I'm not sure
what you are proposing here...

The way BTF is defined, struct btf_type doesn't know its own type ID,
which is often inconvenient and requires to keep track of that ID, if
it's necessary, but that's how it is.

But then again, what are we trying to achieve here? Eliminate
returning id and pointer? I could always return id and easily look up
pointer, but having both is super convenient and makes code simpler
and shorter, so I'd like to keep it.

>
> >                       break;
> >               default:
> > @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> > static bool get_map_field_int(const char *map_name, const struct btf *btf,
> >                             const struct btf_type *def,
> >                             const struct btf_member *m, __u32 *res) {

[...]

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-30 23:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <CA+h21ho1KOGS3WsNBHzfHkpSyE4k5HTE1tV9wUtnkZhjUZGeUw@mail.gmail.com>

On 7/30/19 3:15 AM, Vladimir Oltean wrote:
> On Tue, 30 Jul 2019 at 07:52, Tao Ren <taoren@fb.com> wrote:
>>
>> On 7/29/19 6:32 PM, Vladimir Oltean wrote:
>>> Hi Tao,
>>>
>>> On Tue, 30 Jul 2019 at 03:31, Tao Ren <taoren@fb.com> wrote:
>>>>
>>>> Configure the BCM54616S for 1000Base-X mode when "brcm-phy-mode-1000bx"
>>>> is set in device tree. This is needed when the PHY is used for fiber and
>>>> backplane connections.
>>>>
>>>> The patch is inspired by commit cd9af3dac6d1 ("PHYLIB: Add 1000Base-X
>>>> support for Broadcom bcm5482").
>>>
>>> As far as I can see, for the commit you referenced,
>>> PHY_BCM_FLAGS_MODE_1000BX is referenced from nowhere in the entire
>>> mainline kernel:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_latest_ident_PHY-5FBCM-5FFLAGS-5FMODE-5F1000BX&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=iYElT7HC77pRZ3byVvW8ng&m=gy6Y-3Ylme-_GQcGF4fvOX10irgAT4xh253Weo0np38&s=KL__E2bvsmvUL-hBL9hUmOS5vyPQ92EMj6fEfByn8t8&e=
>>> (it is supposed to be put by the MAC driver in phydev->dev_flags prior
>>> to calling phy_connect). But I don't see the point to this - can't you
>>> check for phydev->interface == PHY_INTERFACE_MODE_1000BASEX?
>>> This has the advantage that no MAC driver will need to know that it's
>>> talking to a Broadcom PHY. Additionally, no custom DT bindings are
>>> needed.
>>> Also, for backplane connections you probably want 1000Base-KX which
>>> has its own AN/LT, not plain 1000Base-X.
>>
>> Thank you Vladimir for the quick review!
>> Perhaps I misunderstood the purpose of phydev->interface, and I thought it was usually used to defined the interface between MAC and PHY. For example, if I need to pass both "rgmii-id" and "1000base-x" from MAC to PHY driver, what would be the preferred way?
>>
> 
> Ohhhhhh, now I understand what you're trying to do, sorry, somehow I
> was too tired and I thought of something totally unrelated.

Thank you for spending time on the patch, and I really appreciate it.

> Let me see if I can explain: you've got the INTF_SEL pin strapping
> configured for something else (like RGMII to copper mode) and then
> you're changing the operating mode at runtime through MDIO? Is this
> intended to be for production code, or is it just some quick hack to
> fix a bad board design?

The INTF_SEL pins report correct mode (RGMII-Fiber) on my machine, but there are 2 "sub-modes" (1000Base-X and 100Base-FX) and I couldn't find a proper/safe way to auto-detect which "sub-mode" is active. The datasheet just describes instructions to enable a specific mode, but it doesn't say 1000Base-X/100Base-FX mode will be auto-selected. And that's why I came up with the patch to specify 1000Base-X mode.

> I think what's supposed to happen (Heiner can comment) is that
> genphy_config_init will automatically read the out-of-reset PHY
> registers and figure out which link modes are supported. This includes
> the 1000Base-X media type, *if* the PHY is strapped correctly.
> But you are changing the strapping configuration too late (again, in
> .config_init), so phylib doesn't pick up the new Base-X modes. What
> happens if you do the switchover from the .probe callback of the
> driver, instead of .config_init?

I checked the 1000base-x mode control bit and it's already set on my machine when genphy_config_init is executed, means I'm writing the same value to the register.
I write the register explicitly because I'm suspecting the mode control bit is set by my boot loader and the value is not changed by software reset.
Let me see if I can disable net/phy in boot loader and see what happens. If the bit is still on, maybe we can use the bit to auto-detect sub-mode (although the datasheet doesn't mention it)?
Anyways, let me move the logic to .probe callback. Thank you.


> I think what got me confused was your "add support for 1000Base-X"
> commit message. If I understand correctly, you're not adding support,
> you're just forcing it.
> Again, I don't think Linux has generic support for overwriting (or
> even describing) the operating mode of a PHY, although maybe that's a
> direction we would want to push the discussion towards. RGMII to
> copper, RGMII to fiber, SGMII to copper, copper to fiber (media
> converter), even RGMII to SGMII (RTL8211FS supports this) - lots of
> modes, and this is only for gigabit PHYs...
> 
>>>> Signed-off-by: Tao Ren <taoren@fb.com>
>>>> ---
>>>>  drivers/net/phy/broadcom.c | 58 +++++++++++++++++++++++++++++++++++---
>>>>  include/linux/brcmphy.h    |  4 +--
>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
>>>> index 2b4e41a9d35a..6c22ac3a844b 100644
>>>> --- a/drivers/net/phy/broadcom.c
>>>> +++ b/drivers/net/phy/broadcom.c
>>>> @@ -383,9 +383,9 @@ static int bcm5482_config_init(struct phy_device *phydev)
>>>>                 /*
>>>>                  * Select 1000BASE-X register set (primary SerDes)
>>>>                  */
>>>> -               reg = bcm_phy_read_shadow(phydev, BCM5482_SHD_MODE);
>>>> -               bcm_phy_write_shadow(phydev, BCM5482_SHD_MODE,
>>>> -                                    reg | BCM5482_SHD_MODE_1000BX);
>>>> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>>>> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>>>> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>>>>
>>>>                 /*
>>>>                  * LED1=ACTIVITYLED, LED3=LINKSPD[2]
>>>> @@ -451,6 +451,34 @@ static int bcm5481_config_aneg(struct phy_device *phydev)
>>>>         return ret;
>>>>  }
>>>>
>>>> +static int bcm54616s_config_init(struct phy_device *phydev)
>>>> +{
>>>> +       int err, reg;
>>>> +       struct device_node *np = phydev->mdio.dev.of_node;
>>>> +
>>>> +       err = bcm54xx_config_init(phydev);
>>>> +
>>>> +       if (of_property_read_bool(np, "brcm-phy-mode-1000bx")) {
>>>> +               /* Select 1000BASE-X register set. */
>>>> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>>>> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>>>> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>>>> +
>>>> +               /* Auto-negotiation doesn't seem to work quite right
>>>> +                * in this mode, so we disable it and force it to the
>>>> +                * right speed/duplex setting.  Only 'link status'
>>>> +                * is important.
>>>> +                */
>>>> +               phydev->autoneg = AUTONEG_DISABLE;
>>>> +               phydev->speed = SPEED_1000;
>>>> +               phydev->duplex = DUPLEX_FULL;
>>>> +
>>>
>>> 1000Base-X AN does not include speed negotiation, so hardcoding
>>> SPEED_1000 is probably correct.
>>> What is wrong with the AN of duplex settings?
>>
>> FULL_DUPLEX bit is set on my platform by default. Let me enable AN and test it out; will share you results tomorrow.

Duplex setting is correct when AN is enabled. So I will just hardcode speed settings.

>>>> +               phydev->dev_flags |= PHY_BCM_FLAGS_MODE_1000BX;
>>>> +       }
>>>> +
>>>> +       return err;
>>>> +}
>>>> +
>>>>  static int bcm54616s_config_aneg(struct phy_device *phydev)
>>>>  {
>>>>         int ret;
>>>> @@ -464,6 +492,27 @@ static int bcm54616s_config_aneg(struct phy_device *phydev)
>>>>         return ret;
>>>>  }
>>>>
>>>> +static int bcm54616s_read_status(struct phy_device *phydev)
>>>> +{
>>>> +       int ret;
>>>> +
>>>> +       ret = genphy_read_status(phydev);
>>>> +       if (ret < 0)
>>>> +               return ret;
>>>> +
>>>> +       if (phydev->dev_flags & PHY_BCM_FLAGS_MODE_1000BX) {
>>>> +               /* Only link status matters for 1000Base-X mode, so force
>>>> +                * 1000 Mbit/s full-duplex status.
>>>> +                */
>>>> +               if (phydev->link) {
>>>> +                       phydev->speed = SPEED_1000;
>>>> +                       phydev->duplex = DUPLEX_FULL;
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static int brcm_phy_setbits(struct phy_device *phydev, int reg, int set)
>>>>  {
>>>>         int val;
>>>> @@ -651,8 +700,9 @@ static struct phy_driver broadcom_drivers[] = {
>>>>         .phy_id_mask    = 0xfffffff0,
>>>>         .name           = "Broadcom BCM54616S",
>>>>         .features       = PHY_GBIT_FEATURES,
>>>> -       .config_init    = bcm54xx_config_init,
>>>> +       .config_init    = bcm54616s_config_init,
>>>>         .config_aneg    = bcm54616s_config_aneg,
>>>> +       .read_status    = bcm54616s_read_status,
>>>>         .ack_interrupt  = bcm_phy_ack_intr,
>>>>         .config_intr    = bcm_phy_config_intr,
>>>>  }, {
>>>> diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
>>>> index 6db2d9a6e503..82030155558c 100644
>>>> --- a/include/linux/brcmphy.h
>>>> +++ b/include/linux/brcmphy.h
>>>> @@ -200,8 +200,8 @@
>>>>  #define BCM5482_SHD_SSD                0x14    /* 10100: Secondary SerDes control */
>>>>  #define BCM5482_SHD_SSD_LEDM   0x0008  /* SSD LED Mode enable */
>>>>  #define BCM5482_SHD_SSD_EN     0x0001  /* SSD enable */
>>>> -#define BCM5482_SHD_MODE       0x1f    /* 11111: Mode Control Register */
>>>> -#define BCM5482_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
>>>> +#define BCM54XX_SHD_MODE       0x1f    /* 11111: Mode Control Register */
>>>> +#define BCM54XX_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
>>>
>>> These registers are also present on my BCM5464, probably safe to
>>> assume they're generic for the entire family.
>>> So if you make the registers definitions common, you can probably make
>>> the 1000Base-X configuration common as well.
>>
>> If I understand correctly, your recommendation is to add a common function (such as "bcm54xx_config_1000bx") so it can be used by other BCM chips? Sure, I will take care of it.
>>
>>
>> Thanks,
>>
>> Tao
> 
> Regards,
> -Vladimir
> 

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-30 23:43 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, andrii.nakryiko@gmail.com,
	Kernel Team
In-Reply-To: <20190730195408.670063-3-andriin@fb.com>



> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> tools/lib/bpf/libbpf.h |   1 +
> 2 files changed, 909 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ead915aec349..75da90928257 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/vfs.h>
> +#include <sys/utsname.h>
> #include <tools/libc_compat.h>
> #include <libelf.h>
> #include <gelf.h>
> @@ -47,6 +48,7 @@
> #include "btf.h"
> #include "str_error.h"
> #include "libbpf_internal.h"
> +#include "hashmap.h"
> 
> #ifndef EM_BPF
> #define EM_BPF 247
> @@ -1015,17 +1017,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
> 	return 0;
> }
> 
> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> -						     __u32 id)
> +static const struct btf_type *
> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> {
> 	const struct btf_type *t = btf__type_by_id(btf, id);
> 
> +	if (res_id)
> +		*res_id = id;
> +
> 	while (true) {
> 		switch (BTF_INFO_KIND(t->info)) {
> 		case BTF_KIND_VOLATILE:
> 		case BTF_KIND_CONST:
> 		case BTF_KIND_RESTRICT:
> 		case BTF_KIND_TYPEDEF:
> +			if (res_id)
> +				*res_id = t->type;
> 			t = btf__type_by_id(btf, t->type);

So btf->types[*res_id] == retval, right? Then with retval and btf, we can
calculate *res_id without this change?

> 			break;
> 		default:
> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> 			      const struct btf_type *def,
> 			      const struct btf_member *m, __u32 *res) {
> -	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
> +	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> 	const char *name = btf__name_by_offset(btf, m->name_off);
> 	const struct btf_array *arr_info;
> 	const struct btf_type *arr_t;
> @@ -1110,7 +1117,7 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
> 		return -EOPNOTSUPP;
> 	}
> 
> -	def = skip_mods_and_typedefs(obj->btf, var->type);
> +	def = skip_mods_and_typedefs(obj->btf, var->type, NULL);
> 	if (BTF_INFO_KIND(def->info) != BTF_KIND_STRUCT) {
> 		pr_warning("map '%s': unexpected def kind %u.\n",
> 			   map_name, BTF_INFO_KIND(var->info));
> @@ -2292,6 +2299,893 @@ bpf_program_reloc_btf_ext(struct bpf_program *prog, struct bpf_object *obj,
> 	return 0;
> }
> 
> +#define BPF_CORE_SPEC_MAX_LEN 64
> +
> +/* represents BPF CO-RE field or array element accessor */
> +struct bpf_core_accessor {
> +	__u32 type_id;		/* struct/union type or array element type */
> +	__u32 idx;		/* field index or array index */
> +	const char *name;	/* field name or NULL for array accessor */
> +};
> +
> +struct bpf_core_spec {
> +	const struct btf *btf;
> +	/* high-level spec: named fields and array indices only */
> +	struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* high-level spec length */
> +	int len;
> +	/* raw, low-level spec: 1-to-1 with accessor spec string */
> +	int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* raw spec length */
> +	int raw_len;
> +	/* field byte offset represented by spec */
> +	__u32 offset;
> +};
> +
> +static bool str_is_empty(const char *s)
> +{
> +	return !s || !s[0];
> +}
> +
> +static int btf_kind(const struct btf_type *t)
> +{
> +	return BTF_INFO_KIND(t->info);
> +}
> +
> +static bool btf_is_composite(const struct btf_type *t)
> +{
> +	int kind = btf_kind(t);
> +
> +	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> +}
> +
> +static bool btf_is_array(const struct btf_type *t)
> +{
> +	return btf_kind(t) == BTF_KIND_ARRAY;
> +}
> +
> +/* 
> + * Turn bpf_offset_reloc into a low- and high-level spec representation,
> + * validating correctness along the way, as well as calculating resulting
> + * field offset (in bytes), specified by accessor string. Low-level spec
> + * captures every single level of nestedness, including traversing anonymous
> + * struct/union members. High-level one only captures semantically meaningful
> + * "turning points": named fields and array indicies.
> + * E.g., for this case:
> + *
> + *   struct sample {
> + *       int __unimportant;
> + *       struct {
> + *           int __1;
> + *           int __2;
> + *           int a[7];
> + *       };
> + *   };
> + *
> + *   struct sample *s = ...;
> + *
> + *   int x = &s->a[3]; // access string = '0:1:2:3'
> + *
> + * Low-level spec has 1:1 mapping with each element of access string (it's
> + * just a parsed access string representation): [0, 1, 2, 3].
> + *
> + * High-level spec will capture only 3 points:
> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> + *   - field 'a' access (corresponds to '2' in low-level spec);
> + *   - array element #3 access (corresponds to '3' in low-level spec).
> + *
> + */
> +static int bpf_core_spec_parse(const struct btf *btf,
> +			       __u32 type_id,
> +			       const char *spec_str,
> +			       struct bpf_core_spec *spec)
> +{
> +	int access_idx, parsed_len, i;
> +	const struct btf_type *t;
> +	const char *name;
> +	__u32 id;
> +	__s64 sz;
> +
> +	if (str_is_empty(spec_str) || *spec_str == ':')
> +		return -EINVAL;
> +
> +	memset(spec, 0, sizeof(*spec));
> +	spec->btf = btf;
> +
> +	/* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> +	while (*spec_str) {
> +		if (*spec_str == ':')
> +			++spec_str;
> +		if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> +			return -EINVAL;
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +		spec_str += parsed_len;
> +		spec->raw_spec[spec->raw_len++] = access_idx;
> +	}
> +
> +	if (spec->raw_len == 0)
> +		return -EINVAL;
> +
> +	/* first spec value is always reloc type array index */
> +	t = skip_mods_and_typedefs(btf, type_id, &id);
> +	if (!t)
> +		return -EINVAL;
> +
> +	access_idx = spec->raw_spec[0];
> +	spec->spec[0].type_id = id;
> +	spec->spec[0].idx = access_idx;
> +	spec->len++;
> +
> +	sz = btf__resolve_size(btf, id);
> +	if (sz < 0)
> +		return sz;
> +	spec->offset = access_idx * sz;
> +
> +	for (i = 1; i < spec->raw_len; i++) {
> +		t = skip_mods_and_typedefs(btf, id, &id);
> +		if (!t)
> +			return -EINVAL;
> +
> +		access_idx = spec->raw_spec[i];
> +
> +		if (btf_is_composite(t)) {
> +			const struct btf_member *m = (void *)(t + 1);
> +			__u32 offset;
> +
> +			if (access_idx >= BTF_INFO_VLEN(t->info))
> +				return -EINVAL;
> +
> +			m = &m[access_idx];
> +
> +			if (BTF_INFO_KFLAG(t->info)) {
> +				if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +					return -EINVAL;
> +				offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +			} else {
> +				offset = m->offset;
> +			}
> +			if (m->offset % 8)
> +				return -EINVAL;
> +			spec->offset += offset / 8;
> +
> +			if (m->name_off) {
> +				name = btf__name_by_offset(btf, m->name_off);
> +				if (str_is_empty(name))
> +					return -EINVAL;
> +
> +				spec->spec[spec->len].type_id = id;
> +				spec->spec[spec->len].idx = access_idx;
> +				spec->spec[spec->len].name = name;
> +				spec->len++;
> +			}
> +
> +			id = m->type;
> +		} else if (btf_is_array(t)) {
> +			const struct btf_array *a = (void *)(t + 1);
> +
> +			t = skip_mods_and_typedefs(btf, a->type, &id);
> +			if (!t || access_idx >= a->nelems)
> +				return -EINVAL;
> +
> +			spec->spec[spec->len].type_id = id;
> +			spec->spec[spec->len].idx = access_idx;
> +			spec->len++;
> +
> +			sz = btf__resolve_size(btf, id);
> +			if (sz < 0)
> +				return sz;
> +			spec->offset += access_idx * sz;
> +		} else {
> +			pr_warning("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n",
> +				   type_id, spec_str, i, id, btf_kind(t));
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/* Given 'some_struct_name___with_flavor' return the length of a name prefix
> + * before last triple underscore. Struct name part after last triple
> + * underscore is ignored by BPF CO-RE relocation during relocation matching.
> + */
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> +	size_t n = strlen(name);
> +	int i = n - 3;
> +
> +	while (i > 0) {
> +		if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> +			return i;
> +		i--;
> +	}
> +	return n;
> +}
> +
> +/* dynamically sized list of type IDs */
> +struct ids_vec {
> +	__u32 *data;
> +	int len;
> +};
> +
> +static void bpf_core_free_cands(struct ids_vec *cand_ids)
> +{
> +	free(cand_ids->data);
> +	free(cand_ids);
> +}
> +
> +static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf,
> +					   __u32 local_type_id,
> +					   const struct btf *targ_btf)
> +{
> +	size_t local_essent_len, targ_essent_len;
> +	const char *local_name, *targ_name;
> +	const struct btf_type *t;
> +	struct ids_vec *cand_ids;
> +	__u32 *new_ids;
> +	int i, err, n;
> +
> +	t = btf__type_by_id(local_btf, local_type_id);
> +	if (!t)
> +		return ERR_PTR(-EINVAL);
> +
> +	local_name = btf__name_by_offset(local_btf, t->name_off);
> +	if (str_is_empty(local_name))
> +		return ERR_PTR(-EINVAL);
> +	local_essent_len = bpf_core_essential_name_len(local_name);
> +
> +	cand_ids = calloc(1, sizeof(*cand_ids));
> +	if (!cand_ids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	n = btf__get_nr_types(targ_btf);
> +	for (i = 1; i <= n; i++) {
> +		t = btf__type_by_id(targ_btf, i);
> +		targ_name = btf__name_by_offset(targ_btf, t->name_off);
> +		if (str_is_empty(targ_name))
> +			continue;
> +
> +		targ_essent_len = bpf_core_essential_name_len(targ_name);
> +		if (targ_essent_len != local_essent_len)
> +			continue;
> +
> +		if (strncmp(local_name, targ_name, local_essent_len) == 0) {
> +			pr_debug("[%d] (%s): found candidate [%d] (%s)\n",
> +				 local_type_id, local_name, i, targ_name);
> +			new_ids = realloc(cand_ids->data, cand_ids->len + 1);
> +			if (!new_ids) {
> +				err = -ENOMEM;
> +				goto err_out;
> +			}
> +			cand_ids->data = new_ids;
> +			cand_ids->data[cand_ids->len++] = i;
> +		}
> +	}
> +	return cand_ids;
> +err_out:
> +	bpf_core_free_cands(cand_ids);
> +	return ERR_PTR(err);
> +}
> +
> +/* Check two types for compatibility, skipping const/volatile/restrict and
> + * typedefs, to ensure we are relocating offset to the compatible entities:
> + *   - any two STRUCTs/UNIONs are compatible and can be mixed;
> + *   - any two FWDs are compatible;
> + *   - any two PTRs are always compatible;
> + *   - for ENUMs, check sizes, names are ignored;
> + *   - for INT, size and bitness should match, signedness is ignored;
> + *   - for ARRAY, dimensionality is ignored, element types are checked for
> + *     compatibility recursively;
> + *   - everything else shouldn't be ever a target of relocation.
> + * These rules are not set in stone and probably will be adjusted as we get
> + * more experience with using BPF CO-RE relocations.
> + */
> +static int bpf_core_fields_are_compat(const struct btf *local_btf,
> +				      __u32 local_id,
> +				      const struct btf *targ_btf,
> +				      __u32 targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	__u16 kind;
> +
> +recur:
> +	local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id);
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!local_type || !targ_type)
> +		return -EINVAL;
> +
> +	if (btf_is_composite(local_type) && btf_is_composite(targ_type))
> +		return 1;
> +	if (BTF_INFO_KIND(local_type->info) != BTF_INFO_KIND(targ_type->info))
> +		return 0;
> +
> +	kind = BTF_INFO_KIND(local_type->info);
> +	switch (kind) {
> +	case BTF_KIND_FWD:
> +	case BTF_KIND_PTR:
> +		return 1;
> +	case BTF_KIND_ENUM:
> +		return local_type->size == targ_type->size;
> +	case BTF_KIND_INT: {
> +		__u32 loc_int = *(__u32 *)(local_type + 1);
> +		__u32 targ_int = *(__u32 *)(targ_type + 1);
> +
> +		return BTF_INT_OFFSET(loc_int) == 0 &&
> +		       BTF_INT_OFFSET(targ_int) == 0 &&
> +		       local_type->size == targ_type->size &&
> +		       BTF_INT_BITS(loc_int) == BTF_INT_BITS(targ_int);
> +	}
> +	case BTF_KIND_ARRAY: {
> +		const struct btf_array *loc_a, *targ_a;
> +
> +		loc_a = (void *)(local_type + 1);
> +		targ_a = (void *)(targ_type + 1);
> +		local_id = loc_a->type;
> +		targ_id = targ_a->type;
> +		goto recur;
> +	}
> +	default:
> +		pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> +			   kind, local_id, targ_id);
> +		return 0;
> +	}
> +}
> +
> +/* 
> + * Given single high-level named field accessor in local type, find
> + * corresponding high-level accessor for a target type. Along the way,
> + * maintain low-level spec for target as well. Also keep updating target
> + * offset.
> + *
> + * Searching is performed through recursive exhaustive enumeration of all
> + * fields of a struct/union. If there are any anonymous (embedded)
> + * structs/unions, they are recursively searched as well. If field with
> + * desired name is found, check compatibility between local and target types,
> + * before returning result.
> + *
> + * 1 is returned, if field is found.
> + * 0 is returned if no compatible field is found.
> + * <0 is returned on error.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> +				 const struct bpf_core_accessor *local_acc,
> +				 const struct btf *targ_btf,
> +				 __u32 targ_id,
> +				 struct bpf_core_spec *spec,
> +				 __u32 *next_targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	const struct btf_member *local_member, *m;
> +	const char *local_name, *targ_name;
> +	__u32 local_id;
> +	int i, n, found;
> +
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!targ_type)
> +		return -EINVAL;
> +	if (!btf_is_composite(targ_type))
> +		return 0;
> +
> +	local_id = local_acc->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	local_member = (void *)(local_type + 1);
> +	local_member += local_acc->idx;
> +	local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> +	n = BTF_INFO_VLEN(targ_type->info);
> +	m = (void *)(targ_type + 1);
> +	for (i = 0; i < n; i++, m++) {
> +		__u32 offset;
> +
> +		/* bitfield relocations not supported */
> +		if (BTF_INFO_KFLAG(targ_type->info)) {
> +			if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +				continue;
> +			offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +		} else {
> +			offset = m->offset;
> +		}
> +		if (offset % 8)
> +			continue;
> +
> +		/* too deep struct/union/array nesting */
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +
> +		/* speculate this member will be the good one */
> +		spec->offset += offset / 8;
> +		spec->raw_spec[spec->raw_len++] = i;
> +
> +		targ_name = btf__name_by_offset(targ_btf, m->name_off);
> +		if (str_is_empty(targ_name)) {
> +			/* embedded struct/union, we need to go deeper */
> +			found = bpf_core_match_member(local_btf, local_acc,
> +						      targ_btf, m->type,
> +						      spec, next_targ_id);
> +			if (found) /* either found or error */
> +				return found;
> +		} else if (strcmp(local_name, targ_name) == 0) {
> +			/* matching named field */
> +			struct bpf_core_accessor *targ_acc;
> +
> +			targ_acc = &spec->spec[spec->len++];
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = i;
> +			targ_acc->name = targ_name;
> +
> +			*next_targ_id = m->type;
> +			found = bpf_core_fields_are_compat(local_btf,
> +							   local_member->type,
> +							   targ_btf, m->type);
> +			if (!found)
> +				spec->len--; /* pop accessor */
> +			return found;
> +		}
> +		/* member turned out not to be what we looked for */
> +		spec->offset -= offset / 8;
> +		spec->raw_len--;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Try to match local spec to a target type and, if successful, produce full
> + * target spec (high-level, low-level + offset).
> + */
> +static int bpf_core_spec_match(struct bpf_core_spec *local_spec,
> +			       const struct btf *targ_btf, __u32 targ_id,
> +			       struct bpf_core_spec *targ_spec)
> +{
> +	const struct btf_type *targ_type;
> +	const struct bpf_core_accessor *local_acc;
> +	struct bpf_core_accessor *targ_acc;
> +	int i, sz, matched;
> +
> +	memset(targ_spec, 0, sizeof(*targ_spec));
> +	targ_spec->btf = targ_btf;
> +
> +	local_acc = &local_spec->spec[0];
> +	targ_acc = &targ_spec->spec[0];
> +
> +	for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> +		targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> +						   &targ_id);
> +		if (!targ_type)
> +			return -EINVAL;
> +
> +		if (local_acc->name) {
> +			matched = bpf_core_match_member(local_spec->btf,
> +							local_acc,
> +							targ_btf, targ_id,
> +							targ_spec, &targ_id);
> +			if (matched <= 0)
> +				return matched;
> +		} else {
> +			/* for i=0, targ_id is already treated as array element
> +			 * type (because it's the original struct), for others
> +			 * we should find array element type first
> +			 */
> +			if (i > 0) {
> +				const struct btf_array *a;
> +
> +				if (!btf_is_array(targ_type))
> +					return 0;
> +
> +				a = (void *)(targ_type + 1);
> +				if (local_acc->idx >= a->nelems)
> +					return 0;
> +				if (!skip_mods_and_typedefs(targ_btf, a->type,
> +							    &targ_id))
> +					return -EINVAL;
> +			}
> +
> +			/* too deep struct/union/array nesting */
> +			if (targ_spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +				return -E2BIG;
> +
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = local_acc->idx;
> +			targ_acc->name = NULL;
> +			targ_spec->len++;
> +			targ_spec->raw_spec[targ_spec->raw_len] = targ_acc->idx;
> +			targ_spec->raw_len++;
> +
> +			sz = btf__resolve_size(targ_btf, targ_id);
> +			if (sz < 0)
> +				return sz;
> +			targ_spec->offset += local_acc->idx * sz;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * Patch relocatable BPF instruction.
> + * Expected insn->imm value is provided for validation, as well as the new
> + * relocated value.
> + *
> + * Currently three kinds of BPF instructions are supported:
> + * 1. rX = <imm> (assignment with immediate operand);
> + * 2. rX += <imm> (arithmetic operations with immediate operand);
> + * 3. *(rX) = <imm> (indirect memory assignment with immediate operand).
> + *
> + * If actual insn->imm value is wrong, bail out.
> + */
> +static int bpf_core_reloc_insn(struct bpf_program *prog, int insn_off,
> +			       __u32 orig_off, __u32 new_off)
> +{
> +	struct bpf_insn *insn;
> +	int insn_idx;
> +	__u8 class;
> +
> +	if (insn_off % sizeof(struct bpf_insn))
> +		return -EINVAL;
> +	insn_idx = insn_off / sizeof(struct bpf_insn);
> +
> +	insn = &prog->insns[insn_idx];
> +	class = BPF_CLASS(insn->code);
> +
> +	if (class == BPF_ALU || class == BPF_ALU64) {
> +		if (BPF_SRC(insn->code) != BPF_K)
> +			return -EINVAL;
> +		if (insn->imm != orig_off)
> +			return -EINVAL;
> +		insn->imm = new_off;
> +		pr_debug("prog '%s': patched insn #%d (ALU/ALU64) imm %d -> %d\n",
> +			 bpf_program__title(prog, false),
> +			 insn_idx, orig_off, new_off);
> +	} else {
> +		pr_warning("prog '%s': trying to relocate unrecognized insn #%d, code:%x, src:%x, dst:%x, off:%x, imm:%x\n",
> +			   bpf_program__title(prog, false),
> +			   insn_idx, insn->code, insn->src_reg, insn->dst_reg,
> +			   insn->off, insn->imm);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> + * data out of it to use for target BTF.
> + */
> +static struct btf *bpf_core_find_kernel_btf(void)
> +{
> +	const char *locations[] = {
> +		"/lib/modules/%1$s/vmlinux-%1$s",
> +		"/usr/lib/modules/%1$s/kernel/vmlinux",
> +	};
> +	char path[PATH_MAX + 1];
> +	struct utsname buf;
> +	struct btf *btf;
> +	int i, err;
> +
> +	err = uname(&buf);
> +	if (err) {
> +		pr_warning("failed to uname(): %d\n", err);
> +		return ERR_PTR(err);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(locations); i++) {
> +		snprintf(path, PATH_MAX, locations[i], buf.release);
> +		pr_debug("attempting to load kernel BTF from '%s'\n", path);
> +
> +		if (access(path, R_OK))
> +			continue;
> +
> +		btf = btf__parse_elf(path, NULL);
> +		if (IS_ERR(btf))
> +			continue;
> +
> +		pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> +		return btf;
> +	}
> +
> +	pr_warning("failed to find valid kernel BTF\n");
> +	return ERR_PTR(-ESRCH);
> +}
> +
> +/* Output spec definition in the format:
> + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> + */
> +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> +{
> +	const struct btf_type *t;
> +	const char *s;
> +	__u32 type_id;
> +	int i;
> +
> +	type_id = spec->spec[0].type_id;
> +	t = btf__type_by_id(spec->btf, type_id);
> +	s = btf__name_by_offset(spec->btf, t->name_off);
> +	libbpf_print(level, "[%u] (%s) + ", type_id, s);
> +
> +	for (i = 0; i < spec->raw_len; i++)
> +		libbpf_print(level, "%d%s", spec->raw_spec[i],
> +			     i == spec->raw_len - 1 ? " => " : ":");
> +
> +	libbpf_print(level, "%u @ &x", spec->offset);
> +
> +	for (i = 0; i < spec->len; i++) {
> +		if (spec->spec[i].name)
> +			libbpf_print(level, ".%s", spec->spec[i].name);
> +		else
> +			libbpf_print(level, "[%u]", spec->spec[i].idx);
> +	}
> +
> +}
> +
> +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> +{
> +	return (size_t)key;
> +}
> +
> +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> +{
> +	return k1 == k2;
> +}
> +
> +static void *u32_to_ptr(__u32 x)
> +{
> +	return (void *)(uintptr_t)x;
> +}
> +
> +/* 
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + *    Candidate type is a type with the same "essential" name, ignoring
> + *    everything after last triple underscore (___). E.g., `sample`,
> + *    `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + *    for each other. Names with triple underscore are referred to as
> + *    "flavors" and are useful, among other things, to allow to
> + *    specify/support incompatible variations of the same kernel struct, which
> + *    might differ between different kernel versions and/or build
> + *    configurations.
> + *
> + *    N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> + *    converter, when deduplicated BTF of a kernel still contains more than
> + *    one different types with the same name. In that case, ___2, ___3, etc
> + *    are appended starting from second name conflict. But start flavors are
> + *    also useful to be defined "locally", in BPF program, to extract same
> + *    data from incompatible changes between different kernel
> + *    versions/configurations. For instance, to handle field renames between
> + *    kernel versions, one can use two flavors of the struct name with the
> + *    same common name and use conditional relocations to extract that field,
> + *    depending on target kernel version.
> + * 2. For each candidate type, try to match local specification to this
> + *    candidate target type. Matching involves finding corresponding
> + *    high-level spec accessors, meaning that all named fields should match,
> + *    as well as all array accesses should be within the actual bounds. Also,
> + *    types should be compatible (see bpf_core_fields_are_compat for details).
> + * 3. It is supported and expected that there might be multiple flavors
> + *    matching the spec. As long as all the specs resolve to the same set of
> + *    offsets across all candidates, there is not error. If there is any
> + *    ambiguity, CO-RE relocation will fail. This is necessary to accomodate
> + *    imprefection of BTF deduplication, which can cause slight duplication of
> + *    the same BTF type, if some directly or indirectly referenced (by
> + *    pointer) type gets resolved to different actual types in different
> + *    object files. If such situation occurs, deduplicated BTF will end up
> + *    with two (or more) structurally identical types, which differ only in
> + *    types they refer to through pointer. This should be OK in most cases and
> + *    is not an error.
> + * 4. Candidate types search is performed by linearly scanning through all
> + *    types in target BTF. It is anticipated that this is overall more
> + *    efficient memory-wise and not significantly worse (if not better)
> + *    CPU-wise compared to prebuilding a map from all local type names to
> + *    a list of candidate type names. It's also sped up by caching resolved
> + *    list of matching candidates per each local "root" type ID, that has at
> + *    least one bpf_offset_reloc associated with it. This list is shared
> + *    between multiple relocations for the same type ID and is updated as some
> + *    of the candidates are pruned due to structural incompatibility.
> + */
> +static int bpf_core_reloc_offset(struct bpf_program *prog,
> +				 const struct bpf_offset_reloc *relo,
> +				 int relo_idx,
> +				 const struct btf *local_btf,
> +				 const struct btf *targ_btf,
> +				 struct hashmap *cand_cache)
> +{
> +	const char *prog_name = bpf_program__title(prog, false);
> +	struct bpf_core_spec local_spec, cand_spec, targ_spec;
> +	const void *type_key = u32_to_ptr(relo->type_id);
> +	const struct btf_type *local_type, *cand_type;
> +	const char *local_name, *cand_name;
> +	struct ids_vec *cand_ids;
> +	__u32 local_id, cand_id;
> +	const char *spec_str;
> +	int i, j, err;
> +
> +	local_id = relo->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	if (!local_type)
> +		return -EINVAL;
> +
> +	local_name = btf__name_by_offset(local_btf, local_type->name_off);
> +	if (str_is_empty(local_name))
> +		return -EINVAL;
> +
> +	spec_str = btf__name_by_offset(local_btf, relo->access_str_off);
> +	if (str_is_empty(spec_str))
> +		return -EINVAL;
> +
> +	err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: parsing [%d] (%s) + %s failed: %d\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str,
> +			   err);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("prog '%s': relo #%d: spec is ", prog_name, relo_idx);
> +	bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +	libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +	if (!hashmap__find(cand_cache, type_key, (void **)&cand_ids)) {
> +		cand_ids = bpf_core_find_cands(local_btf, local_id, targ_btf);
> +		if (IS_ERR(cand_ids)) {
> +			pr_warning("prog '%s': relo #%d: target candidate search failed for [%d] (%s): %ld",
> +				   prog_name, relo_idx, local_id, local_name,
> +				   PTR_ERR(cand_ids));
> +			return PTR_ERR(cand_ids);
> +		}
> +		err = hashmap__set(cand_cache, type_key, cand_ids, NULL, NULL);
> +		if (err) {
> +			bpf_core_free_cands(cand_ids);
> +			return err;
> +		}
> +	}
> +
> +	for (i = 0, j = 0; i < cand_ids->len; i++) {
> +		cand_id = cand_ids->data[i];
> +		cand_type = btf__type_by_id(targ_btf, cand_id);
> +		cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> +		err = bpf_core_spec_match(&local_spec, targ_btf,
> +					  cand_id, &cand_spec);
> +		if (err < 0) {
> +			pr_warning("prog '%s': relo #%d: failed to match spec ",
> +				   prog_name, relo_idx);
> +			bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " to candidate #%d [%d] (%s): %d\n",
> +				     i, cand_id, cand_name, err);
> +			return err;
> +		}
> +		if (err == 0) {
> +			pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> +				 prog_name, relo_idx, i, cand_id, cand_name);
> +			bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +			libbpf_print(LIBBPF_DEBUG, "\n");
> +			continue;
> +		}
> +
> +		pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> +			 prog_name, relo_idx, i);
> +		bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec);
> +		libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +		if (j == 0) {
> +			targ_spec = cand_spec;
> +		} else if (cand_spec.offset != targ_spec.offset) {
> +			/* if there are many candidates, they should all
> +			 * resolve to the same offset
> +			 */
> +			pr_warning("prog '%s': relo #%d: candidate #%d spec ",
> +				   prog_name, relo_idx, i);
> +			bpf_core_dump_spec(LIBBPF_WARN, &cand_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " conflicts with target spec ");
> +			bpf_core_dump_spec(LIBBPF_WARN, &targ_spec);
> +			libbpf_print(LIBBPF_WARN, "\n");
> +			return -EINVAL;
> +		}
> +
> +		cand_ids->data[j++] = cand_spec.spec[0].type_id;
> +	}
> +
> +	cand_ids->len = j;
> +	if (cand_ids->len == 0) {
> +		pr_warning("prog '%s': relo #%d: no matching targets found for [%d] (%s) + %s\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str);
> +		return -ESRCH;
> +	}
> +
> +	err = bpf_core_reloc_insn(prog, relo->insn_off,
> +				  local_spec.offset, targ_spec.offset);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: failed to patch insn at offset %d: %d\n",
> +			   prog_name, relo_idx, relo->insn_off, err);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +bpf_core_reloc_offsets(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	const struct btf_ext_info_sec *sec;
> +	const struct bpf_offset_reloc *rec;
> +	const struct btf_ext_info *seg;
> +	struct hashmap_entry *entry;
> +	struct hashmap *cand_cache = NULL;
> +	struct bpf_program *prog;
> +	struct btf *targ_btf;
> +	const char *sec_name;
> +	int i, err = 0;
> +
> +	if (targ_btf_path)
> +		targ_btf = btf__parse_elf(targ_btf_path, NULL);
> +	else
> +		targ_btf = bpf_core_find_kernel_btf();
> +	if (IS_ERR(targ_btf)) {
> +		pr_warning("failed to get target BTF: %ld\n",
> +			   PTR_ERR(targ_btf));
> +		return PTR_ERR(targ_btf);
> +	}
> +
> +	cand_cache = hashmap__new(bpf_core_hash_fn, bpf_core_equal_fn, NULL);
> +	if (IS_ERR(cand_cache)) {
> +		err = PTR_ERR(cand_cache);
> +		goto out;
> +	}
> +
> +	seg = &obj->btf_ext->offset_reloc_info;
> +	for_each_btf_ext_sec(seg, sec) {
> +		sec_name = btf__name_by_offset(obj->btf, sec->sec_name_off);
> +		if (str_is_empty(sec_name)) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +		prog = bpf_object__find_program_by_title(obj, sec_name);
> +		if (!prog) {
> +			pr_warning("failed to find program '%s' for CO-RE offset relocation\n",
> +				   sec_name);
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		pr_debug("prog '%s': performing %d CO-RE offset relocs\n",
> +			 sec_name, sec->num_info);
> +
> +		for_each_btf_ext_rec(seg, sec, i, rec) {
> +			err = bpf_core_reloc_offset(prog, rec, i, obj->btf,
> +						    targ_btf, cand_cache);
> +			if (err) {
> +				pr_warning("prog '%s': relo #%d: failed to relocate: %d\n",
> +					   sec_name, i, err);
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	btf__free(targ_btf);
> +	if (!IS_ERR_OR_NULL(cand_cache)) {
> +		hashmap__for_each_entry(cand_cache, entry, i) {
> +			bpf_core_free_cands(entry->value);
> +		}
> +		hashmap__free(cand_cache);
> +	}
> +	return err;
> +}
> +
> +static int
> +bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	int err = 0;
> +
> +	if (obj->btf_ext->offset_reloc_info.len)
> +		err = bpf_core_reloc_offsets(obj, targ_btf_path);
> +
> +	return err;
> +}
> +
> static int
> bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj,
> 			struct reloc_desc *relo)
> @@ -2399,14 +3293,21 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
> 	return 0;
> }
> 
> -
> static int
> -bpf_object__relocate(struct bpf_object *obj)
> +bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path)
> {
> 	struct bpf_program *prog;
> 	size_t i;
> 	int err;
> 
> +	if (obj->btf_ext) {
> +		err = bpf_object__relocate_core(obj, targ_btf_path);
> +		if (err) {
> +			pr_warning("failed to perform CO-RE relocations: %d\n",
> +				   err);
> +			return err;
> +		}
> +	}
> 	for (i = 0; i < obj->nr_programs; i++) {
> 		prog = &obj->programs[i];
> 
> @@ -2807,7 +3708,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
> 	obj->loaded = true;
> 
> 	CHECK_ERR(bpf_object__create_maps(obj), err, out);
> -	CHECK_ERR(bpf_object__relocate(obj), err, out);
> +	CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
> 	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
> 
> 	return 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 8a9d462a6f6d..e8f70977d137 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -92,6 +92,7 @@ LIBBPF_API void bpf_object__close(struct bpf_object *object);
> struct bpf_object_load_attr {
> 	struct bpf_object *obj;
> 	int log_level;
> +	const char *target_btf_path;
> };
> 
> /* Load/unload object into/from kernel */
> -- 
> 2.17.1
> 


^ permalink raw reply

* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Stephen Hemminger @ 2019-07-30 23:39 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <20190722214023.9513-16-snelson@pensando.io>

On Mon, 22 Jul 2019 14:40:19 -0700
Shannon Nelson <snelson@pensando.io> wrote:

> +
> +static void ionic_lif_set_netdev_info(struct lif *lif)
> +{
> +	struct ionic_admin_ctx ctx = {
> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> +		.cmd.lif_setattr = {
> +			.opcode = CMD_OPCODE_LIF_SETATTR,
> +			.index = cpu_to_le16(lif->index),
> +			.attr = IONIC_LIF_ATTR_NAME,
> +		},
> +	};
> +
> +	strlcpy(ctx.cmd.lif_setattr.name, lif->netdev->name,
> +		sizeof(ctx.cmd.lif_setattr.name));
> +
> +	dev_info(lif->ionic->dev, "NETDEV_CHANGENAME %s %s\n",
> +		 lif->name, ctx.cmd.lif_setattr.name);
> +

There is already a kernel message for this. Repeating the same thing in the
driver is redundant.

^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Alexei Starovoitov @ 2019-07-30 23:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, netdev
In-Reply-To: <20190730155915.5bbe3a03@cakuba.netronome.com>

On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:
> > Currently, bpftool net only supports dumping progs loaded on the
> > interface. To load XDP prog on interface, user must use other tool
> > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > (un)load XDP prog on interface.
> 
> I don't understand why using another tool is a bad thing :(
> What happened to the Unix philosophy?
> 
> I remain opposed to duplicating iproute2's functionality under 
> bpftool net :( The way to attach bpf programs in the networking
> subsystem is through the iproute2 commends - ip and tc.. 
> 
> It seems easy enough to add a feature to bpftool but from 
> a perspective of someone adding a new feature to the kernel, 
> and wanting to update user space components it's quite painful :(
> 
> So could you describe to me in more detail why this is a good idea?
> Perhaps others can chime in?

I don't think it has anything to do with 'unix philosophy'.
Here the proposal to teach bpftool to attach xdp progs.
I see nothing wrong with that.
Another reason is iproute2 is still far away from adopting libbpf.
So all the latest goodness like BTF, introspection, etc will not
be available to iproute2 users for some time.
Even when iproute2 is ready it would be convenient for folks like me
(who need to debug stuff in production) to remember cmd line of
bpftool only to introspect the server. Debugging often includes
detaching/attaching progs. Not only doing 'bpftool p s'.

If bpftool was taught to do equivalent of 'ip link' that would be
very different story and I would be opposed to that.

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf : make libbpf_num_possible_cpus function thread safe
From: Jakub Kicinski @ 2019-07-30 23:05 UTC (permalink / raw)
  To: Takshak Chahande; +Cc: netdev, ast, daniel, rdna, kernel-team, hechaol
In-Reply-To: <20190730222447.3918919-1-ctakshak@fb.com>

On Tue, 30 Jul 2019 15:24:47 -0700, Takshak Chahande wrote:
> Having static variable `cpus` in libbpf_num_possible_cpus function without
> guarding it with mutex makes this function thread-unsafe.
> 
> If multiple threads accessing this function, in the current form; it
> leads to incrementing the static variable value `cpus` in the multiple
> of total available CPUs.
> 
> Let caching the number of possile CPUs handled by libbpf's users than
> this library itself; 

Can we just use stack variable for the calculations and
READ_ONCE()/WRITE_ONCE() for assignment to the static?  
libbpf itself uses this helper so caller caching wouldn't
work there.

> and let this function be rock bottom one which reads
> and parse the file (/sys/devices/system/cpu/possible) everytime it gets
> called to simplify the things.

I don't understand can you rephrase?

> Fixes: 6446b3155521 (bpf: add a new API libbpf_num_possible_cpus())
> 

No new line after the fixes tag, also I think you're missing quotation
marks around the commit title?

> Signed-off-by: Takshak Chahande <ctakshak@fb.com>
> Acked-by: Andrey Ignatov <rdna@fb.com>

^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-30 22:59 UTC (permalink / raw)
  To: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190730184821.10833-1-danieltimlee@gmail.com>

On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:
> Currently, bpftool net only supports dumping progs loaded on the
> interface. To load XDP prog on interface, user must use other tool
> (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> (un)load XDP prog on interface.

I don't understand why using another tool is a bad thing :(
What happened to the Unix philosophy?

I remain opposed to duplicating iproute2's functionality under 
bpftool net :( The way to attach bpf programs in the networking
subsystem is through the iproute2 commends - ip and tc.. 

It seems easy enough to add a feature to bpftool but from 
a perspective of someone adding a new feature to the kernel, 
and wanting to update user space components it's quite painful :(

So could you describe to me in more detail why this is a good idea?
Perhaps others can chime in?

>     $ ./bpftool prog
>     ...
>     208: xdp  name xdp_prog1  tag ad822e38b629553f  gpl
>       loaded_at 2019-07-28T18:03:11+0900  uid 0
>     ...
>     $ ./bpftool net load id 208 xdpdrv enp6s0np1
>     $ ./bpftool net
>     xdp:
>     enp6s0np1(5) driver id 208
>     ...
>     $ ./bpftool net unload xdpdrv enp6s0np1
>     $ ./bpftool net
>     xdp:
>     ...
> 
> The word 'load' is used instead of 'attach', since XDP program is not
> considered as 'bpf_attach_type' and can't be attached with
> 'BPF_PROG_ATTACH'. In this context, the meaning of 'load' is, prog will
> be loaded on interface.
> 
> While this patch only contains support for XDP, through `net (un)load`,
> bpftool can further support other prog attach types.
> 
> XDP (un)load tested on Netronome Agilio.

^ permalink raw reply

* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Shannon Nelson @ 2019-07-30 22:53 UTC (permalink / raw)
  To: Saeed Mahameed, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <a27ba11984c8872a35206bd9fbeee0800ba7b050.camel@mellanox.com>

On 7/25/19 4:55 PM, Saeed Mahameed wrote:
> On Mon, 2019-07-22 at 14:40 -0700, Shannon Nelson wrote:
>> When the netdev gets a new name from userland, pass that name
>> down to the NIC for internal tracking.
>>
> Just out of curiosity, why your NIC internal device/firmware need to
> keep tracking of the netdev name ?
>
>
It is helpful in a debugging method inside the NIC firmware.

sln


^ permalink raw reply

* Re: [patch net-next v2 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jakub Kicinski @ 2019-07-30 22:40 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-4-jiri@resnulli.us>

On Tue, 30 Jul 2019 10:57:34 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> When user does create new netdevsim instance using sysfs bus file,
> create the devlink instance and related netdev instance in the namespace
> of the caller.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

> v1->v2:
> - remove net_namespace.h include and forward decralared net struct
> - add comment to initial_net pointer

Thanks!

^ permalink raw reply

* Re: [patch net-next v2 2/3] net: devlink: export devlink net set/get helpers
From: Jakub Kicinski @ 2019-07-30 22:40 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-3-jiri@resnulli.us>

On Tue, 30 Jul 2019 10:57:33 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Allow drivers to set/get net struct for devlink instance. Set is only
> allowed for newly allocated devlink instance.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox