* Re: [PATCH iproute2] bpf: add btf func and func_proto kind support
From: Stephen Hemminger @ 2019-02-05 23:30 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20190125004107.3075254-1-yhs@fb.com>
On Thu, 24 Jan 2019 16:41:07 -0800
Yonghong Song <yhs@fb.com> wrote:
> The issue is discovered for bpf selftest test_skb_cgroup.sh.
> Currently we have,
> $ ./test_skb_cgroup_id.sh
> Wait for testing link-local IP to become available ... OK
> Object has unknown BTF type: 13!
> [PASS]
>
> In the above the BTF type 13 refers to BTF kind
> BTF_KIND_FUNC_PROTO.
> This patch added support of BTF_KIND_FUNC_PROTO and
> BTF_KIND_FUNC during type parsing.
> With this patch, I got
> $ ./test_skb_cgroup_id.sh
> Wait for testing link-local IP to become available ... OK
> [PASS]
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
Sure applied.
^ permalink raw reply
* Re: [PATCH iproute2-master] bridge: fdb: Fix FDB dump with strict checking disabled
From: Stephen Hemminger @ 2019-02-05 23:28 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev@vger.kernel.org, dsahern@gmail.com, ivecera@redhat.com,
liali@redhat.com, mlxsw
In-Reply-To: <20190125170837.6434-1-idosch@mellanox.com>
On Fri, 25 Jan 2019 17:09:17 +0000
Ido Schimmel <idosch@mellanox.com> wrote:
> While iproute2 correctly uses ifinfomsg struct as the ancillary header
> when requesting an FDB dump on old kernels, it sets the message type to
> RTM_GETLINK. This results in wrong reply being returned.
>
> Fix this by using RTM_GETNEIGH instead.
>
> Before:
> $ bridge fdb show brport dummy0
> Not RTM_NEWNEIGH: 00000158 00000010 00000002
>
> After:
> $ bridge fdb show brport dummy0
> 2a:0b:41:1c:92:d3 vlan 1 master br0 permanent
> 2a:0b:41:1c:92:d3 master br0 permanent
> 33:33:00:00:00:01 self permanent
> 01:00:5e:00:00:01 self permanent
>
> Fixes: 05880354c2cf ("bridge: fdb: Fix filtering with strict checking disabled")
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Reported-by: LiLiang <liali@redhat.com>
Looks good applied
^ permalink raw reply
* Re: [patch iproute2] libnetlink: linkdump_req: AF_PACKET family also expects ext_filter_mask
From: Stephen Hemminger @ 2019-02-05 23:26 UTC (permalink / raw)
To: Chris Mi; +Cc: netdev, dsahern
In-Reply-To: <1548412627-6006-1-git-send-email-chrism@mellanox.com>
On Fri, 25 Jan 2019 10:37:07 +0000
Chris Mi <chrism@mellanox.com> wrote:
> Without this fix, the VF info can't be showed using command
> "ip link".
>
> 146: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
> link/ether 24:8a:07:ad:78:52 brd ff:ff:ff:ff:ff:ff
> vf 0 MAC 02:25:d0:12:01:01, spoof checking off, link-state auto, trust off, query_rss off
> vf 1 MAC 02:25:d0:12:01:02, spoof checking off, link-state auto, trust off, query_rss off
>
> Fixes: d97b16b2c906 ("libnetlink: linkdump_req: Only AF_UNSPEC family expects an ext_filter_mask")
>
> Signed-off-by: Chris Mi <chrism@mellanox.com>
Applied
^ permalink raw reply
* Re: Kernel panic in eth_header
From: Andrew @ 2019-02-05 23:09 UTC (permalink / raw)
To: Netdev
In-Reply-To: <7c2280ea-27dc-667f-78b4-e7eaaf97d337@gmail.com>
Thanks. At least, IPv4 tests passed (IPv4 overlapped - fails at first
time, but passes on next times; on IPv6 I've got 'send_fragment:
operation not permitted', I didn't look deeply because I don't use
IPv6). No KASAN warnings in dmesg.
On 05.02.2019 22:28, Eric Dumazet wrote:
>
> On 02/05/2019 12:21 PM, Andrew wrote:
>
>> I think that backport will be trivial - at least patch lays smoothly on 4.9 (just with offsets difference).
>>
>> I'll test it.
>>
>> Btw, maybe there's a some test conditions to quickly check if patch helps? Crash is reproducible with unpredictable interval (tens of hours of quite heavy load).
>>
> Build your kernel with CONFIG_KASAN=y
>
> Then run the tests Peter wrote.
>
> 4c3510483d26420d2c2c7cc075ad872286cc5932 selftests: net: ip_defrag: cover new IPv6 defrag behavior
> 3271a4821882a64214acc1bd7b173900ec70c9bf selftests: net: fix/improve ip_defrag selftest
> bccc17118bcf3c62c947361d51760334f6602f43 selftests/net: add ipv6 tests to ip_defrag selftest
> 02c7f38b7ace9f1b2ddb7a88139127eef4cf8706 selftests/net: add ip_defrag selftest
>
>
^ permalink raw reply
* [PATCH net] net: dsa: mv88e6xxx: Fix counting of ATU violations
From: Andrew Lunn @ 2019-02-05 23:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Florian Fainelli, Andrew Lunn
The ATU port vector contains a bit per port of the switch. The code
wrongly used it as a port number, and incremented a port counter. This
resulted in the wrong interfaces counter being incremented, and
potentially going off the end of the array of ports.
Fix this by using the source port ID for the violation, which really
is a port number.
Reported-by: Chris Healy <Chris.Healy@zii.aero>
Tested-by: Chris Healy <Chris.Healy@zii.aero>
Fixes: 65f60e4582bd ("net: dsa: mv88e6xxx: Keep ATU/VTU violation statistics")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
drivers/net/dsa/mv88e6xxx/global1_atu.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/global1_atu.c b/drivers/net/dsa/mv88e6xxx/global1_atu.c
index 5200e4bdce93..ea243840ee0f 100644
--- a/drivers/net/dsa/mv88e6xxx/global1_atu.c
+++ b/drivers/net/dsa/mv88e6xxx/global1_atu.c
@@ -314,6 +314,7 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
{
struct mv88e6xxx_chip *chip = dev_id;
struct mv88e6xxx_atu_entry entry;
+ int spid;
int err;
u16 val;
@@ -336,6 +337,8 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
if (err)
goto out;
+ spid = entry.state;
+
if (val & MV88E6XXX_G1_ATU_OP_AGE_OUT_VIOLATION) {
dev_err_ratelimited(chip->dev,
"ATU age out violation for %pM\n",
@@ -344,23 +347,23 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
if (val & MV88E6XXX_G1_ATU_OP_MEMBER_VIOLATION) {
dev_err_ratelimited(chip->dev,
- "ATU member violation for %pM portvec %x\n",
- entry.mac, entry.portvec);
- chip->ports[entry.portvec].atu_member_violation++;
+ "ATU member violation for %pM portvec %x spid %d\n",
+ entry.mac, entry.portvec, spid);
+ chip->ports[spid].atu_member_violation++;
}
if (val & MV88E6XXX_G1_ATU_OP_MISS_VIOLATION) {
dev_err_ratelimited(chip->dev,
- "ATU miss violation for %pM portvec %x\n",
- entry.mac, entry.portvec);
- chip->ports[entry.portvec].atu_miss_violation++;
+ "ATU miss violation for %pM portvec %x spid %d\n",
+ entry.mac, entry.portvec, spid);
+ chip->ports[spid].atu_miss_violation++;
}
if (val & MV88E6XXX_G1_ATU_OP_FULL_VIOLATION) {
dev_err_ratelimited(chip->dev,
- "ATU full violation for %pM portvec %x\n",
- entry.mac, entry.portvec);
- chip->ports[entry.portvec].atu_full_violation++;
+ "ATU full violation for %pM portvec %x spid %d\n",
+ entry.mac, entry.portvec, spid);
+ chip->ports[spid].atu_full_violation++;
}
mutex_unlock(&chip->reg_lock);
--
2.20.1
^ permalink raw reply related
* [PATCH]: net: hso: do not call unregister if not registered
From: Yavuz, Tuba @ 2019-02-05 23:01 UTC (permalink / raw)
To: netdev@vger.kernel.org
On an error path inside the hso_create_net_device function of the hso
driver, hso_free_net_device gets called. This causes potentially a
negative reference count in the net device if register_netdev has not
been called yet as hso_free_net_device calls unregister_netdev
regardless. I think the driver should distinguish these cases and call
unregister_netdev only if register_netdev has been called.
Signed-off-by: Tuba Yavuz <tuba@ece.ufl.edu>
---
--- linux-stable/drivers/net/usb/hso.c.orig 2019-01-27 14:45:58.232683119 -0500
+++ linux-stable/drivers/net/usb/hso.c 2019-02-05 17:54:17.056496019 -0500
@@ -2377,7 +2377,9 @@ static void hso_free_net_device(struct h
remove_net_device(hso_net->parent);
- if (hso_net->net)
+ if (hso_net->net &&
+ hso_net->net->reg_state == NETREG_REGISTERED
+ )
unregister_netdev(hso_net->net);
/* start freeing */
^ permalink raw reply
* Re: [PATCH iproute2-next] tc: full JSON support for 'bpf' actions
From: David Ahern @ 2019-02-05 22:59 UTC (permalink / raw)
To: Stephen Hemminger, Davide Caratti; +Cc: netdev
In-Reply-To: <20190205145359.74508b0e@hermes.lan>
On 2/5/19 2:53 PM, Stephen Hemminger wrote:
> On Thu, 31 Jan 2019 18:58:09 +0100
> Davide Caratti <dcaratti@redhat.com> wrote:
>
>> + print_uint(PRINT_ANY, "code", "%hu ", ops[i].code);
>> + print_uint(PRINT_ANY, "jt", "%hhu ", ops[i].jt);
>> + print_uint(PRINT_ANY, "jf", "%hhu ", ops[i].jf);
>
> Did you know that print_uint promotes the argument to unsigned int
> then you are printing it with %hhu which expects only a u8.
>
I did look at the print_hhu option and it seems really weird that you
use "print_hhu(..., "%hhu", ...)" which is why I took the patch as is.
There are existing examples of print_uint with '%hu' too.
The print_ functions really should be renamed (print_uchar,
print_ushort, etc).
^ permalink raw reply
* Re: [RFC PATCH iproute2 2/5] act_ct: first import
From: Stephen Hemminger @ 2019-02-05 22:56 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Guy Shattah, Aaron Conole, John Hurley, Simon Horman,
Justin Pettit, Gregory Rose, Eelco Chaudron, Flavio Leitner,
Florian Westphal, Jiri Pirko, Rashid Khan, Sushil Kulkarni,
Andy Gospodarek, Roi Dayan, Yossi Kuperman, Or Gerlitz,
Rony Efraim, davem@davemloft.net, netdev
In-Reply-To: <e16eb55db89c48ec883fcaf582451be05f93f384.1548287070.git.mleitner@redhat.com>
On Fri, 25 Jan 2019 00:33:30 -0200
Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> +/*
> + * m_ct.c Connection Tracking target module
> + *
> + * This program is free software; you can distribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
Please just use SPDX for license info not GPL boilerplate.
^ permalink raw reply
* Re: [xfrm, backport request] Request backport of e2612cd496e7 - set-mark backwards compatibility
From: Maciej Żenczykowski @ 2019-02-05 22:54 UTC (permalink / raw)
To: Benedict Wong, Greg Kroah-Hartman, Greg Kroah-Hartman
Cc: Linux NetDev, Nathan Harold, Lorenzo Colitti, Steffen Klassert,
Eyal Birger, tobias
In-Reply-To: <CANrj0bbjjtczU=Bk86y4hn=BX0pc0SQtKDLVwBJ4jihcC9JU2Q@mail.gmail.com>
> I propose backporting commit e2612cd496e7 ("xfrm: Make set-mark default
> behavior backward compatible") to 4.19 and 4.20 kernels to fix a backwards
> compatibility bug introduced in 9b42c1f179a6 (“xfrm: Extend the
> output_mark to support input direction and masking”).
>
> The fix is small, relatively simple, and has unit tests. :)
>
> Without this change, systems using mark-based routing on 4.19 or 4.20
> kernels will by fail to route IPsec tunnel mode packets correctly in the
> default case. This specifically affects Android devices.
Looks like it already includes a 'fixes: sha1' tag.
I'm not sure what causes these patches to get picked up for stable...
I'm guessing it's some sort of Greg-fu-style-magic...?
^ permalink raw reply
* [RFC PATCH 3/4] bpf: Charge bpf jit limit in bpf_jit_alloc_exec
From: Rick Edgecombe @ 2019-02-05 22:51 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, ard.biesheuvel, dave.hansen, kristen, Rick Edgecombe
In-Reply-To: <20190205225103.28296-1-rick.p.edgecombe@intel.com>
This moves the calls to bpf_jit_charge_modmem and
bpf_jit_uncharge_modmem to bpf_jit_alloc_exec so the behavior can be
overridden for arch specific capabilities, for example if the arch
supports for allocating outside the module space.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
include/linux/filter.h | 3 +++
kernel/bpf/core.c | 20 +++++++++++---------
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ad106d845b22..33c0ae5990e1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -859,6 +859,9 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
unsigned int alignment,
bpf_jit_fill_hole_t bpf_fill_ill_insns);
void bpf_jit_binary_free(struct bpf_binary_header *hdr);
+int bpf_jit_charge_modmem(u32 pages);
+void bpf_jit_uncharge_modmem(u32 pages);
+
void bpf_jit_free(struct bpf_prog *fp);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f908b9356025..ce08a09839c2 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -722,7 +722,7 @@ static int __init bpf_jit_charge_init(void)
}
pure_initcall(bpf_jit_charge_init);
-static int bpf_jit_charge_modmem(u32 pages)
+int bpf_jit_charge_modmem(u32 pages)
{
if (atomic_long_add_return(pages, &bpf_jit_current) >
(bpf_jit_limit >> PAGE_SHIFT)) {
@@ -735,14 +735,22 @@ static int bpf_jit_charge_modmem(u32 pages)
return 0;
}
-static void bpf_jit_uncharge_modmem(u32 pages)
+void bpf_jit_uncharge_modmem(u32 pages)
{
atomic_long_sub(pages, &bpf_jit_current);
}
void *__weak bpf_jit_alloc_exec(unsigned long size)
{
- return module_alloc(size);
+ void *ret;
+ u32 pages = size / PAGE_SIZE;
+
+ if (bpf_jit_charge_modmem(pages))
+ return NULL;
+ ret = module_alloc(size);
+ if (!ret)
+ bpf_jit_uncharge_modmem(pages);
+ return ret;
}
void __weak bpf_jit_free_exec(void *addr)
@@ -765,13 +773,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
pages = size / PAGE_SIZE;
- if (bpf_jit_charge_modmem(pages))
- return NULL;
hdr = bpf_jit_alloc_exec(size);
- if (!hdr) {
- bpf_jit_uncharge_modmem(pages);
- return NULL;
- }
/* Fill space with illegal/arch-dep instructions. */
bpf_fill_ill_insns(hdr, size);
--
2.17.1
^ permalink raw reply related
* [RFC PATCH 2/4] bpf, x64: Increase distance for bpf calls
From: Rick Edgecombe @ 2019-02-05 22:51 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, ard.biesheuvel, dave.hansen, kristen, Rick Edgecombe
In-Reply-To: <20190205225103.28296-1-rick.p.edgecombe@intel.com>
This allows for BPF calls to be emitted that call further than the
relative call range. When a call cannot be emitted as a relative call it
is emitted as a full indirect call.
The image has to be allocated in order to compute the distance of the
call in order to know what type of call to be emitted. The two types of
calls have different sizes, so this requires allowing the image to shrink
further after image is created. After the image is allocated two more
passes are needed, one to get the new sizes and the other to set the
final offsets.
So the algorithm in bpf_int_jit_compile is changed to always do at least
2 more passes after the program converges once. The old check inside the
loop that verified if the program length changed after the image was
created now looks for the same scenario by checking if the image ever
converged a second time after if the maximum passes were reached.
In the case of retpoline, the call needs to be made through a retpoline
sequence. This sequence is emitted at the end of the program so that it
can be re-used by multiple calls.
This change is intended to not change the emitted code that is actually
executed in the case of using the module space, however the allocation
may be larger at the end when using retpoline due the thunk emitted at
the end.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
arch/x86/net/bpf_jit_comp.c | 117 +++++++++++++++++++++++++++++-------
1 file changed, 94 insertions(+), 23 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5542303c43d9..c9781d471e31 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -13,6 +13,7 @@
#include <linux/filter.h>
#include <linux/if_vlan.h>
#include <linux/bpf.h>
+#include <linux/moduleloader.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
@@ -408,13 +409,17 @@ static void emit_mov_reg(u8 **pprog, bool is64, u32 dst_reg, u32 src_reg)
*pprog = prog;
}
+#define RETPOL_THUNK_SIZE ((IS_ENABLED(CONFIG_RETPOLINE) \
+ * RETPOLINE_RBX_BPF_JIT_CALL_SIZE) + 1)
+
static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
int oldproglen, struct jit_context *ctx)
{
struct bpf_insn *insn = bpf_prog->insnsi;
int insn_cnt = bpf_prog->len;
bool seen_exit = false;
- u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ const int retpol_thunk = oldproglen - RETPOL_THUNK_SIZE;
+ u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY + RETPOL_THUNK_SIZE];
int i, cnt = 0;
int proglen = 0;
u8 *prog = temp;
@@ -430,7 +435,6 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
s64 jmp_offset;
u8 jmp_cond;
int ilen;
- u8 *func;
switch (insn->code) {
/* ALU */
@@ -856,16 +860,49 @@ xadd: if (is_imm8(insn->off))
/* call */
case BPF_JMP | BPF_CALL:
- func = (u8 *) __bpf_call_base + imm32;
- jmp_offset = func - (image + addrs[i]);
- if (!imm32 || !is_simm32(jmp_offset)) {
- pr_err("unsupported BPF func %d addr %p image %p\n",
- imm32, func, image);
- return -EINVAL;
+ {
+ bool func_addr_fixed;
+ u64 func_addr;
+ int ret = bpf_jit_get_func_addr(bpf_prog, insn,
+ !image, &func_addr, &func_addr_fixed);
+ if (ret < 0)
+ return ret;
+
+ jmp_offset = func_addr - (u64)(image + addrs[i]);
+
+ /*
+ * Need to know the allocation location before we can
+ * know whether we can use a relative call or not.
+ *
+ * Always emit indirect version until we have the image
+ * so the length will shrink.
+ */
+ if (image && (imm32 && is_simm32(jmp_offset))) {
+ /* Emit relative call */
+
+ EMIT1_off32(0xE8, jmp_offset);
+ } else {
+ /* Emit indirect call */
+
+ EMIT1(0x53); /* push rbx */
+ emit_mov_imm64(&prog, BPF_REG_6,
+ (u32)(((u64)func_addr) >> 32),
+ (u32)(u64)func_addr);
+ /* -1 to account for pop rbx below */
+ jmp_offset = retpol_thunk - (addrs[i] - 1);
+
+ /*
+ * If retpoline, jump to retpoline thunk, or
+ * else emit the default indirect call version.
+ */
+ if (IS_ENABLED(CONFIG_RETPOLINE))
+ EMIT1_off32(0xE8, jmp_offset);
+ else
+ RETPOLINE_RBX_BPF_JIT_CALL();
+ EMIT1(0x5B); /* pop rbx */
}
- EMIT1_off32(0xE8, jmp_offset);
break;
-
+ }
case BPF_JMP | BPF_TAIL_CALL:
emit_bpf_tail_call(&prog);
break;
@@ -1049,6 +1086,27 @@ xadd: if (is_imm8(insn->off))
addrs[i] = proglen;
prog = temp;
}
+
+ /*
+ * If this allocation is far from the kernel text, we may need
+ * to do a retpoline call. Embed the chunk here so it can be
+ * re-used by multiple calls.
+ */
+ if (IS_ENABLED(CONFIG_RETPOLINE)) {
+ int new_len = RETPOLINE_RBX_BPF_JIT_CALL_SIZE + 1;
+
+ RETPOLINE_RBX_BPF_JIT_CALL();
+ EMIT1(0xC3); /* ret */
+
+ if (image) {
+ if (unlikely(proglen + new_len > oldproglen)) {
+ pr_err("bpf_jit: fatal error\n");
+ return -EFAULT;
+ }
+ memcpy(image + proglen, temp, new_len);
+ }
+ proglen += new_len;
+ }
return proglen;
}
@@ -1073,6 +1131,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
int *addrs;
int pass;
int i;
+ int converged_pass;
if (!prog->jit_requested)
return orig_prog;
@@ -1127,10 +1186,16 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
/*
* JITed image shrinks with every pass and the loop iterates
* until the image stops shrinking. Very large BPF programs
- * may converge on the last pass. In such case do one more
- * pass to emit the final image.
+ * may converge on the last pass. In such case one more
+ * pass is needed to emit the final image.
+ *
+ * After the image memory is allocated there needs to be at least 2 more
+ * passes as the emitted calls can shrink if they are in relative range.
+ * The logic is, we always do at least 2 more passes after the image
+ * converges once.
*/
- for (pass = 0; pass < 20 || image; pass++) {
+ converged_pass = 1;
+ for (pass = 0; pass < 22 && pass < converged_pass + 2; pass++) {
proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
if (proglen <= 0) {
out_image:
@@ -1140,26 +1205,32 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
prog = orig_prog;
goto out_addrs;
}
- if (image) {
- if (proglen != oldproglen) {
- pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
- proglen, oldproglen);
- goto out_image;
- }
- break;
- }
- if (proglen == oldproglen) {
+
+ if (proglen != oldproglen)
+ converged_pass = pass + 1;
+
+ if (proglen == oldproglen && !image) {
header = bpf_jit_binary_alloc(proglen, &image,
- 1, jit_fill_hole);
+ 1, jit_fill_hole);
if (!header) {
prog = orig_prog;
goto out_addrs;
}
}
+
oldproglen = proglen;
cond_resched();
}
+ /* See if the image ever converged */
+ if (image) {
+ if (proglen != oldproglen) {
+ pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
+ proglen, oldproglen);
+ goto out_image;
+ }
+ }
+
if (bpf_jit_enable > 1)
bpf_jit_dump(prog->len, proglen, pass + 1, image);
--
2.17.1
^ permalink raw reply related
* [RFC PATCH 4/4] bpf, x64: Enable unprivlidged jit in vmalloc
From: Rick Edgecombe @ 2019-02-05 22:51 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, ard.biesheuvel, dave.hansen, kristen, Rick Edgecombe
In-Reply-To: <20190205225103.28296-1-rick.p.edgecombe@intel.com>
This enables unprivlidged JIT allocations to be made in vmalloc space
when the bpf jit limit is exceeded.
The logic is we use module space unless it is full or we are not
CAP_SYS_ADMIN and bpf_jit_limit is exceeded, in which case we use vmalloc
space. So vmalloc is only used when either the insertion would fail, or
BPF would fallback to the interpreter.
In the case of using vmalloc, it is not charged against bpf_jit_limit.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
arch/x86/net/bpf_jit_comp.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c9781d471e31..66d2b32a1db1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1118,6 +1118,38 @@ struct x64_jit_data {
struct jit_context ctx;
};
+void *bpf_jit_alloc_exec(unsigned long size)
+{
+ void *ret;
+ u32 pages = size / PAGE_SIZE;
+
+ /*
+ * The logic is we use module space unless it is full or we are not
+ * CAP_SYS_ADMIN and bpf_jit_limit is exceeded, in which case we use
+ * vmalloc space.
+ */
+ if (bpf_jit_charge_modmem(pages))
+ return vmalloc_exec(size);
+
+ ret = module_alloc(size);
+
+ if (!ret) {
+ bpf_jit_uncharge_modmem(pages);
+ /* If module space is full, try vmalloc */
+ return vmalloc_exec(size);
+ }
+
+ return ret;
+}
+
+void bpf_jit_free_exec(void *addr)
+{
+ if (is_vmalloc_addr(addr))
+ vfree(addr);
+ else
+ module_memfree(addr);
+}
+
struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
{
struct bpf_binary_header *header = NULL;
--
2.17.1
^ permalink raw reply related
* [RFC PATCH 1/4] bpf, x64: Implement BPF call retpoline
From: Rick Edgecombe @ 2019-02-05 22:51 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, ard.biesheuvel, dave.hansen, kristen, Rick Edgecombe
In-Reply-To: <20190205225103.28296-1-rick.p.edgecombe@intel.com>
Add x86 call retpoline sequence from the "Intel Retpoline: A Branch Target
Injection Mitigation White Paper" for BPF JIT compiler. Unlike the paper
it uses RBX instead of RAX since RAX is part of the BPF calling
convetions.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
arch/x86/include/asm/nospec-branch.h | 45 ++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index dad12b767ba0..70b3f6534134 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -320,6 +320,8 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
#endif /* __ASSEMBLY__ */
+#ifdef CONFIG_RETPOLINE
+# ifdef CONFIG_X86_64
/*
* Below is used in the eBPF JIT compiler and emits the byte sequence
* for the following assembly:
@@ -341,8 +343,6 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
* jmp *%rax for x86_64
* jmp *%edx for x86_32
*/
-#ifdef CONFIG_RETPOLINE
-# ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 17
# define RETPOLINE_RAX_BPF_JIT() \
do { \
@@ -355,6 +355,44 @@ do { \
EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
EMIT1(0xC3); /* retq */ \
} while (0)
+
+/* Modified from "Intel Retpoline: A Branch Target Injection Mitigation White
+ * Paper" to use RBX register in order to not intefere with BPF calling
+ * conventions:
+ *
+ * jmp label2
+ *
+ * label0:
+ * call label1
+ *
+ * capture_ret_spec:
+ * pause
+ * lfence
+ * jmp capture_ret_spec
+ *
+ * label1:
+ * mov %rbx,(%rsp)
+ * ret
+ *
+ * label2:
+ * call label0
+ */
+# define RETPOLINE_RBX_BPF_JIT_CALL_SIZE 24
+# define RETPOLINE_RBX_BPF_JIT_CALL() \
+do { \
+ EMIT2(0xEB, 0x11); /* jump label2 */ \
+ /* label2: */ \
+ EMIT1_off32(0xE8, 7); /* call label1 */ \
+ /* capture_ret_spec: */ \
+ EMIT2(0xF3, 0x90); /* pause */ \
+ EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \
+ EMIT2(0xEB, 0xF9); /* jmp capture_ret_spec */ \
+ /* label1: */ \
+ EMIT4(0x48, 0x89, 0x1c, 0x24); /* mov %rbx,(%rsp) */ \
+ EMIT1(0xC3); /* ret */ \
+ /* label2: */ \
+ EMIT1_off32(0xE8, -22); /* call label0 */ \
+} while (0)
# else /* !CONFIG_X86_64 */
# define RETPOLINE_EDX_BPF_JIT() \
do { \
@@ -373,6 +411,9 @@ do { \
# define RETPOLINE_RAX_BPF_JIT_SIZE 2
# define RETPOLINE_RAX_BPF_JIT() \
EMIT2(0xFF, 0xE0); /* jmp *%rax */
+# define RETPOLINE_RBX_BPF_JIT_CALL_SIZE 2
+# define RETPOLINE_RBX_BPF_JIT_CALL() \
+ EMIT2(0xff, 0xD3) /* call *%rbx */
# else /* !CONFIG_X86_64 */
# define RETPOLINE_EDX_BPF_JIT() \
EMIT2(0xFF, 0xE2) /* jmp *%edx */
--
2.17.1
^ permalink raw reply related
* [RFC PATCH 0/4] Initial support for allocating BPF JITs in vmalloc for x86
From: Rick Edgecombe @ 2019-02-05 22:50 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, ard.biesheuvel, dave.hansen, kristen, Rick Edgecombe
This introduces a new capability for BPF program JIT's to be located in vmalloc
space on x86_64. This can serve as a backup area for CONFIG_BPF_JIT_ALWAYS_ON in
case an unprivileged app uses all of the module space allowed by bpf_jit_limit.
In order to allow for calls from the increased distance of vmalloc from
kernel/module space, relative calls are emitted as full indirect calls if the
maximum relative call distance is exceeded. So the resulting performance of call
BPF instructions in this case is similar to the BPF interpreter.
Below is the results for the BPF unit test "BPF_MAXINSNS: Call heavy
transformations":
Config Duration 1 Duration 2
--------------------------------------------------
JIT Modules 13304 14224
JIT Vmalloc 33711 26877
Interpreter 31931 28180
Retpoline JIT Modules 14517 13375
Retpoline JIT Vmalloc 65739 60281
Retpoline Interpreter 63097 61141
Since this benchmark is made up of only calls to the kernel text, it suggests
real world performance shouldn't ever be worse than the interpreter.
The logic for when to use vmalloc, is that we use module space unless it is
full, or, we are not CAP_SYS_ADMIN and bpf_jit_limit is exceeded. So vmalloc is
only used when either the insertion would fail, or BPF would fallback to the
interpreter. So there should be no run time regression when used in these
scenarios.
Todo:
- This passes all the unit tests, but could use further verification that the
compiler will still always converge.
- Fix ARM after "bpf: Charge bpf jit limit in bpf_jit_alloc_exec" change
- Update documentation for bpf_jit_limit
Possible future optimizations:
- Look at re-mapping some text in vmalloc so that calls can be in relative jump
range. For example, a BPF library program could maybe be re-mapped multiple
times so that a copy is always near the caller and so we could use the faster
calls.
Rick Edgecombe (4):
bpf, x64: Implement BPF call retpoline
bpf, x64: Increase distance for bpf calls
bpf: Charge bpf jit limit in bpf_jit_alloc_exec
bpf, x64: Enable unprivlidged jit in vmalloc
arch/x86/include/asm/nospec-branch.h | 45 +++++++-
arch/x86/net/bpf_jit_comp.c | 149 ++++++++++++++++++++++-----
include/linux/filter.h | 3 +
kernel/bpf/core.c | 20 ++--
4 files changed, 183 insertions(+), 34 deletions(-)
--
2.17.1
^ permalink raw reply
* Re: [PATCH iproute2-next] tc: full JSON support for 'bpf' actions
From: Stephen Hemminger @ 2019-02-05 22:53 UTC (permalink / raw)
To: Davide Caratti; +Cc: David Ahern, netdev
In-Reply-To: <c1bb8538298d24e302ef0e84ead3bfb97f2438dc.1548957196.git.dcaratti@redhat.com>
On Thu, 31 Jan 2019 18:58:09 +0100
Davide Caratti <dcaratti@redhat.com> wrote:
> + print_uint(PRINT_ANY, "code", "%hu ", ops[i].code);
> + print_uint(PRINT_ANY, "jt", "%hhu ", ops[i].jt);
> + print_uint(PRINT_ANY, "jf", "%hhu ", ops[i].jf);
Did you know that print_uint promotes the argument to unsigned int
then you are printing it with %hhu which expects only a u8.
^ permalink raw reply
* Re: [PATCH iproute2-next] Introduce ip-brctl shell script
From: Stephen Hemminger @ 2019-02-05 22:50 UTC (permalink / raw)
To: Roopa Prabhu
Cc: Stefano Brivio, Nikolay Aleksandrov, David Ahern, Phil Sutter,
Eric Garver, Tomas Dolezal, Lennert Buytenhek, netdev
In-Reply-To: <CAJieiUiMhpE6ALM4Oiq1V3nGxBj4fo5-Wj8Y5Vb3Mqy6xA+YBA@mail.gmail.com>
On Thu, 31 Jan 2019 08:28:29 -0800
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On Thu, Jan 31, 2019 at 4:46 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 30 Jan 2019 14:30:59 -0800
> > Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> >
> > > On Sun, Jan 27, 2019 at 11:57 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > They can't replace brctl not because they are badly designed or
> > > > unusable, but simply because they are different tools with different
> > > > purposes (see also my comments to Nikolay).
> > >
> > > I don't think i understand that they are different tools. The new netlink tools
> > > are supposed to deprecate the old tools that use ioctls. this is the same reason
> > > we don't have a ip-ifconfig today
> >
> > It's not just ioctl vs. netlink: brctl operates on bridges, whereas
> > ip-link and 'bridge' operate on generic links. If you look at
> > cmd_showstp() and cmd_show() from my script (I wouldn't recommend that
> > right after breakfast ;)) that should be apparent.
> >
> > If you want to get basic information about a bridge, with ip-link or
> > 'bridge' you need multiple commands. This is very different from
> > ifconfig: there's no such ifconfig command that can't be implemented as
> > a single ip-link command.
> >
> > > > > So, I think its better to fix the first two instead of introducing
> > > > > another one.
> > > >
> > > > This is really not the same thing: I'm not introducing a new tool, I'm
> > > > effectively replacing a 1794-LoC, non-trivial, ioctl-based
> > > > implementation with a trivial, 572-lines shell script, with half
> > > > the binary size.
> > >
> > > you are replacing a ioctl-based tool from another package into
> > > iproute2..and maybe there-by deprecating the netlink based tool ? :).
> >
> > Oh, I see your point here. Well, it's not deprecating anything because
> > I'm using the netlink tool in the wrapper (which allows it to be
> > rather dumb and simple), but sure, I understand what you mean.
> >
> > Still, we could very easily get the wrapper to print equivalent ip-link
> > and bridge commands before executing them -- and this would actually
> > ease and encourage migration, compared to the current situation.
> >
> > > We are in opposite camps, I strongly want people to move to bridge and
> > > ip link which has all the latest support.
> >
> > I wouldn't say we are in opposite camps, I just think that whatever has
> > been done so far to get people to switch hasn't worked.
>
> agreed. I think not much has been done to make output prettier for
> humans. The detailed output especially is very tricky to read.
> I think we should fix that.
>
> >
> > > > > Can we extend 'bridge' tool with extra options to provide a summary
> > > > > view of all bridges like brctl ?
> > > >
> > > > We could, and I initially thought of that approach instead, but that
> > > > has a number of fundamental downsides:
> > > >
> > > > - we can't provide a brctl-compatible syntax, unless we want to
> > > > substantially rewrite the 'bridge' interface, and I think it's a
> > > > bad idea to break 'bridge' syntax for users, while we won't be able to
> > > > replace brctl if we don't provide a similar syntax, history showed
> > >
> > > I am certainly not suggesting we break existing bridge users. I am
> > > talking about new options.
> >
> > Let's take 'brctl showstp br0'. I wouldn't even know where to start
> > adding options to 'bridge' to have a similar functionality (any idea?),
> > and still, that brctl command is very convenient.
>
>
> I don't see a reason not to have 'bridge stp show' (similar to bridge
> vlan show).
> Maybe there are better ways to represent the same thing.
>
>
> >
> > > I understand some people are finding it hard to move away from brctl
> > > output, but in my experience,
> > > these are also the people who want new things in brctl like json
> > > output etc. which is already available in the bridge command
> >
> > Not in my experience: the output of brctl showmacs and showstp commands
> > is human-readable, that's what a large category of users need. JSON is
> > well suited for other purposes.
>
> For showmacs i can tell you, we have been able to move every one to
> 'bridge fdb show' because beyond certain point brctl showmacs is not
> very useful. bridge fdb show on the other hand covers vxlan etc.
>
> >
> > > > - the fdb implementation has a long-dated comment by Stephen in its
> > > > header,
> > > > * TODO: merge/replace this with ip neighbour
> > > > and this is actually the only part of 'bridge' I'm using in ip-brctl.
> > > > Code is conceptually duplicated there, and I think we should actually
> > > > get rid of that -- but then 'bridge' wouldn't even give information
> > > > about the FDB, one would need to use ip neighbour instead.
> > >
> > > This could be comment from initial days. Today bridge has support for
> > > fdb, vlans and vlan tunnels which you
> > > cannot get from brctl and any brctl compat tool.
> >
> > This is unrelated. I'm specifically referring to the fdb information,
> > which is similar to what ip neighbour displays, and which you can
> > indeed get with brctl, see cmd_showmacs() from my script.
>
> I understood what you are saying. But there is no point moving fdb to
> neigh now because there is so much around it 'bridge fdb show, 'bridge
> monitor fdb' etc)
>
>
> >
> > > > - 'bridge' doesn't implement settings for basic bridge features (say,
> > > > STP), which are convenient for users, especially if they are used to
> > > > brctl. To get that, even at an interface/syntax level, we would need
> > > > to duplicate some parts of ip-link, which looks like a bad idea per
> > > > se.
> > >
> > > thats fine IMO. Today ip link set extended bridge attribute support is
> > > only for convenience.
> > >
> > > You can set most attributes both from ip link set and bridge link
> > > command. We can see if they can share code.
> >
> > *This* is exactly what looks confusing to me.
> >
> > > You can set a vlan on a bridge today via the bridge command. I dont
> > > see why we should hesitate about STP here.
> > > And you will get the json output for free.
> >
> > I'm not particularly interested in non-human users here, not in the
> > context of this discussion at least.
>
> sure, i am with u for better human readable output.
>
> >
> > > > > Its supposed to be the netlink based tool for all bridging and hence
> > > > > could be a good replacement for all brctl users.
> > > >
> > > > I still think the best replacement for users is the one that changes
> > > > absolutely nothing, and if that's easily achievable, I'd rather go for
> > > > it.
> > >
> > > That would also mean we add ip-ifconfig and ip-ethtool (if we
> > > deprecate ethtool tomorrow. i am not saying its going away....,
> > > but just giving you an example of ioctl to netlink based tools).
> >
> > Again, it's not the same as ifconfig, for the reasons I mentioned
> > above. And by the way I think ifconfig doesn't even vaguely have the
> > same amount of users of brctl nowadays, and for a reason, but I would
> > only have anecdotal experience to support this.
> >
> > --
> > Stefano
my $.02 mostly agrees with Roopa.
The code for bridge was written as part of the "lets make bridge code
use netlink" and tried as much as possible to look like other tools.
Providing brctl or ifconfig scripts is possible, but it really should
be put in a sample or demo directory and not installed by default.
It would just cause too much pain to distributions.
I love concise human readable output and hate long winded VMS style
commands.
Remember iproute has always had output formats that match input commands.
This was even required by some VPN tools in the past.
^ permalink raw reply
* Pull patches from tip/perf/core to bpf-next
From: Song Liu @ 2019-02-05 22:47 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov; +Cc: Kernel Team, Netdev
Hi Alexei and Daniel,
The following patches are required for BPF introspection in perf tools.
Please pull them to bpf-next, so that we get all the dependencies in one
tree.
Thanks,
Song
(from 1/10 to 10/10)
76193a94522f perf, bpf: Introduce PERF_RECORD_KSYMBOL
d764ac646491 tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
6ee52e2a3fe4 perf, bpf: Introduce PERF_RECORD_BPF_EVENT
df063c83aa2c tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
9aa0bfa370b2 perf tools: Handle PERF_RECORD_KSYMBOL
45178a928a4b perf tools: Handle PERF_RECORD_BPF_EVENT
7b612e291a5a perf tools: Synthesize PERF_RECORD_* for loaded BPF programs
a40b95bcd30c perf top: Synthesize BPF events for pre-existing loaded BPF programs
6934058d9fb6 bpf: Add module name [bpf] to ksymbols for bpf programs
811184fb6977 perf bpf: Fix synthesized PERF_RECORD_KSYMBOL/BPF_EVENT
^ permalink raw reply
* Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
From: Jason Gunthorpe @ 2019-02-05 22:36 UTC (permalink / raw)
To: Håkon Bugge
Cc: David S . Miller, netdev, linux-rdma, rds-devel, linux-kernel,
Jack Morgenstein
In-Reply-To: <20190131170951.178676-1-haakon.bugge@oracle.com>
On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
> Using CX-3 virtual functions, either from a bare-metal machine or
> pass-through from a VM, MAD packets are proxied through the PF driver.
>
> Since the VMs have separate name spaces for MAD Transaction Ids
> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
> in a cache.
>
> Following the RDMA CM protocol, it is clear when an entry has to
> evicted form the cache. But life is not perfect, remote peers may die
> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when
> the PF driver assumes the remote peer has gone.
>
> We have experienced excessive amount of DREQ retries during fail-over
> testing, when running with eight VMs per database server.
>
> The problem has been reproduced in a bare-metal system using one VM
> per physical node. In this environment, running 256 processes in each
> VM, each process uses RDMA CM to create an RC QP between himself and
> all (256) remote processes. All in all 16K QPs.
>
> When tearing down these 16K QPs, excessive DREQ retries (and
> duplicates) are observed. With some cat/paste/awk wizardry on the
> infiniband_cm sysfs, we observe:
>
> dreq: 5007
> cm_rx_msgs:
> drep: 3838
> dreq: 13018
> rep: 8128
> req: 8256
> rtu: 8256
> cm_tx_msgs:
> drep: 8011
> dreq: 68856
> rep: 8256
> req: 8128
> rtu: 8128
> cm_tx_retries:
> dreq: 60483
>
> Note that the active/passive side is distributed.
>
> Enabling pr_debug in cm.c gives tons of:
>
> [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
> 1,sl_cm_id: 0xd393089f} is NULL!
>
> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
> tear-down phase of the application is reduced from 113 to 67
> seconds. Retries/duplicates are also significantly reduced:
>
> cm_rx_duplicates:
> dreq: 7726
> []
> cm_tx_retries:
> drep: 1
> dreq: 7779
>
> Increasing the timeout further didn't help, as these duplicates and
> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
> numbers fell down to about one hundred for both of them.
>
> Adjustment of the CMA timeout is _not_ part of this commit.
>
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
> drivers/infiniband/hw/mlx4/cm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Jack? What do you think?
> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
> index fedaf8260105..8c79a480f2b7 100644
> --- a/drivers/infiniband/hw/mlx4/cm.c
> +++ b/drivers/infiniband/hw/mlx4/cm.c
> @@ -39,7 +39,7 @@
>
> #include "mlx4_ib.h"
>
> -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ)
> +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ)
>
> struct id_map_entry {
> struct rb_node node;
> --
> 2.20.1
>
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] btf: separate btf creation and loading
From: Andrii Nakryiko @ 2019-02-05 22:34 UTC (permalink / raw)
To: Yonghong Song
Cc: Andrii Nakryiko, Song Liu, Alexei Starovoitov, Martin Lau,
netdev@vger.kernel.org, daniel@iogearbox.net
In-Reply-To: <a6975dcd-58fa-c70c-dcb4-7894a179b754@fb.com>
On Tue, Feb 5, 2019 at 1:25 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/5/19 11:48 AM, Andrii Nakryiko wrote:
> > This change splits out previous btf__new functionality of constructing
> > struct btf and loading it into kernel into two:
> > - btf__new() just creates and initializes struct btf
> > - btf__load() attempts to load existing struct btf into kernel
> >
> > btf__free will still close BTF fd, if it was ever loaded successfully
> > into kernel.
> >
> > This change allows users of libbpf to manipulate BTF using its API,
> > without the need to unnecessarily load it into kernel.
> >
> > One of the intended use cases is pahole using libbpf to do DWARF to BTF
> > conversion and deduplication using libbpf, while handling ELF sections
> > overwrites and other concerns on its own.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > ---
> > tools/lib/bpf/btf.c | 53 ++++++++++++++++++++++------------------
> > tools/lib/bpf/btf.h | 1 +
> > tools/lib/bpf/libbpf.c | 2 +-
> > tools/lib/bpf/libbpf.map | 1 +
> > 4 files changed, 32 insertions(+), 25 deletions(-)
> >
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index 4949f8840bda..065d51fa63e5 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -366,8 +366,6 @@ void btf__free(struct btf *btf)
> >
> > struct btf *btf__new(__u8 *data, __u32 size)
> > {
> > - __u32 log_buf_size = 0;
> > - char *log_buf = NULL;
> > struct btf *btf;
> > int err;
> >
> > @@ -377,15 +375,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
> >
> > btf->fd = -1;
> >
> > - log_buf = malloc(BPF_LOG_BUF_SIZE);
> > - if (!log_buf) {
> > - err = -ENOMEM;
> > - goto done;
> > - }
> > -
> > - *log_buf = 0;
> > - log_buf_size = BPF_LOG_BUF_SIZE;
> > -
> > btf->data = malloc(size);
> > if (!btf->data) {
> > err = -ENOMEM;
> > @@ -395,17 +384,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
> > memcpy(btf->data, data, size);
> > btf->data_size = size;
> >
> > - btf->fd = bpf_load_btf(btf->data, btf->data_size,
> > - log_buf, log_buf_size, false);
> > -
> > - if (btf->fd == -1) {
> > - err = -errno;
> > - pr_warning("Error loading BTF: %s(%d)\n", strerror(errno), errno);
> > - if (log_buf && *log_buf)
> > - pr_warning("%s\n", log_buf);
> > - goto done;
> > - }
> > -
> > err = btf_parse_hdr(btf);
> > if (err)
> > goto done;
> > @@ -417,8 +395,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
> > err = btf_parse_type_sec(btf);
> >
> > done:
> > - free(log_buf);
> > -
> > if (err) {
> > btf__free(btf);
> > return ERR_PTR(err);
> > @@ -427,6 +403,35 @@ struct btf *btf__new(__u8 *data, __u32 size)
> > return btf;
> > }
> >
> > +int btf__load(struct btf* btf) {
> > + __u32 log_buf_size = BPF_LOG_BUF_SIZE;
> > + char *log_buf = NULL;
> > +
> > + if (btf->fd >= 0) {
> > + return -EEXIST;
> > + }
> > +
> > + log_buf = malloc(log_buf_size);
> > + if (!log_buf)
> > + return -ENOMEM;
> > +
> > + *log_buf = 0;
> > +
> > + btf->fd = bpf_load_btf(btf->data, btf->data_size,
> > + log_buf, log_buf_size, false);
> > + if (btf->fd < 0) {
> > + btf->fd = -errno;
>
> Why you set btf->fd = -errno? Do you have any intended use for it later.
> If not, I would still prefer the existing value -1. This will be
> consistent with all other fd field convention in libbpf.
I though it would be convenient to get created fd back from
btf_load(), but I don't have strong preferences. In any case, we have
btf__get_fd() to get it back, so I will change btf__load() to return
just error code.
>
> > + pr_warning("Error loading BTF: %s(%d)\n", strerror(errno), errno);
> > + if (*log_buf)
> > + pr_warning("%s\n", log_buf);
> > + goto done;
> > + }
> > +
> > +done:
> > + free(log_buf);
> > + return btf->fd;
> > +}
> > +
> > int btf__fd(const struct btf *btf)
> > {
> > return btf->fd;
> > diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
> > index 25a9d2db035d..e8410887f93a 100644
> > --- a/tools/lib/bpf/btf.h
> > +++ b/tools/lib/bpf/btf.h
> > @@ -57,6 +57,7 @@ struct btf_ext_header {
> >
> > LIBBPF_API void btf__free(struct btf *btf);
> > LIBBPF_API struct btf *btf__new(__u8 *data, __u32 size);
> > +LIBBPF_API int btf__load(struct btf* btf);
> > LIBBPF_API __s32 btf__find_by_name(const struct btf *btf,
> > const char *type_name);
> > LIBBPF_API __u32 btf__get_nr_types(const struct btf *btf);
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 47969aa0faf8..75b82c1cfc72 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -835,7 +835,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
> > obj->efile.maps_shndx = idx;
> > else if (strcmp(name, BTF_ELF_SEC) == 0) {
> > obj->btf = btf__new(data->d_buf, data->d_size);
> > - if (IS_ERR(obj->btf)) {
> > + if (IS_ERR(obj->btf) || btf__load(obj->btf) < 0) {
> > pr_warning("Error loading ELF section %s: %ld. Ignored and continue.\n",
> > BTF_ELF_SEC, PTR_ERR(obj->btf));
> > obj->btf = NULL;
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index 89c1149e32ee..ffa1fe044f6a 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -134,6 +134,7 @@ LIBBPF_0.0.2 {
> > bpf_object__find_map_fd_by_name;
> > bpf_get_link_xdp_id;
> > btf__dedup;
> > + btf__load;
>
> Maybe put btf__load after btf__get_strings based on alphabetical order.
> With the above changes,
Yep, will fix to be alphabetically sorted.
>
>
> > btf__get_map_kv_tids;
> > btf__get_nr_types;
> > btf__get_strings;
> >
^ permalink raw reply
* [PATCH bpf-next] tools/bpf: fix a selftest test_btf failure
From: Yonghong Song @ 2019-02-05 22:28 UTC (permalink / raw)
To: Andrii Nakryiko, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song
Commit 9c651127445c ("selftests/btf: add initial BTF dedup tests")
added dedup tests in test_btf.c.
It broke the raw test:
BTF raw test[71] (func proto (Bad arg name_off)):
btf_raw_create:2905:FAIL Error getting string #65535, strs_cnt:1
The test itself encodes invalid func_proto parameter name
offset 0xffffFFFF as a negative test for the kernel.
The above commit changed the meaning of that offset and
resulted in a user space error.
#define NAME_NTH(N) (0xffff0000 | N)
#define IS_NAME_NTH(X) ((X & 0xffff0000) == 0xffff0000)
#define GET_NAME_NTH_IDX(X) (X & 0x0000ffff)
Currently, the kernel permits maximum name offset 0xffff.
Set the test name off as 0x0fffFFFF to trigger the kernel
verification failure.
Cc: Andrii Nakryiko <andriin@fb.com>
Fixes: 9c651127445c ("selftests/btf: add initial BTF dedup tests")
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/testing/selftests/bpf/test_btf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index 30c3edde7e07..447acc34db94 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -1978,7 +1978,7 @@ static struct btf_raw_test raw_tests[] = {
/* void (*)(int a, unsigned int <bad_name_off>) */
BTF_FUNC_PROTO_ENC(0, 2), /* [3] */
BTF_FUNC_PROTO_ARG_ENC(NAME_TBD, 1),
- BTF_FUNC_PROTO_ARG_ENC(0xffffffff, 2),
+ BTF_FUNC_PROTO_ARG_ENC(0x0fffffff, 2),
BTF_END_RAW,
},
.str_sec = "\0a",
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v1] net: dsa: qca8k: implement DT-based ports <-> phy translation
From: Florian Fainelli @ 2019-02-05 22:29 UTC (permalink / raw)
To: Christian Lamparter, Andrew Lunn; +Cc: netdev, Vivien Didelot
In-Reply-To: <2061891.QOleSDBFsG@debian64>
On 2/5/19 2:12 PM, Christian Lamparter wrote:
> On Tuesday, February 5, 2019 10:29:34 PM CET Andrew Lunn wrote:
>>> For now, I added the DT binding update to the patch as well.
>>> But if this is indeed the way to go, it'll get a separate patch.
>>
>> Hi Christian
>>
>> You need to be careful with the DT binding. You need to keep backwards
>> compatible with it. An old DT blob needs to keep working. I don't
>> think this is true with this change.
>
> Do you mean because of the
>
> - switch0@0 {
> + switch@10 {
> compatible = "qca,qca8337";
> #address-cells = <1>;
> #size-cells = <0>;
>
> - reg = <0>;
> + reg = <0x10>;
>
> change?
>
> or because I removed the phy-handles?>
> The reg = <0x10>; will be necessary regardless. Because this
> is really a bug in the existing binding example and if it is
> copied it will prevent the qca8k driver from loading.
> This is due to a resource conflict, because there will be
> already a "phy_port1: phy@0" registered at reg = <0>;
> So this never worked would have worked.
That part is fine, it is the removal of the phy-handle properties that
is possibly a problem, but in hindsight, I do not believe it will be a
compatibility issue. Lack of "phy-handle" property within the core DSA
layer means: utilize the switch's internal MDIO bus (ds->slave_mii_bus)
instance, which you are not removing, you are just changing how the PHYs
map to port numbers.
>
> Regards,
> Christian
>
>
--
Florian
^ permalink raw reply
* [PATCH] net: stmmac: Variable "val" in function sun8i_dwmac_set_syscon() could be uninitialized
From: Yizhuo @ 2019-02-05 22:15 UTC (permalink / raw)
Cc: csong, zhiyunq, Yizhuo, Giuseppe Cavallaro, Alexandre Torgue,
Maxime Ripard, Chen-Yu Tsai, netdev, linux-arm-kernel,
linux-kernel
In function sun8i_dwmac_set_syscon(), local variable "val" could
be uninitialized if function regmap_read() returns -EINVAL.
However, it will be used directly in the if statement, which
is potentially unsafe.
Signed-off-by: Yizhuo <yzhai003@ucr.edu>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index 39c2122a4f26..11d481c9e7ab 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -639,9 +639,14 @@ static int sun8i_dwmac_set_syscon(struct stmmac_priv *priv)
struct sunxi_priv_data *gmac = priv->plat->bsp_priv;
struct device_node *node = priv->device->of_node;
int ret;
- u32 reg, val;
+ u32 reg, val = 0;
+
+ ret = regmap_read(gmac->regmap, SYSCON_EMAC_REG, &val);
+ if (ret) {
+ dev_err(priv->device, "Fail to read SYSCON_EMAC_REG.\n");
+ return ret;
+ }
- regmap_read(gmac->regmap, SYSCON_EMAC_REG, &val);
reg = gmac->variant->default_syscon_value;
if (reg != val)
dev_warn(priv->device,
--
2.17.1
^ permalink raw reply related
* Re: [PATCH net-next v2 10/12] staging: fsl-dpaa2: ethsw: Implement ndo_get_port_parent_id()
From: Florian Fainelli @ 2019-02-05 22:18 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Ido Schimmel, open list,
open list:MELLANOX MLX5 core VPI driver,
open list:NETRONOME ETHERNET DRIVERS, open list:STAGING SUBSYSTEM,
moderated list:ETHERNET BRIDGE
In-Reply-To: <20190205221311.18476-11-f.fainelli@gmail.com>
On 2/5/19 2:13 PM, Florian Fainelli wrote:
> ethsw implements SWITCHDEV_ATTR_ID_PORT_PARENT_ID and we want to get rid
> of switchdev_ops eventually, ease that migration by implementing a
> ndo_get_port_parent_id() function which returns what
> switchdev_port_attr_get() would do.
>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
> index daabaceeea52..622f32377b91 100644
> --- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
> +++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
> @@ -505,6 +505,17 @@ static netdev_tx_t port_dropframe(struct sk_buff *skb,
> return NETDEV_TX_OK;
> }
>
> +static int swdev_get_port_parent_id(struct net_device *dev,
> + struct netdev_phys_item_id *ppid)
> +{
> + struct ethsw_port_priv *port_priv = netdev_priv(dev);
> +
> + ppid->id_len = 1;
> + ppid->id[0] = port_priv->ethsw_data->dev_id;
> +
> + return 0;
> +}
> +
> static const struct net_device_ops ethsw_port_ops = {
> .ndo_open = port_open,
> .ndo_stop = port_stop,
> @@ -515,6 +526,7 @@ static const struct net_device_ops ethsw_port_ops = {
> .ndo_get_offload_stats = port_get_offload_stats,
>
> .ndo_start_xmit = port_dropframe,
> + .ndo_get_port_parent_id = swdev_get_port_parent_id,
> };
>
> static void ethsw_links_state_update(struct ethsw_core *ethsw)
> @@ -634,10 +646,6 @@ static int swdev_port_attr_get(struct net_device *netdev,
> struct ethsw_port_priv *port_priv = netdev_priv(netdev);
>
> switch (attr->id) {
> - case SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
> - attr->u.ppid.id_len = 1;
> - attr->u.ppid.id[0] = port_priv->ethsw_data->dev_id;
> - break;
> case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
> attr->u.brport_flags =
> (port_priv->ethsw_data->learning ? BR_LEARNING : 0) |
> @@ -1434,7 +1442,6 @@ static int ethsw_probe_port(struct ethsw_core *ethsw, u16 port_idx)
> SET_NETDEV_DEV(port_netdev, dev);
> port_netdev->netdev_ops = ðsw_port_ops;
> port_netdev->ethtool_ops = ðsw_port_ethtool_ops;
> - port_netdev->switchdev_ops = ðsw_port_switchdev_ops;
This hunk should not be removed, I will respin a new version after
getting feedback.
--
Florian
^ permalink raw reply
* [PATCH net-next v2 00/12] net: Introduce ndo_get_port_parent_id()
From: Florian Fainelli @ 2019-02-05 22:12 UTC (permalink / raw)
To: netdev
Cc: Florian Fainelli, David S. Miller, Ido Schimmel, open list,
open list:MELLANOX MLX5 core VPI driver,
open list:NETRONOME ETHERNET DRIVERS, open list:STAGING SUBSYSTEM,
moderated list:ETHERNET BRIDGE
Hi all,
Based on discussion with Ido and feedback from Jakub there are clearly
two classes of users that implement SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
- PF/VF drivers which typically only implement return the port's parent
ID, yet have to implement switchdev_port_attr_get() just for that
- Ethernet switch drivers: mlxsw, ocelot, DSA, etc. which implement more
attributes which we want to be able to eventually veto in the context
of the caller, thus making them candidates for using a blocking notifier
chain
Changes in v2:
- resolved build failures spotted by kbuild test robot
- added helpers functions into the core network device layer:
dev_get_port_parent_id() and netdev_port_same_parent_id();
- added support for recursion to lower devices
Changes from RFC:
- introduce a ndo_get_port_parent_id() and convert all relevant drivers
to use it
- get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID
A subsequent set of patches will convert switchdev_port_attr_set() to
use a blocking notifier call, and still get rid of
switchdev_port_attr_get() altogether.
Florian Fainelli (12):
net: Introduce ndo_get_port_parent_id()
bnxt: Implement ndo_get_port_parent_id()
liquidio: Implement ndo_get_port_parent_id()
net/mlx5e: Implement ndo_get_port_parent_id()
mlxsw: Implement ndo_get_port_parent_id()
mscc: ocelot: Implement ndo_get_port_parent_id()
nfp: Implement ndo_get_port_parent_id()
rocker: Implement ndo_get_port_parent_id()
netdevsim: Implement ndo_get_port_parent_id()
staging: fsl-dpaa2: ethsw: Implement ndo_get_port_parent_id()
net: dsa: Implement ndo_get_port_parent_id()
net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++------
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 4 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 3 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 12 ++--
.../net/ethernet/cavium/liquidio/lio_main.c | 22 ++-----
.../net/ethernet/cavium/liquidio/lio_vf_rep.c | 25 +++-----
.../ethernet/mellanox/mlx5/core/en/tc_tun.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_rep.c | 31 ++++------
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 5 +-
.../net/ethernet/mellanox/mlxsw/spectrum.c | 13 +++++
.../mellanox/mlxsw/spectrum_switchdev.c | 5 --
.../net/ethernet/mellanox/mlxsw/switchx2.c | 36 +++++-------
drivers/net/ethernet/mscc/ocelot.c | 33 +++++------
.../ethernet/netronome/nfp/flower/action.c | 3 +-
.../ethernet/netronome/nfp/nfp_net_common.c | 4 +-
.../net/ethernet/netronome/nfp/nfp_net_repr.c | 4 +-
drivers/net/ethernet/netronome/nfp/nfp_port.c | 23 ++------
drivers/net/ethernet/netronome/nfp/nfp_port.h | 4 +-
drivers/net/ethernet/rocker/rocker_main.c | 17 ++++--
drivers/net/netdevsim/netdev.c | 22 ++-----
drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 17 ++++--
include/linux/netdevice.h | 9 +++
include/net/switchdev.h | 11 ----
net/bridge/br_switchdev.c | 11 ++--
net/core/dev.c | 57 +++++++++++++++++++
net/core/net-sysfs.c | 12 +---
net/core/rtnetlink.c | 14 ++---
net/dsa/slave.c | 18 ++++--
net/ipv4/ipmr.c | 14 ++---
net/switchdev/switchdev.c | 20 -------
30 files changed, 219 insertions(+), 260 deletions(-)
--
2.17.1
^ permalink raw reply
* [PATCH net-next v2 01/12] net: Introduce ndo_get_port_parent_id()
From: Florian Fainelli @ 2019-02-05 22:13 UTC (permalink / raw)
To: netdev
Cc: Florian Fainelli, David S. Miller, Ido Schimmel, open list,
open list:MELLANOX MLX5 core VPI driver,
open list:NETRONOME ETHERNET DRIVERS, open list:STAGING SUBSYSTEM,
moderated list:ETHERNET BRIDGE
In-Reply-To: <20190205221311.18476-1-f.fainelli@gmail.com>
In preparation for getting rid of switchdev_ops, create a dedicated NDO
operation for getting the port's parent identifier. There are
essentially two classes of drivers that need to implement getting the
port's parent ID which are VF/PF drivers with a built-in switch, and
pure switchdev drivers such as mlxsw, ocelot, dsa etc.
We introduce a helper function: dev_get_port_parent_id() which supports
recursing into the lower devices to obtain the first port's parent ID.
Convert the bridge, core and ipv4 multicast routing code to check for
such ndo_get_port_parent_id() and call the helper functino when valid
before falling back to switchdev_port_attr_get(). This will allow us to
convert all relevant drivers in one go instead of having to implement
both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
then get rid of switchdev_port_attr_get().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
include/linux/netdevice.h | 9 +++++++
net/bridge/br_switchdev.c | 9 +++++--
net/core/dev.c | 57 +++++++++++++++++++++++++++++++++++++++
net/core/net-sysfs.c | 7 ++++-
net/core/rtnetlink.c | 6 ++++-
net/ipv4/ipmr.c | 8 +++++-
6 files changed, 91 insertions(+), 5 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba57d0ba425e..1d95e634f3fe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1188,6 +1188,10 @@ struct dev_ifalias {
* not implement this, it is assumed that the hw is not able to have
* multiple net devices on single physical port.
*
+ * int (*ndo_get_port_parent_id)(struct net_device *dev,
+ * struct netdev_phys_item_id *ppid)
+ * Called to get the parent ID of the physical port of this device.
+ *
* void (*ndo_udp_tunnel_add)(struct net_device *dev,
* struct udp_tunnel_info *ti);
* Called by UDP tunnel to notify a driver about the UDP port and socket
@@ -1412,6 +1416,8 @@ struct net_device_ops {
bool new_carrier);
int (*ndo_get_phys_port_id)(struct net_device *dev,
struct netdev_phys_item_id *ppid);
+ int (*ndo_get_port_parent_id)(struct net_device *dev,
+ struct netdev_phys_item_id *ppid);
int (*ndo_get_phys_port_name)(struct net_device *dev,
char *name, size_t len);
void (*ndo_udp_tunnel_add)(struct net_device *dev,
@@ -3651,6 +3657,9 @@ int dev_get_phys_port_id(struct net_device *dev,
struct netdev_phys_item_id *ppid);
int dev_get_phys_port_name(struct net_device *dev,
char *name, size_t len);
+int dev_get_port_parent_id(struct net_device *dev,
+ struct netdev_phys_item_id *ppid, bool recurse);
+bool netdev_port_same_parent_id(struct net_device *a, struct net_device *b);
int dev_change_proto_down(struct net_device *dev, bool proto_down);
struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev, bool *again);
struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 4d2b9eb7604a..06b0ae44585f 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -14,7 +14,8 @@ static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
/* dev is yet to be added to the port list. */
list_for_each_entry(p, &br->port_list, list) {
- if (switchdev_port_same_parent_id(dev, p->dev))
+ if (netdev_port_same_parent_id(dev, p->dev) ||
+ switchdev_port_same_parent_id(dev, p->dev))
return p->offload_fwd_mark;
}
@@ -23,6 +24,7 @@ static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
int nbp_switchdev_mark_set(struct net_bridge_port *p)
{
+ const struct net_device_ops *ops = p->dev->netdev_ops;
struct switchdev_attr attr = {
.orig_dev = p->dev,
.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
@@ -31,7 +33,10 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
ASSERT_RTNL();
- err = switchdev_port_attr_get(p->dev, &attr);
+ if (ops->ndo_get_port_parent_id)
+ err = dev_get_port_parent_id(p->dev, &attr.u.ppid, true);
+ else
+ err = switchdev_port_attr_get(p->dev, &attr);
if (err) {
if (err == -EOPNOTSUPP)
return 0;
diff --git a/net/core/dev.c b/net/core/dev.c
index bfa4be42afff..8c6d5cf8a308 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7877,6 +7877,63 @@ int dev_get_phys_port_name(struct net_device *dev,
}
EXPORT_SYMBOL(dev_get_phys_port_name);
+/**
+ * dev_get_port_parent_id - Get the device's port parent identifier
+ * @dev: network device
+ * @ppid: pointer to a storage for the port's parent identifier
+ * @recurse: allow/disallow recursion to lower devices
+ *
+ * Get the devices's port parent identifier
+ */
+int dev_get_port_parent_id(struct net_device *dev,
+ struct netdev_phys_item_id *ppid,
+ bool recurse)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+ struct netdev_phys_item_id first = { };
+ struct net_device *lower_dev;
+ struct list_head *iter;
+ int err = -EOPNOTSUPP;
+
+ if (ops->ndo_get_port_parent_id)
+ return ops->ndo_get_port_parent_id(dev, ppid);
+
+ if (!recurse)
+ return err;
+
+ netdev_for_each_lower_dev(dev, lower_dev, iter) {
+ err = dev_get_port_parent_id(lower_dev, ppid, recurse);
+ if (err)
+ break;
+ if (!first.id_len)
+ first = *ppid;
+ else if (memcmp(&first, ppid, sizeof(*ppid)))
+ return -ENODATA;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL(dev_get_port_parent_id);
+
+/**
+ * netdev_port_same_parent_id - Indicate if two network devices have
+ * the same port parent identifier
+ * @a: first network device
+ * @b: second network device
+ */
+bool netdev_port_same_parent_id(struct net_device *a, struct net_device *b)
+{
+ struct netdev_phys_item_id a_id = { };
+ struct netdev_phys_item_id b_id = { };
+
+ if (dev_get_port_parent_id(a, &a_id, true) ||
+ dev_get_port_parent_id(b, &b_id, true))
+ return false;
+
+ return netdev_phys_item_id_same(&a_id, &b_id);
+}
+EXPORT_SYMBOL(netdev_port_same_parent_id);
+
/**
* dev_change_proto_down - update protocol port state information
* @dev: device
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ff9fd2bb4ce4..4eace9f1dcf9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -495,6 +495,7 @@ static ssize_t phys_switch_id_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct net_device *netdev = to_net_dev(dev);
+ const struct net_device_ops *ops = netdev->netdev_ops;
ssize_t ret = -EINVAL;
if (!rtnl_trylock())
@@ -507,7 +508,11 @@ static ssize_t phys_switch_id_show(struct device *dev,
.flags = SWITCHDEV_F_NO_RECURSE,
};
- ret = switchdev_port_attr_get(netdev, &attr);
+ if (ops->ndo_get_port_parent_id)
+ ret = dev_get_port_parent_id(netdev, &attr.u.ppid,
+ false);
+ else
+ ret = switchdev_port_attr_get(netdev, &attr);
if (!ret)
ret = sprintf(buf, "%*phN\n", attr.u.ppid.id_len,
attr.u.ppid.id);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f5a98082ac7a..90dd02c1f561 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1146,6 +1146,7 @@ static int rtnl_phys_port_name_fill(struct sk_buff *skb, struct net_device *dev)
static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
{
+ const struct net_device_ops *ops = dev->netdev_ops;
int err;
struct switchdev_attr attr = {
.orig_dev = dev,
@@ -1153,7 +1154,10 @@ static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
.flags = SWITCHDEV_F_NO_RECURSE,
};
- err = switchdev_port_attr_get(dev, &attr);
+ if (ops->ndo_get_port_parent_id)
+ err = dev_get_port_parent_id(dev, &attr.u.ppid, false);
+ else
+ err = switchdev_port_attr_get(dev, &attr);
if (err) {
if (err == -EOPNOTSUPP)
return 0;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index fb99002c3d4e..c71bcc42d66d 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -837,6 +837,7 @@ static void ipmr_update_thresholds(struct mr_table *mrt, struct mr_mfc *cache,
static int vif_add(struct net *net, struct mr_table *mrt,
struct vifctl *vifc, int mrtsock)
{
+ const struct net_device_ops *ops;
int vifi = vifc->vifc_vifi;
struct switchdev_attr attr = {
.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
@@ -920,7 +921,12 @@ static int vif_add(struct net *net, struct mr_table *mrt,
(VIFF_TUNNEL | VIFF_REGISTER));
attr.orig_dev = dev;
- if (!switchdev_port_attr_get(dev, &attr)) {
+ ops = dev->netdev_ops;
+ if (ops->ndo_get_port_parent_id &&
+ !dev_get_port_parent_id(dev, &attr.u.ppid, true)) {
+ memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
+ v->dev_parent_id.id_len = attr.u.ppid.id_len;
+ } else if (!switchdev_port_attr_get(dev, &attr)) {
memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
v->dev_parent_id.id_len = attr.u.ppid.id_len;
} else {
--
2.17.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox