* Re: Requirements for a shutdown function?
From: Florian Fainelli @ 2017-05-09 18:46 UTC (permalink / raw)
To: Timur Tabi, netdev
In-Reply-To: <49bee65f-2ea8-1787-9642-659a967df8f0@codeaurora.org>
On 05/09/2017 09:58 AM, Timur Tabi wrote:
> I'm trying to add a platform_driver.shutdown function to my Ethernet driver
> (drivers/net/ethernet/qualcomm/emac/*), but I can't find any definitive
> information as to what a network driver shutdown callback is supposed to do.
> I also don't know what testcase I should use to verify that my function is
> working.
A good test case for exercising a .shutdown() function is kexec'ing a
new kernel for instance.
>
> I see only four instances of a platform_driver.shutdown function in
> drivers/net/ethernet:
>
> $ git grep -A 20 -w platform_driver | grep '\.shutdown'
> apm/xgene-v2/main.c- .shutdown = xge_shutdown,
> apm/xgene/xgene_enet_main.c- .shutdown = xgene_enet_shutdown,
> marvell/mv643xx_eth.c- .shutdown = mv643xx_eth_shutdown,
> marvell/pxa168_eth.c- .shutdown = pxa168_eth_shutdown,
>
> (Other shutdown functions are for pci_driver.shutdown).
>
> For the xgene drivers, the shutdown function just calls the 'remove'
> function. Isn't that overkill? Why bother with a shutdown function if it's
> just the same thing as removing the driver outright?
Yes, that appears unnecessary.
>
> mv643xx_eth_shutdown() seems more reasonable. All it does is halt the TX
> and RX queues.
>
> pxa168_eth_shutdown() is a little more heavyweight: halts the queues, and
> stops the DMA and calls phy_stop().
>
> Can anyone help me figure out what my driver really should do?
You should put your HW in a state where it won't be doing DMA, or have
any adverse side effects to the system, putting it in a low power state
is also a good approach.
--
Florian
^ permalink raw reply
* RE: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
From: Chiappero, Marco @ 2017-05-09 18:41 UTC (permalink / raw)
To: Dan Williams, Jiri Benc
Cc: netdev@vger.kernel.org, David S . Miller, Kirsher, Jeffrey T,
Duyck, Alexander H, Grandhi, Sainath, Mahesh Bandewar
In-Reply-To: <1494267188.8248.21.camel@redhat.com>
> -----Original Message-----
> From: Dan Williams [mailto:dcbw@redhat.com]
> Sent: Monday, May 8, 2017 7:13 PM
> To: Chiappero, Marco <marco.chiappero@intel.com>; Jiri Benc
> <jbenc@redhat.com>
> Cc: netdev@vger.kernel.org; David S . Miller <davem@davemloft.net>; Kirsher,
> Jeffrey T <jeffrey.t.kirsher@intel.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; Grandhi, Sainath
> <sainath.grandhi@intel.com>; Mahesh Bandewar <maheshb@google.com>
> Subject: Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
>
> On Mon, 2017-05-08 at 15:29 +0000, Chiappero, Marco wrote:
> > > -----Original Message-----
> > > From: Jiri Benc [mailto:jbenc@redhat.com]
> > > Sent: Thursday, May 4, 2017 5:44 PM
> > > To: Chiappero, Marco <marco.chiappero@intel.com>
> > > Cc: Dan Williams <dcbw@redhat.com>; netdev@vger.kernel.org; David S
> > >
> > > And doing some kind of weird MAC NAT in ipvlan just to satisfy
> > > broken tools that can't cope with multiple interfaces with the same
> > > MAC address is wrong, too.
> >
> > Ipvlan has always had the MAC issue, regardless, these tools simply
> > make it more apparent. And as I said already, whether they are broken
>
> If we're talking about the slaves being unable to change when the parent MAC
> changes, everyone agrees that was a bug.
>
> If we are talking about the "all slaves have the same MAC" then that's not an
> "issue", that's the way it's designed. Whether that design is the best design
> possible is debatable, but that's the way it currently is.
I'm referring to the latter, the former is obvious. That "design" looks like a temporary workaround to me, in fact a couple of TODOs in the code lead to believe it is. They should be removed, at the very least.
> > is debatable (yet I have to read a reasonable motivation). At the
>
> Existing tools are already broken for bond slaves and a couple existing drivers,
> see below.
>
> Note that making the MACs unique would break DHCP functionality, because
> now the MAC address encoded in the 'chaddr' field of the DHCP packet would
> no longer correspond to the MAC address of the device that DHCP replies should
> be received on. The userspace process writes the 'chaddr' field, and the
> userspace process would see the unique (and
> incorrect) MAC address.
Fair point. However, despite not fixing the issues with DHCP, it would be no way worse that it is now (even though I admit I don't like much the workaround). BTW, I don't know about the ISC upstream version, but on Debian/Ubuntu the -B flag is not available, which makes ipvlan+DHCP non-viable on a very large number of deployments.
> > very least their expectation to have unique addresses on the same
> > broadcast domain is hardly arguable. Should ipvlan considered special?
> > Again, questionable.
>
> bond slaves
> drivers/net/ethernet/toshiba/ps3_gelic_net.c
> drivers/s390/net/qeth_l3_main.c
>
> already all have the same MAC address for different netdevices. Yeah, not many
> people have PS3s anymore, but s390 qeth is fairly widely used.
> Just pointing out that ipvlan is hardly the first device to have encountered this
> issue, or to have solved it this way. qeth does pretty much the same thing as
> ipvlan.
>
> I'm not arguing that ipvlan is perfect. I'm just arguing that in its current form
> (eg, virtualized L2 device) making this change is incorrect.
Don't get me wrong, I understand and appreciate your lengthy reply, even though the fact that a poor solution is already included elsewhere doesn't make it any better.
However, regardless of the uniqueness topic, I don't feel is completely right to change the whole world upstairs making it ipvlan aware either, unless there is a real and coherent differentiation (e.g. L3 only interfaces). I'll drop considering ipvlan as an option for now.
Regards,
Marco
--------------------------------------------------------------
Intel Research and Development Ireland Limited
Registered in Ireland
Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
Registered Number: 308263
This e-mail and any attachments may contain confidential material for the sole
use of the intended recipient(s). Any review or distribution by others is
strictly prohibited. If you are not the intended recipient, please contact the
sender and delete all copies.
^ permalink raw reply
* Re: [PATCH net v2] driver: vrf: Fix one possible use-after-free issue
From: David Miller @ 2017-05-09 18:37 UTC (permalink / raw)
To: gfree.wind; +Cc: dsa, shm, fw, netdev
In-Reply-To: <1494325653-39885-1-git-send-email-gfree.wind@vip.163.com>
From: gfree.wind@vip.163.com
Date: Tue, 9 May 2017 18:27:33 +0800
> @@ -989,6 +989,7 @@ static u32 vrf_fib_table(const struct net_device *dev)
>
> static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> {
> + kfree_skb(skb);
> return 0;
> }
>
> @@ -998,7 +999,7 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int hook,
> {
> struct net *net = dev_net(dev);
>
> - if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
> + if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
> skb = NULL; /* kfree_skb(skb) handled by nf code */
>
> return skb;
Indeed, this fixes the immediate problem with NF_STOLEN.
Making NF_STOLEN fully functional is another story, we'd need to stack
this all together properly:
static int __ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
...
}
static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
return l3mdev_ip_rcv(skb, __ip_rcv_finish);
}
...
static inline
struct sk_buff *l3mdev_ip_rcv(struct sk_buff *skb,
int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
return l3mdev_l3_rcv(skb, okfn, AF_INET);
}
etc. but that's going to really add a kink to the receive path,
microbenchmark wise.
^ permalink raw reply
* Re: bpf pointer alignment validation
From: David Miller @ 2017-05-09 18:32 UTC (permalink / raw)
To: daniel; +Cc: ast, netdev
In-Reply-To: <59104D35.8080108@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 08 May 2017 12:49:25 +0200
> Could you also add test cases specifically to this for test_verifier
> in bpf selftests? I'm thinking of the cases when we have no pkt id
> and offset originated from reg->off (accumulated through const imm
> ops on reg) and insn->off, where we had i) no pkt id and ii) a
> specific pkt id (so we can probe for aux_off_align rejection as well).
> I believe we do have coverage to some extend in some of the tests
> (more on the map_value_adj though), but it would be good to keep
> tracking this specifically as well.
So I created a new test, that uses the verifier, but in a new way.
The thing I wanted to do is match on verifier dump strings so that I
can test what the verifier thinks about the known alignment et al. of
various registers after the execution of certain instructions.
I accomplish this by doing two things:
1) If the log level of 2 or greater is given to the verifier, I dump
the internal state to the log after every instruction.
2) A new BPF library helper called bpf_verify_program() is added which
always passes the log to the BPF system call and uses a log level
of 2.
Then in my "test_align" I have BPF programs as well as strings to
match against in the verifier dump.
The whole thing is below, and writing the test cases certainly helped
me refine the implementation quite a bit.
I'll keep adding some more tests and hacking on this. It just
occurred to me that it would be extremely variable to be able to turn
on strict alignment requirements even on architectures that do not
need it.
That way anyone adding regressions to the alignment code, or hitting
new cases the code can't currently handle, would notice immediately
regardles of the type of system the run the test case on.
To be quite honest, strict alignment might not be a bad default either
to give helpful feedback to people writing eBPF programs.
---
include/linux/bpf_verifier.h | 3 +
kernel/bpf/verifier.c | 104 ++++++++-
tools/lib/bpf/bpf.c | 20 ++
tools/lib/bpf/bpf.h | 4 +
tools/testing/selftests/bpf/Makefile | 4 +-
tools/testing/selftests/bpf/test_align.c | 378 +++++++++++++++++++++++++++++++
6 files changed, 500 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/bpf/test_align.c
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 5efb4db..7c6a519 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -40,6 +40,9 @@ struct bpf_reg_state {
*/
s64 min_value;
u64 max_value;
+ u32 min_align;
+ u32 aux_off;
+ u32 aux_off_align;
};
enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2ff608..2b1b06e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -241,6 +241,12 @@ static void print_verifier_state(struct bpf_verifier_state *state)
if (reg->max_value != BPF_REGISTER_MAX_RANGE)
verbose(",max_value=%llu",
(unsigned long long)reg->max_value);
+ if (reg->min_align)
+ verbose(",min_align=%u", reg->min_align);
+ if (reg->aux_off)
+ verbose(",aux_off=%u", reg->aux_off);
+ if (reg->aux_off_align)
+ verbose(",aux_off_align=%u", reg->aux_off_align);
}
for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) {
if (state->stack_slot_type[i] == STACK_SPILL)
@@ -455,6 +461,9 @@ static void init_reg_state(struct bpf_reg_state *regs)
regs[i].imm = 0;
regs[i].min_value = BPF_REGISTER_MIN_RANGE;
regs[i].max_value = BPF_REGISTER_MAX_RANGE;
+ regs[i].min_align = 0;
+ regs[i].aux_off = 0;
+ regs[i].aux_off_align = 0;
}
/* frame pointer */
@@ -483,11 +492,17 @@ static void reset_reg_range_values(struct bpf_reg_state *regs, u32 regno)
regs[regno].max_value = BPF_REGISTER_MAX_RANGE;
}
+static void reset_reg_align(struct bpf_reg_state *regs, u32 regno)
+{
+ regs[regno].min_align = 0;
+}
+
static void mark_reg_unknown_value_and_range(struct bpf_reg_state *regs,
u32 regno)
{
mark_reg_unknown_value(regs, regno);
reset_reg_range_values(regs, regno);
+ reset_reg_align(regs, regno);
}
enum reg_arg_type {
@@ -770,13 +785,24 @@ static bool is_pointer_value(struct bpf_verifier_env *env, int regno)
static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
int off, int size)
{
- if (reg->id && size != 1) {
- verbose("Unknown alignment. Only byte-sized access allowed in packet access.\n");
- return -EACCES;
+ int reg_off;
+
+ /* Byte size accesses are always allowed. */
+ if (size == 1)
+ return 0;
+
+ reg_off = reg->off;
+ if (reg->id) {
+ if (reg->aux_off_align % size) {
+ verbose("Packet access is only %u byte aligned, %d byte access not allowed\n",
+ reg->aux_off_align, size);
+ return -EACCES;
+ }
+ reg_off += reg->aux_off;
}
/* skb->data is NET_IP_ALIGN-ed */
- if ((NET_IP_ALIGN + reg->off + off) % size != 0) {
+ if ((NET_IP_ALIGN + reg_off + off) % size != 0) {
verbose("misaligned packet access off %d+%d+%d size %d\n",
NET_IP_ALIGN, reg->off, off, size);
return -EACCES;
@@ -872,6 +898,8 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
value_regno);
/* note that reg.[id|off|range] == 0 */
state->regs[value_regno].type = reg_type;
+ state->regs[value_regno].aux_off = 0;
+ state->regs[value_regno].aux_off_align = 0;
}
} else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
@@ -1444,6 +1472,8 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
*/
dst_reg->off += imm;
} else {
+ bool had_id;
+
if (src_reg->type == PTR_TO_PACKET) {
/* R6=pkt(id=0,off=0,r=62) R7=imm22; r7 += r6 */
tmp_reg = *dst_reg; /* save r7 state */
@@ -1477,14 +1507,23 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
src_reg->imm);
return -EACCES;
}
+
+ had_id = (dst_reg->id != 0);
+
/* dst_reg stays as pkt_ptr type and since some positive
* integer value was added to the pointer, increment its 'id'
*/
dst_reg->id = ++env->id_gen;
- /* something was added to pkt_ptr, set range and off to zero */
+ /* something was added to pkt_ptr, set range to zero */
+ dst_reg->aux_off = dst_reg->off;
dst_reg->off = 0;
dst_reg->range = 0;
+ if (had_id)
+ dst_reg->aux_off_align = min(dst_reg->aux_off_align,
+ src_reg->min_align);
+ else
+ dst_reg->aux_off_align = src_reg->min_align;
}
return 0;
}
@@ -1658,6 +1697,20 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
reg->min_value = BPF_REGISTER_MIN_RANGE;
}
+static u32 calc_align(u32 imm)
+{
+ u32 align = 1;
+
+ if (!imm)
+ return 1U << 31;
+
+ while (!(imm & 1)) {
+ imm >>= 1;
+ align <<= 1;
+ }
+ return align;
+}
+
static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
struct bpf_insn *insn)
{
@@ -1665,8 +1718,10 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
s64 min_val = BPF_REGISTER_MIN_RANGE;
u64 max_val = BPF_REGISTER_MAX_RANGE;
u8 opcode = BPF_OP(insn->code);
+ u32 dst_align, src_align;
dst_reg = ®s[insn->dst_reg];
+ src_align = 0;
if (BPF_SRC(insn->code) == BPF_X) {
check_reg_overflow(®s[insn->src_reg]);
min_val = regs[insn->src_reg].min_value;
@@ -1682,18 +1737,25 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
regs[insn->src_reg].type != UNKNOWN_VALUE) {
min_val = BPF_REGISTER_MIN_RANGE;
max_val = BPF_REGISTER_MAX_RANGE;
+ src_align = 0;
+ } else {
+ src_align = regs[insn->src_reg].min_align;
}
} else if (insn->imm < BPF_REGISTER_MAX_RANGE &&
(s64)insn->imm > BPF_REGISTER_MIN_RANGE) {
min_val = max_val = insn->imm;
+ src_align = calc_align(insn->imm);
}
+ dst_align = dst_reg->min_align;
+
/* We don't know anything about what was done to this register, mark it
* as unknown.
*/
if (min_val == BPF_REGISTER_MIN_RANGE &&
max_val == BPF_REGISTER_MAX_RANGE) {
reset_reg_range_values(regs, insn->dst_reg);
+ reset_reg_align(regs, insn->dst_reg);
return;
}
@@ -1712,18 +1774,21 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
dst_reg->min_value += min_val;
if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
dst_reg->max_value += max_val;
+ dst_reg->min_align = min(src_align, dst_align);
break;
case BPF_SUB:
if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
dst_reg->min_value -= min_val;
if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
dst_reg->max_value -= max_val;
+ dst_reg->min_align = min(src_align, dst_align);
break;
case BPF_MUL:
if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
dst_reg->min_value *= min_val;
if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
dst_reg->max_value *= max_val;
+ dst_reg->min_align = max(src_align, dst_align);
break;
case BPF_AND:
/* Disallow AND'ing of negative numbers, ain't nobody got time
@@ -1735,17 +1800,23 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
else
dst_reg->min_value = 0;
dst_reg->max_value = max_val;
+ dst_reg->min_align = max(src_align, dst_align);
break;
case BPF_LSH:
/* Gotta have special overflow logic here, if we're shifting
* more than MAX_RANGE then just assume we have an invalid
* range.
*/
- if (min_val > ilog2(BPF_REGISTER_MAX_RANGE))
+ if (min_val > ilog2(BPF_REGISTER_MAX_RANGE)) {
dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
- else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
- dst_reg->min_value <<= min_val;
-
+ dst_reg->min_align = 1;
+ } else {
+ if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+ dst_reg->min_value <<= min_val;
+ if (!dst_reg->min_align)
+ dst_reg->min_align = 1;
+ dst_reg->min_align <<= min_val;
+ }
if (max_val > ilog2(BPF_REGISTER_MAX_RANGE))
dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
else if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
@@ -1755,11 +1826,19 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
/* RSH by a negative number is undefined, and the BPF_RSH is an
* unsigned shift, so make the appropriate casts.
*/
- if (min_val < 0 || dst_reg->min_value < 0)
+ if (min_val < 0 || dst_reg->min_value < 0) {
dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
- else
+ } else {
dst_reg->min_value =
(u64)(dst_reg->min_value) >> min_val;
+ }
+ if (min_val < 0) {
+ dst_reg->min_align = 1;
+ } else {
+ dst_reg->min_align >>= (u64) min_val;
+ if (!dst_reg->min_align)
+ dst_reg->min_align = 1;
+ }
if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
dst_reg->max_value >>= max_val;
break;
@@ -1861,6 +1940,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
regs[insn->dst_reg].imm = insn->imm;
regs[insn->dst_reg].max_value = insn->imm;
regs[insn->dst_reg].min_value = insn->imm;
+ regs[insn->dst_reg].min_align = calc_align(insn->imm);
}
} else if (opcode > BPF_END) {
@@ -2845,7 +2925,7 @@ static int do_check(struct bpf_verifier_env *env)
goto process_bpf_exit;
}
- if (log_level && do_print_state) {
+ if (log_level > 1 || (log_level && do_print_state)) {
verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
print_verifier_state(&env->cur_state);
do_print_state = false;
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4fe444b80..d9d3164 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -117,6 +117,26 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
+int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t insns_cnt, const char *license,
+ __u32 kern_version, char *log_buf, size_t log_buf_sz)
+{
+ union bpf_attr attr;
+
+ bzero(&attr, sizeof(attr));
+ attr.prog_type = type;
+ attr.insn_cnt = (__u32)insns_cnt;
+ attr.insns = ptr_to_u64(insns);
+ attr.license = ptr_to_u64(license);
+ attr.log_buf = ptr_to_u64(log_buf);
+ attr.log_size = log_buf_sz;
+ attr.log_level = 2;
+ log_buf[0] = 0;
+ attr.kern_version = kern_version;
+
+ return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
int bpf_map_update_elem(int fd, const void *key, const void *value,
__u64 flags)
{
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index edb4dae..1a32b02 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,6 +35,10 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
size_t insns_cnt, const char *license,
__u32 kern_version, char *log_buf,
size_t log_buf_sz);
+int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+ size_t insns_cnt, const char *license,
+ __u32 kern_version, char *log_buf,
+ size_t log_buf_sz);
int bpf_map_update_elem(int fd, const void *key, const void *value,
__u64 flags);
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 91edd05..1e1cde1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -11,7 +11,8 @@ endif
CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
LDLIBS += -lcap -lelf
-TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs
+TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
+ test_align
TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o
diff --git a/tools/testing/selftests/bpf/test_align.c b/tools/testing/selftests/bpf/test_align.c
new file mode 100644
index 0000000..22f34fa
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -0,0 +1,378 @@
+#include <asm/types.h>
+#include <linux/types.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <stddef.h>
+#include <stdbool.h>
+
+#include <linux/unistd.h>
+#include <linux/filter.h>
+#include <linux/bpf_perf_event.h>
+#include <linux/bpf.h>
+
+#include <bpf/bpf.h>
+
+#include "../../../include/linux/filter.h"
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+#define MAX_INSNS 512
+#define MAX_MATCHES 16
+
+struct bpf_align_test {
+ const char *descr;
+ struct bpf_insn insns[MAX_INSNS];
+ enum {
+ UNDEF,
+ ACCEPT,
+ REJECT
+ } result;
+ enum bpf_prog_type prog_type;
+ const char *matches[MAX_MATCHES];
+};
+
+static struct bpf_align_test tests[] = {
+ {
+ .descr = "mov",
+ .insns = {
+ BPF_MOV64_IMM(BPF_REG_3, 2),
+ BPF_MOV64_IMM(BPF_REG_3, 4),
+ BPF_MOV64_IMM(BPF_REG_3, 8),
+ BPF_MOV64_IMM(BPF_REG_3, 16),
+ BPF_MOV64_IMM(BPF_REG_3, 32),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 0 to 1: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+ "from 0 to 2: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+ "from 0 to 3: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+ "from 0 to 4: R1=ctx R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+ "from 0 to 5: R1=ctx R3=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
+ },
+ },
+ {
+ .descr = "shift",
+ .insns = {
+ BPF_MOV64_IMM(BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_3, 4),
+ BPF_MOV64_IMM(BPF_REG_4, 32),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 0 to 1: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R10=fp",
+ "from 0 to 2: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+ "from 0 to 3: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+ "from 0 to 4: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+ "from 0 to 5: R1=ctx R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+ "from 0 to 6: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R10=fp",
+ "from 0 to 7: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
+ "from 0 to 8: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+ "from 0 to 9: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+ "from 0 to 10: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+ "from 0 to 11: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+ },
+ },
+ {
+ .descr = "addsub",
+ .insns = {
+ BPF_MOV64_IMM(BPF_REG_3, 4),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, 4),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, 2),
+ BPF_MOV64_IMM(BPF_REG_4, 8),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 2),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 0 to 1: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+ "from 0 to 2: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=4 R10=fp",
+ "from 0 to 3: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R10=fp",
+ "from 0 to 4: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+ "from 0 to 5: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm12,min_value=12,max_value=12,min_align=4 R10=fp",
+ "from 0 to 6: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm14,min_value=14,max_value=14,min_align=2 R10=fp",
+ },
+ },
+ {
+ .descr = "mul",
+ .insns = {
+ BPF_MOV64_IMM(BPF_REG_3, 7),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 2),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 4),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 0 to 1: R1=ctx R3=imm7,min_value=7,max_value=7,min_align=1 R10=fp",
+ "from 0 to 2: R1=ctx R3=imm7,min_value=7,max_value=7,min_align=1 R10=fp",
+ "from 0 to 3: R1=ctx R3=imm14,min_value=14,max_value=14,min_align=2 R10=fp",
+ "from 0 to 4: R1=ctx R3=imm56,min_value=56,max_value=56,min_align=4 R10=fp",
+ },
+ },
+
+#define LOAD_UNKNOWN(DST_REG) \
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, \
+ offsetof(struct __sk_buff, data)), \
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1, \
+ offsetof(struct __sk_buff, data_end)), \
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), \
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8), \
+ BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_0, 1), \
+ BPF_EXIT_INSN(), \
+ BPF_LDX_MEM(BPF_B, DST_REG, BPF_REG_2, 0)
+
+ {
+ .descr = "unknown shift",
+ .insns = {
+ LOAD_UNKNOWN(BPF_REG_3),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+ LOAD_UNKNOWN(BPF_REG_4),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_4, 5),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R10=fp",
+ "from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv55,min_align=2 R10=fp",
+ "from 4 to 9: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv54,min_align=4 R10=fp",
+ "from 4 to 10: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv53,min_align=8 R10=fp",
+ "from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv52,min_align=16 R10=fp",
+ "from 15 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv56 R10=fp",
+ "from 15 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv51,min_align=32 R10=fp",
+ "from 15 to 20: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv52,min_align=16 R10=fp",
+ "from 15 to 21: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv53,min_align=8 R10=fp",
+ "from 15 to 22: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv54,min_align=4 R10=fp",
+ "from 15 to 23: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv55,min_align=2 R10=fp",
+ },
+ },
+ {
+ .descr = "unknown mul",
+ .insns = {
+ LOAD_UNKNOWN(BPF_REG_3),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 1),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 2),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 4),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 8),
+ BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 2),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ "from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R10=fp",
+ "from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+ "from 4 to 9: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv55,min_align=1 R10=fp",
+ "from 4 to 10: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+ "from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv54,min_align=2 R10=fp",
+ "from 4 to 12: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+ "from 4 to 13: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv53,min_align=4 R10=fp",
+ "from 4 to 14: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+ "from 4 to 15: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv52,min_align=8 R10=fp",
+ "from 4 to 16: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv50,min_align=8 R10=fp"
+ },
+ },
+ {
+ .descr = "packet variable offset",
+ .insns = {
+ LOAD_UNKNOWN(BPF_REG_6),
+ BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+
+ /* First, add a constant to the packet pointer,
+ * then a variable with a known alignment.
+ */
+ BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 14),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+ BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+ BPF_EXIT_INSN(),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_5, 0),
+
+ /* Now, test in the other direction. Adding first
+ * the variable offset, then the constant.
+ */
+ BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 14),
+ BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+ BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+ BPF_EXIT_INSN(),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_5, 0),
+
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+ .matches = {
+ /* Calculated offset in R4 has unknown value, but known
+ * alignment of 4.
+ */
+ "from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R6=inv54,min_align=4 R10=fp",
+
+ /* Offset is added to packet pointer, resulting in known
+ * auxiliary alignment and offset.
+ */
+ "from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R5=pkt(id=1,off=0,r=0),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+ /* At the time the word size load is performed from R5,
+ * it's total offset is NET_IP_ALIGN + reg->off (0) +
+ * reg->aux_off (14) which is 16. Then the variable
+ * offset is considered using reg->aux_off_align which
+ * is 4 and meets the load's requirements.
+ */
+ "from 13 to 15: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=1,off=4,r=4),aux_off=14,aux_off_align=4 R5=pkt(id=1,off=0,r=4),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+
+ /* Variable offset is added to R5 packet pointer,
+ * resulting in auxiliary alignment of 4.
+ */
+ "from 13 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=0,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+ /* Constant offset is added to R5, resulting in
+ * reg->off of 14.
+ */
+ "from 13 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=14,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+ /* At the time the word size load is performed from R5,
+ * it's total offset is NET_IP_ALIGN + reg->off (14) which
+ * is 16. Then the variable offset is considered using
+ * reg->aux_off_align which is 4 and meets the load's
+ * requirements.
+ */
+ "from 21 to 23: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=2,off=18,r=18),aux_off_align=4 R5=pkt(id=2,off=14,r=18),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+ },
+ },
+};
+
+static int probe_filter_length(const struct bpf_insn *fp)
+{
+ int len;
+
+ for (len = MAX_INSNS - 1; len > 0; --len)
+ if (fp[len].code != 0 || fp[len].imm != 0)
+ break;
+ return len + 1;
+}
+
+static char bpf_vlog[32768];
+
+static int do_test_single(struct bpf_align_test *test)
+{
+ struct bpf_insn *prog = test->insns;
+ int prog_type = test->prog_type;
+ int prog_len, i;
+ int fd_prog;
+ int ret;
+
+ prog_len = probe_filter_length(prog);
+ fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
+ prog, prog_len, "GPL", 0,
+ bpf_vlog, sizeof(bpf_vlog));
+ if (fd_prog < 0) {
+ printf("Failed to load program.\n");
+ printf("%s", bpf_vlog);
+ ret = 1;
+ } else {
+ ret = 0;
+ for (i = 0; i < MAX_MATCHES; i++) {
+ const char *t, *m = test->matches[i];
+
+ if (!m)
+ break;
+ t = strstr(bpf_vlog, m);
+ if (!t) {
+ printf("Failed to find match: %s\n", m);
+ ret = 1;
+ printf("%s", bpf_vlog);
+ break;
+ }
+ }
+ /* printf("%s", bpf_vlog); */
+ close(fd_prog);
+ }
+ return ret;
+}
+
+static int do_test(unsigned int from, unsigned int to)
+{
+ int all_pass = 0;
+ int all_fail = 0;
+ unsigned int i;
+
+ for (i = from; i < to; i++) {
+ struct bpf_align_test *test = &tests[i];
+ int fail;
+
+ printf("Test %3d: %s ... ",
+ i, test->descr);
+ fail = do_test_single(test);
+ if (fail) {
+ all_fail++;
+ printf("FAIL\n");
+ } else {
+ all_pass++;
+ printf("PASS\n");
+ }
+ }
+ printf("Results: %d pass %d fail\n",
+ all_pass, all_fail);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ unsigned int from = 0, to = ARRAY_SIZE(tests);
+
+ if (argc == 3) {
+ unsigned int l = atoi(argv[argc - 2]);
+ unsigned int u = atoi(argv[argc - 1]);
+
+ if (l < to && u < to) {
+ from = l;
+ to = u + 1;
+ }
+ } else if (argc == 2) {
+ unsigned int t = atoi(argv[argc - 1]);
+
+ if (t < to) {
+ from = t;
+ to = t + 1;
+ }
+ }
+ return do_test(from, to);
+}
--
2.1.2.532.g19b5d50
^ permalink raw reply related
* Re: mlx5 endpoint driver problem
From: Joao Pinto @ 2017-05-09 17:57 UTC (permalink / raw)
To: Saeed Mahameed, Joao Pinto; +Cc: Saeed Mahameed, netdev
In-Reply-To: <CALzJLG9OSk4MQEAhHLh76G2n=L=nkahKc_Ro+Fu9HyLYxmuJ=A@mail.gmail.com>
Hi again Saeed,
Às 6:44 PM de 5/9/2017, Saeed Mahameed escreveu:
> On Tue, May 9, 2017 at 8:38 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>> Hi Saeed,
>>
>> Às 6:35 PM de 5/9/2017, Saeed Mahameed escreveu:
>>> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>>>> Hello,
>>>>
>>>> I am making tests with a Mellanox MLX5 Endpoint, and I am getting kernel hangs
>>>> when trying to enable the hca:
>>>>
>>>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>>>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>>>> INFO: task swapper:1 blocked for more than 10 seconds.
>>>> Not tainted 4.11.0-BETAMSIX1 #51
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> swapper D 0 1 0 0x00000000
>>>>
>>>> Stack Trace:
>>>> __switch_to+0x0/0x94
>>>> __schedule+0x1da/0x8b0
>>>> schedule+0x26/0x6c
>>>> schedule_timeout+0x2da/0x380
>>>> wait_for_completion+0x92/0x104
>>>> mlx5_cmd_exec+0x70e/0xd60
>>>> mlx5_load_one+0x1b4/0xad8
>>>> init_one+0x404/0x600
>>>> pci_device_probe+0x122/0x1f0
>>>> really_probe+0x1ac/0x348
>>>> __driver_attach+0xa8/0xd0
>>>> bus_for_each_dev+0x3c/0x74
>>>> bus_add_driver+0xc2/0x184
>>>> driver_register+0x50/0xec
>>>> init+0x40/0x60
>>>>
>>>> (...)
>>>>
>>>> Stack Trace:
>>>> __switch_to+0x0/0x94
>>>> __schedule+0x1da/0x8b0
>>>> schedule+0x26/0x6c
>>>> schedule_timeout+0x2da/0x380
>>>> wait_for_completion+0x92/0x104
>>>> mlx5_cmd_exec+0x70e/0xd60
>>>> mlx5_load_one+0x1b4/0xad8
>>>> init_one+0x404/0x600
>>>> pci_device_probe+0x122/0x1f0
>>>> really_probe+0x1ac/0x348
>>>> __driver_attach+0xa8/0xd0
>>>> bus_for_each_dev+0x3c/0x74
>>>> bus_add_driver+0xc2/0x184
>>>> driver_register+0x50/0xec
>>>> init+0x40/0x60
>>>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>>>> cause a leak of a command resource
>>>> mlx5_core 0000:01:00.0: enable hca failed
>>>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>>>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>>>
>>>> Could you give me a clue of what might be happennig?
>>>>
>>>
>>> Hi Joao,
>>>
>>> Looks like FW is not responding, most likely due to the DMA mask
>>> setting warnings, which architecture is this ?
>>>
>>>> Thanks,
>>>> Joao
>>
>> I am working with a 32-bit ARC processor based board, connected to a prototyped
>> Gen4 PCI RC.
>>
>
> Ok, i will consult with our PCI and FW experts and get back to you.
>
> please note that the current mlx5 driver was never tested on 32-bit
> architecture and might not work properly for you.
I have new data for you. My colleague is using a Mellanox MT27800 Family
(ConnectX-5) with Firmware version 16.19.148 and it does have hangs, but it
fails in the CPU mask:
mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.148
mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
mlx5_core 0000:01:00.0: mlx5_irq_set_affinity_hint:628:(pid 1):
irq_set_affinity_hint failed,irq 0x0032
mlx5_core 0000:01:00.0: Failed to alloc affinity hint cpumask
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -22
mlx5_core: probe of 0000:01:00.0 failed with error -22
Mine is a Mellanox MT28800 Family (ConnectX-5) with Firmware Version 16.19.21102.
Hopes it gives more data for analysis.
Thanks,
Joao
>
>> Thanks,
>> Joao
>>
>>
>>
^ permalink raw reply
* Re: [PATCH 0/2] net: Set maximum receive packet size on veth interfaces
From: David Miller @ 2017-05-09 17:48 UTC (permalink / raw)
To: fredrik.markstrom; +Cc: eric.dumazet, daniel, netdev, bridge, ast, linux-kernel
In-Reply-To: <20170509124439.45674-1-fredrik.markstrom@gmail.com>
From: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Date: Tue, 9 May 2017 14:44:36 +0200
> Currently veth drops all packets larger then the mtu set on the
> receiving end of the pair. This is inconsistent with most hardware
> ethernet drivers.
False.
In fact, many pieces of ethernet hardware a not physically capable of
sending even VLAN packets that are above the normal base ethernet MTU.
It is therefore untenable to remove this restriction univerally like
this.
^ permalink raw reply
* Re: mlx5 endpoint driver problem
From: Saeed Mahameed @ 2017-05-09 17:44 UTC (permalink / raw)
To: Joao Pinto; +Cc: Saeed Mahameed, netdev
In-Reply-To: <1ea4f8fa-16dc-01d8-2995-514053635fd3@synopsys.com>
On Tue, May 9, 2017 at 8:38 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
> Hi Saeed,
>
> Às 6:35 PM de 5/9/2017, Saeed Mahameed escreveu:
>> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>>> Hello,
>>>
>>> I am making tests with a Mellanox MLX5 Endpoint, and I am getting kernel hangs
>>> when trying to enable the hca:
>>>
>>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>>> INFO: task swapper:1 blocked for more than 10 seconds.
>>> Not tainted 4.11.0-BETAMSIX1 #51
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> swapper D 0 1 0 0x00000000
>>>
>>> Stack Trace:
>>> __switch_to+0x0/0x94
>>> __schedule+0x1da/0x8b0
>>> schedule+0x26/0x6c
>>> schedule_timeout+0x2da/0x380
>>> wait_for_completion+0x92/0x104
>>> mlx5_cmd_exec+0x70e/0xd60
>>> mlx5_load_one+0x1b4/0xad8
>>> init_one+0x404/0x600
>>> pci_device_probe+0x122/0x1f0
>>> really_probe+0x1ac/0x348
>>> __driver_attach+0xa8/0xd0
>>> bus_for_each_dev+0x3c/0x74
>>> bus_add_driver+0xc2/0x184
>>> driver_register+0x50/0xec
>>> init+0x40/0x60
>>>
>>> (...)
>>>
>>> Stack Trace:
>>> __switch_to+0x0/0x94
>>> __schedule+0x1da/0x8b0
>>> schedule+0x26/0x6c
>>> schedule_timeout+0x2da/0x380
>>> wait_for_completion+0x92/0x104
>>> mlx5_cmd_exec+0x70e/0xd60
>>> mlx5_load_one+0x1b4/0xad8
>>> init_one+0x404/0x600
>>> pci_device_probe+0x122/0x1f0
>>> really_probe+0x1ac/0x348
>>> __driver_attach+0xa8/0xd0
>>> bus_for_each_dev+0x3c/0x74
>>> bus_add_driver+0xc2/0x184
>>> driver_register+0x50/0xec
>>> init+0x40/0x60
>>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>>> cause a leak of a command resource
>>> mlx5_core 0000:01:00.0: enable hca failed
>>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>>
>>> Could you give me a clue of what might be happennig?
>>>
>>
>> Hi Joao,
>>
>> Looks like FW is not responding, most likely due to the DMA mask
>> setting warnings, which architecture is this ?
>>
>>> Thanks,
>>> Joao
>
> I am working with a 32-bit ARC processor based board, connected to a prototyped
> Gen4 PCI RC.
>
Ok, i will consult with our PCI and FW experts and get back to you.
please note that the current mlx5 driver was never tested on 32-bit
architecture and might not work properly for you.
> Thanks,
> Joao
>
>
>
^ permalink raw reply
* Re: mlx5 endpoint driver problem
From: Joao Pinto @ 2017-05-09 17:38 UTC (permalink / raw)
To: Saeed Mahameed, Joao Pinto; +Cc: Saeed Mahameed, netdev
In-Reply-To: <CALzJLG8uieD52haz2WYbh2zWLKLgF-kUFw-2DU-sqC1+byPAKw@mail.gmail.com>
Hi Saeed,
Às 6:35 PM de 5/9/2017, Saeed Mahameed escreveu:
> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>> Hello,
>>
>> I am making tests with a Mellanox MLX5 Endpoint, and I am getting kernel hangs
>> when trying to enable the hca:
>>
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> INFO: task swapper:1 blocked for more than 10 seconds.
>> Not tainted 4.11.0-BETAMSIX1 #51
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> swapper D 0 1 0 0x00000000
>>
>> Stack Trace:
>> __switch_to+0x0/0x94
>> __schedule+0x1da/0x8b0
>> schedule+0x26/0x6c
>> schedule_timeout+0x2da/0x380
>> wait_for_completion+0x92/0x104
>> mlx5_cmd_exec+0x70e/0xd60
>> mlx5_load_one+0x1b4/0xad8
>> init_one+0x404/0x600
>> pci_device_probe+0x122/0x1f0
>> really_probe+0x1ac/0x348
>> __driver_attach+0xa8/0xd0
>> bus_for_each_dev+0x3c/0x74
>> bus_add_driver+0xc2/0x184
>> driver_register+0x50/0xec
>> init+0x40/0x60
>>
>> (...)
>>
>> Stack Trace:
>> __switch_to+0x0/0x94
>> __schedule+0x1da/0x8b0
>> schedule+0x26/0x6c
>> schedule_timeout+0x2da/0x380
>> wait_for_completion+0x92/0x104
>> mlx5_cmd_exec+0x70e/0xd60
>> mlx5_load_one+0x1b4/0xad8
>> init_one+0x404/0x600
>> pci_device_probe+0x122/0x1f0
>> really_probe+0x1ac/0x348
>> __driver_attach+0xa8/0xd0
>> bus_for_each_dev+0x3c/0x74
>> bus_add_driver+0xc2/0x184
>> driver_register+0x50/0xec
>> init+0x40/0x60
>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>> cause a leak of a command resource
>> mlx5_core 0000:01:00.0: enable hca failed
>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>
>> Could you give me a clue of what might be happennig?
>>
>
> Hi Joao,
>
> Looks like FW is not responding, most likely due to the DMA mask
> setting warnings, which architecture is this ?
>
>> Thanks,
>> Joao
I am working with a 32-bit ARC processor based board, connected to a prototyped
Gen4 PCI RC.
Thanks,
Joao
^ permalink raw reply
* Re: mlx5 endpoint driver problem
From: Saeed Mahameed @ 2017-05-09 17:35 UTC (permalink / raw)
To: Joao Pinto; +Cc: Saeed Mahameed, netdev
In-Reply-To: <f0b8881d-9aa3-8816-7ea6-daccc0e91262@synopsys.com>
On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
> Hello,
>
> I am making tests with a Mellanox MLX5 Endpoint, and I am getting kernel hangs
> when trying to enable the hca:
>
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> INFO: task swapper:1 blocked for more than 10 seconds.
> Not tainted 4.11.0-BETAMSIX1 #51
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> swapper D 0 1 0 0x00000000
>
> Stack Trace:
> __switch_to+0x0/0x94
> __schedule+0x1da/0x8b0
> schedule+0x26/0x6c
> schedule_timeout+0x2da/0x380
> wait_for_completion+0x92/0x104
> mlx5_cmd_exec+0x70e/0xd60
> mlx5_load_one+0x1b4/0xad8
> init_one+0x404/0x600
> pci_device_probe+0x122/0x1f0
> really_probe+0x1ac/0x348
> __driver_attach+0xa8/0xd0
> bus_for_each_dev+0x3c/0x74
> bus_add_driver+0xc2/0x184
> driver_register+0x50/0xec
> init+0x40/0x60
>
> (...)
>
> Stack Trace:
> __switch_to+0x0/0x94
> __schedule+0x1da/0x8b0
> schedule+0x26/0x6c
> schedule_timeout+0x2da/0x380
> wait_for_completion+0x92/0x104
> mlx5_cmd_exec+0x70e/0xd60
> mlx5_load_one+0x1b4/0xad8
> init_one+0x404/0x600
> pci_device_probe+0x122/0x1f0
> really_probe+0x1ac/0x348
> __driver_attach+0xa8/0xd0
> bus_for_each_dev+0x3c/0x74
> bus_add_driver+0xc2/0x184
> driver_register+0x50/0xec
> init+0x40/0x60
> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
> cause a leak of a command resource
> mlx5_core 0000:01:00.0: enable hca failed
> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
> mlx5_core: probe of 0000:01:00.0 failed with error -110
>
> Could you give me a clue of what might be happennig?
>
Hi Joao,
Looks like FW is not responding, most likely due to the DMA mask
setting warnings, which architecture is this ?
> Thanks,
> Joao
^ permalink raw reply
* Re: [PATCH net v2] driver: vrf: Fix one possible use-after-free issue
From: Florian Westphal @ 2017-05-09 17:11 UTC (permalink / raw)
To: David Ahern; +Cc: gfree.wind, shm, davem, fw, netdev
In-Reply-To: <43be9c0c-2cee-5d05-0908-53b81b4ebbba@gmail.com>
David Ahern <dsahern@gmail.com> wrote:
> On 5/9/17 3:27 AM, gfree.wind@vip.163.com wrote:
> > diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> > index ceda586..db88249 100644
> > --- a/drivers/net/vrf.c
> > +++ b/drivers/net/vrf.c
> > @@ -989,6 +989,7 @@ static u32 vrf_fib_table(const struct net_device *dev)
> >
> > static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> > {
> > + kfree_skb(skb);
> > return 0;
> > }
> >
> > @@ -998,7 +999,7 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int hook,
> > {
> > struct net *net = dev_net(dev);
> >
> > - if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
> > + if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
> > skb = NULL; /* kfree_skb(skb) handled by nf code */
> >
> > return skb;
> >
>
> I'm clearly misunderstanding something ...
>
> With the current code:
> - nf_hook returns 1, NF_HOOK invokes vrf_rcv_finish as the okfn, it
> returns 0, skb passes on.
>
> - nf_hook returns 0, vrf_rcv_finish has been called by the nf_hook tree,
> vrf_rcv_finish returns 0, skb passes on
Yes, thats a bug. The skb could be have been queued to userspace, or
stolen by a hook.
It is illegal to use the skb after NF_HOOK() no matter the return value.
The okfn has to do further processing.
If nf_hook() is used instead, only a return value of 1 means the skb is
still valid.
In < 0 case it was free'd, in 0 case its in unknown state (usually queued
or free'd).
As for the patch, it avoids skb leak on userspace reinject but nfqueue
still won't work as no reinject is possible (vrf_rcv_finish is a sink that
doesn't do anyting).
Hope this clarifies things.
^ permalink raw reply
* Requirements for a shutdown function?
From: Timur Tabi @ 2017-05-09 16:58 UTC (permalink / raw)
To: netdev
I'm trying to add a platform_driver.shutdown function to my Ethernet driver
(drivers/net/ethernet/qualcomm/emac/*), but I can't find any definitive
information as to what a network driver shutdown callback is supposed to do.
I also don't know what testcase I should use to verify that my function is
working.
I see only four instances of a platform_driver.shutdown function in
drivers/net/ethernet:
$ git grep -A 20 -w platform_driver | grep '\.shutdown'
apm/xgene-v2/main.c- .shutdown = xge_shutdown,
apm/xgene/xgene_enet_main.c- .shutdown = xgene_enet_shutdown,
marvell/mv643xx_eth.c- .shutdown = mv643xx_eth_shutdown,
marvell/pxa168_eth.c- .shutdown = pxa168_eth_shutdown,
(Other shutdown functions are for pci_driver.shutdown).
For the xgene drivers, the shutdown function just calls the 'remove'
function. Isn't that overkill? Why bother with a shutdown function if it's
just the same thing as removing the driver outright?
mv643xx_eth_shutdown() seems more reasonable. All it does is halt the TX
and RX queues.
pxa168_eth_shutdown() is a little more heavyweight: halts the queues, and
stops the DMA and calls phy_stop().
Can anyone help me figure out what my driver really should do?
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
^ permalink raw reply
* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Eric Dumazet @ 2017-05-09 16:56 UTC (permalink / raw)
To: Cong Wang
Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
Eric Dumazet
In-Reply-To: <CAM_iQpWQoLppsMfDOQWsdifGd56CAY24tCNm0Jm02QCe7ZThOw@mail.gmail.com>
On Tue, 2017-05-09 at 09:44 -0700, Cong Wang wrote:
>
> Eric, how did you produce it?
> I guess it's because of nh_dev which is the only netdevice pointer inside
> fib_info. Let me take a deeper look.
>
Nothing particular, I am using kexec to boot new kernels, and all my
attempts with your patch included demonstrated the issue.
eth0 is a bonding device, it might matter, I do not know.
We also have some tunnels, but unfortunately I can not provide a setup
that you could use on say a VM.
I can send you the .config if this can help
> >>
> >> I am assuming you are quite confident it is this change?
> >
> > At least, reverting the patch resolves the issue for me.
> >
> > Keeping fib (and their reference to netdev) is apparently too much,
> > we probably need to implement a refcount on the metrics themselves,
> > being stand alone objects.
>
> I don't disagree, just that it may need to change too much code which
> goes beyond a stable candidate.
Well, your choice, but dealing with a full blown fib and its
dependencies look fragile to me.
^ permalink raw reply
* [PATCH net-next] net: stmmac: use correct pointer when printing normal descriptor ring
From: Niklas Cassel @ 2017-05-09 16:52 UTC (permalink / raw)
To: Giuseppe Cavallaro, Alexandre Torgue; +Cc: Niklas Cassel, netdev, linux-kernel
From: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index cd8c60132390..a74c481401c4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3725,7 +3725,7 @@ static void sysfs_display_ring(void *head, int size, int extend_desc,
ep++;
} else {
seq_printf(seq, "%d [0x%x]: 0x%x 0x%x 0x%x 0x%x\n",
- i, (unsigned int)virt_to_phys(ep),
+ i, (unsigned int)virt_to_phys(p),
le32_to_cpu(p->des0), le32_to_cpu(p->des1),
le32_to_cpu(p->des2), le32_to_cpu(p->des3));
p++;
--
2.11.0
^ permalink raw reply related
* Re: [PATCH net v2] driver: vrf: Fix one possible use-after-free issue
From: David Ahern @ 2017-05-09 16:51 UTC (permalink / raw)
To: gfree.wind, shm, davem, fw, netdev
In-Reply-To: <1494325653-39885-1-git-send-email-gfree.wind@vip.163.com>
On 5/9/17 3:27 AM, gfree.wind@vip.163.com wrote:
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> index ceda586..db88249 100644
> --- a/drivers/net/vrf.c
> +++ b/drivers/net/vrf.c
> @@ -989,6 +989,7 @@ static u32 vrf_fib_table(const struct net_device *dev)
>
> static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> {
> + kfree_skb(skb);
> return 0;
> }
>
> @@ -998,7 +999,7 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int hook,
> {
> struct net *net = dev_net(dev);
>
> - if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
> + if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
> skb = NULL; /* kfree_skb(skb) handled by nf code */
>
> return skb;
>
I'm clearly misunderstanding something ...
With the current code:
- nf_hook returns 1, NF_HOOK invokes vrf_rcv_finish as the okfn, it
returns 0, skb passes on.
- nf_hook returns 0, vrf_rcv_finish has been called by the nf_hook tree,
vrf_rcv_finish returns 0, skb passes on
- nf_hook returns < 0, vrf_rcv_finish is not called, skb is freed by
netfilter code, vrf_rcv_nfhook returns NULL
What am I missing?
With the above, if nf_hook returns 1, vrf_rcv_finish is not called.
^ permalink raw reply
* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-09 16:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
Eric Dumazet
In-Reply-To: <1494296302.7796.61.camel@edumazet-glaptop3.roam.corp.google.com>
On Mon, May 8, 2017 at 7:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-05-08 at 21:22 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 08 May 2017 17:01:20 -0700
>>
>> > On Mon, 2017-05-08 at 14:35 -0400, David Miller wrote:
>> >> From: Cong Wang <xiyou.wangcong@gmail.com>
>> >> Date: Thu, 4 May 2017 14:54:17 -0700
>> >>
>> >> > IPv4 dst could use fi->fib_metrics to store metrics but fib_info
>> >> > itself is refcnt'ed, so without taking a refcnt fi and
>> >> > fi->fib_metrics could be freed while dst metrics still points to
>> >> > it. This triggers use-after-free as reported by Andrey twice.
>> >> >
>> >> > This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
>> >> > restore this reference counting. It is a quick fix for -net and
>> >> > -stable, for -net-next, as Eric suggested, we can consider doing
>> >> > reference counting for metrics itself instead of relying on fib_info.
>> >> >
>> >> > IPv6 is very different, it copies or steals the metrics from mx6_config
>> >> > in fib6_commit_metrics() so probably doesn't need a refcnt.
>> >> >
>> >> > Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
>> >> >
>> >> > Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
>> >> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
>> >> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
>> >> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> >>
>> >> Applied and queued up for -stable, thanks.
>> >
>> > Although I now have on latest net tree these messages when I reboot my
>> > test machine.
>> >
>> > [ 224.085873] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
>>
>> Strange, the refcounting looks quite OK in the patch you're quoting.
>> I looked over it a few times and cannot figure out a possible cause
>> there.
Eric, how did you produce it?
I guess it's because of nh_dev which is the only netdevice pointer inside
fib_info. Let me take a deeper look.
>>
>> I am assuming you are quite confident it is this change?
>
> At least, reverting the patch resolves the issue for me.
>
> Keeping fib (and their reference to netdev) is apparently too much,
> we probably need to implement a refcount on the metrics themselves,
> being stand alone objects.
I don't disagree, just that it may need to change too much code which
goes beyond a stable candidate.
Thanks for the bug report!
^ permalink raw reply
* mlx5 endpoint driver problem
From: Joao Pinto @ 2017-05-09 16:25 UTC (permalink / raw)
To: saeedm; +Cc: netdev
Hello,
I am making tests with a Mellanox MLX5 Endpoint, and I am getting kernel hangs
when trying to enable the hca:
mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
INFO: task swapper:1 blocked for more than 10 seconds.
Not tainted 4.11.0-BETAMSIX1 #51
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper D 0 1 0 0x00000000
Stack Trace:
__switch_to+0x0/0x94
__schedule+0x1da/0x8b0
schedule+0x26/0x6c
schedule_timeout+0x2da/0x380
wait_for_completion+0x92/0x104
mlx5_cmd_exec+0x70e/0xd60
mlx5_load_one+0x1b4/0xad8
init_one+0x404/0x600
pci_device_probe+0x122/0x1f0
really_probe+0x1ac/0x348
__driver_attach+0xa8/0xd0
bus_for_each_dev+0x3c/0x74
bus_add_driver+0xc2/0x184
driver_register+0x50/0xec
init+0x40/0x60
(...)
Stack Trace:
__switch_to+0x0/0x94
__schedule+0x1da/0x8b0
schedule+0x26/0x6c
schedule_timeout+0x2da/0x380
wait_for_completion+0x92/0x104
mlx5_cmd_exec+0x70e/0xd60
mlx5_load_one+0x1b4/0xad8
init_one+0x404/0x600
pci_device_probe+0x122/0x1f0
really_probe+0x1ac/0x348
__driver_attach+0xa8/0xd0
bus_for_each_dev+0x3c/0x74
bus_add_driver+0xc2/0x184
driver_register+0x50/0xec
init+0x40/0x60
mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
cause a leak of a command resource
mlx5_core 0000:01:00.0: enable hca failed
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
mlx5_core: probe of 0000:01:00.0 failed with error -110
Could you give me a clue of what might be happennig?
Thanks,
Joao
^ permalink raw reply
* [PATCH] netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
From: Colin King @ 2017-05-09 16:19 UTC (permalink / raw)
To: Manish Chopra, Rahul Verma, Dept-GELinuxNICDev, netdev
Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
Currently rcode is being initialized to NX_RCODE_SUCCESS and later it
is checked to see if it is not NX_RCODE_SUCCESS which is never true. It
appears that there is an unintentional missing assignment of rcode from
the return of the call to netxen_issue_cmd() that was dropped in
an earlier fix, so add it in.
Detected by CoverityScan, CID#401900 ("Logically dead code")
Fixes: 2dcd5d95ad6b2 ("netxen_nic: fix cdrp race condition")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
index b8d5270359cd..e30676515529 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
@@ -247,7 +247,7 @@ nx_fw_cmd_set_mtu(struct netxen_adapter *adapter, int mtu)
cmd.req.arg3 = 0;
if (recv_ctx->state == NX_HOST_CTX_STATE_ACTIVE)
- netxen_issue_cmd(adapter, &cmd);
+ rcode = netxen_issue_cmd(adapter, &cmd);
if (rcode != NX_RCODE_SUCCESS)
return -EIO;
--
2.11.0
^ permalink raw reply related
* Re: [PATCH 0/2] net: Set maximum receive packet size on veth interfaces
From: Stephen Hemminger @ 2017-05-09 15:49 UTC (permalink / raw)
To: Fredrik Markstrom
Cc: Eric Dumazet, David S. Miller, Alexei Starovoitov,
Daniel Borkmann, netdev, linux-kernel, bridge
In-Reply-To: <20170509124439.45674-1-fredrik.markstrom@gmail.com>
On Tue, 9 May 2017 14:44:36 +0200
Fredrik Markstrom <fredrik.markstrom@gmail.com> wrote:
> Currently veth drops all packets larger then the mtu set on the receiving
> end of the pair. This is inconsistent with most hardware ethernet drivers.
There is no guarantee that packets larger than MTU + VLAN tag will be received
by hardware drivers. So why is this necessary for veth? What is your special
use case which makes this necessary?
^ permalink raw reply
* openvswitch MTU patch needed in 4.10 stable
From: Stephen Hemminger @ 2017-05-09 15:46 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Could you queue the patch to stable?
Begin forwarded message:
Date: Tue, 09 May 2017 08:21:46 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 195695] New: openvswitch: Set internal device max mtu to ETH_MAX_MTU
https://bugzilla.kernel.org/show_bug.cgi?id=195695
Bug ID: 195695
Summary: openvswitch: Set internal device max mtu to
ETH_MAX_MTU
Product: Networking
Version: 2.5
Kernel Version: 4.10
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Other
Assignee: stephen@networkplumber.org
Reporter: endre@basis-consulting.com
Regression: No
Can not use jumbo frames with openvswitch bridge in kernel 4.10. This is fixed
in kernel 4.11:
Commit 91572088e3fd ("net: use core MTU range checking in core net
infra") changed the openvswitch internal device to use the core net
infra for controlling the MTU range, but failed to actually set the
max_mtu as described in the commit message, which now defaults to
ETH_DATA_LEN.
This patch fixes this by setting max_mtu to ETH_MAX_MTU after
ether_setup() call.
Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'net/openvswitch/vport-internal_dev.c')
-rw-r--r--
net/openvswitch/vport-internal_dev.c
2
1 files changed, 2 insertions, 0 deletions
diff --git a/net/openvswitch/vport-internal_dev.c
b/net/openvswitch/vport-internal_dev.c
index 09141a1..89193a6 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -149,6 +149,8 @@ static void do_setup(struct net_device *netdev)
{
ether_setup(netdev);
+ netdev->max_mtu = ETH_MAX_MTU;
+
netdev->netdev_ops = &internal_dev_netdev_ops;
netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
Br.
Endre Vaade
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply related
* Re: [PATCH RFC net-next 0/6] net: reducing memory footprint of network devices
From: David Ahern @ 2017-05-09 15:42 UTC (permalink / raw)
To: nicolas.dichtel, Florian Fainelli, netdev; +Cc: roopa
In-Reply-To: <1be60a5e-da39-5657-b1fe-c91266800046@6wind.com>
On 5/9/17 2:50 AM, Nicolas Dichtel wrote:
> Your initial patch tried to make those interfaces transparent, this is not the
> case anymore here. It would probably be useful to be able to filter those
> interfaces in the kernel during a dump.
The earlier email was for hidden devices; the intent there is to hide
certain devices (e.g., switch control netdevs) from user dumps by default.
Adding an attribute at create time such as IFF_INVISIBLE for such
devices would be a follow on to this set - but leveraging the same
sysctl and sysfs bypasses.
^ permalink raw reply
* Re: [PATCH net 0/5] qed*: General fixes
From: David Miller @ 2017-05-09 15:25 UTC (permalink / raw)
To: Yuval.Mintz; +Cc: netdev
In-Reply-To: <1494331671-16273-1-git-send-email-Yuval.Mintz@cavium.com>
From: Yuval Mintz <Yuval.Mintz@cavium.com>
Date: Tue, 9 May 2017 15:07:46 +0300
> This series contain several fixes for qed and qede.
>
> - #1 [and ~#5] relate to XDP cleanups
> - #2 and #5 correct VF behavior
> - #3 and #4 fix and add missing configurations needed for RoCE & storage
>
> Dave,
>
> Please consider applying the series to 'net'.
Series applied, thank you.
^ permalink raw reply
* Re: [PATCH net 0/3] mlx4 misc fixes
From: David Miller @ 2017-05-09 15:23 UTC (permalink / raw)
To: tariqt; +Cc: netdev, eranbe
In-Reply-To: <1494330324-11752-1-git-send-email-tariqt@mellanox.com>
From: Tariq Toukan <tariqt@mellanox.com>
Date: Tue, 9 May 2017 14:45:21 +0300
> This patchset contains misc bug fixes from the team
> to the mlx4 Core and Eth drivers.
>
> Series generated against net commit:
> 32f1bc0f3d26 Revert "ipv4: restore rt->fi for reference counting"
Series applied, thanks Tariq.
^ permalink raw reply
* Re: [PATCH] net: dsa: loop: Check for memory allocation failure
From: Joe Perches @ 2017-05-09 15:18 UTC (permalink / raw)
To: Florian Fainelli, Julia Lawall
Cc: David Laight, 'Christophe JAILLET', andrew@lunn.ch,
vivien.didelot@savoirfairelinux.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <f9058de7-ef9e-f6eb-751d-72ffdce512bb@gmail.com>
On Mon, 2017-05-08 at 17:35 -0700, Florian Fainelli wrote:
> On 05/08/2017 04:46 PM, Julia Lawall wrote:
> > On Mon, 8 May 2017, Joe Perches wrote:
> > > Each time -EPROBE_DEFER occurs, another set of calls to
> > > dsa_switch_alloc and dev_kzalloc also occurs.
> > >
> > > Perhaps it'd be better to do:
> > >
> > > if (ps->netdev) {
> > > devm_kfree(&devmdev->dev, ps);
> > > devm_kfree(&mdiodev->dev, ds);
> > > return -EPROBE_DEFER;
> > > }
> >
> > Is EPROBE_DEFER handled differently than other kinds of errors?
>
> In the core device driver model, yes, EPROBE_DEFER is treated
> differently than other errors because it puts the driver on a retry queue.
>
> EPROBE_DEFER is already a slow and exceptional path, and this is a
> mock-up driver, so I am not sure what value there is in trying to
> balance devm_kzalloc() with corresponding devm_kfree()...
Example code should be as correct as possible.
^ permalink raw reply
* Re: [PATCH net] ipv6: make sure dev is not NULL before call ip6_frag_reasm
From: Eric Dumazet @ 2017-05-09 15:11 UTC (permalink / raw)
To: Hangbin Liu; +Cc: netdev
In-Reply-To: <20170509134033.GB4649@leo.usersys.redhat.com>
On Tue, 2017-05-09 at 21:40 +0800, Hangbin Liu wrote:
>
> I saw we checked the dev in this function
>
> dev = skb->dev;
> if (dev) {
> fq->iif = dev->ifindex;
> skb->dev = NULL;
> }
>
> and upper caller ipv6_frag_rcv()
>
> fq = fq_find(net, fhdr->identification, &hdr->saddr, &hdr->daddr,
> skb->dev ? skb->dev->ifindex : 0, ip6_frag_ecn(hdr));
>
>
> Apologise that I did not do enough research to make sure whether skb->dev
> could be NULL or not. I will do the check recently and reply when got a
> confirmation.
If really having a NULL dev is possible, I would rather change things
this way, as your fix has side effects.
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e1da5b888cc4901711d573075f8ae4eada7f086e..6c0a2b74ba705cbe13b4e7522d958a9c3d395c29 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -77,7 +77,7 @@ static u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
static struct inet_frags ip6_frags;
static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev);
+ struct net_device *dev, struct inet6_dev *idev);
/*
* callers should be careful not to use the hash value outside the ipfrag_lock
@@ -209,6 +209,7 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src,
static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
struct frag_hdr *fhdr, int nhoff)
{
+ struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
struct sk_buff *prev, *next;
struct net_device *dev;
int offset, end, fragsize;
@@ -223,8 +224,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
if ((unsigned int)end > IPV6_MAXPLEN) {
- __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
- IPSTATS_MIB_INHDRERRORS);
+ __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
((u8 *)&fhdr->frag_off -
skb_network_header(skb)));
@@ -258,8 +258,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
/* RFC2460 says always send parameter problem in
* this case. -DaveM
*/
- __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
- IPSTATS_MIB_INHDRERRORS);
+ __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
offsetof(struct ipv6hdr, payload_len));
return -1;
@@ -354,7 +353,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
unsigned long orefdst = skb->_skb_refdst;
skb->_skb_refdst = 0UL;
- res = ip6_frag_reasm(fq, prev, dev);
+ res = ip6_frag_reasm(fq, prev, dev, idev);
skb->_skb_refdst = orefdst;
return res;
}
@@ -365,8 +364,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
discard_fq:
inet_frag_kill(&fq->q, &ip6_frags);
err:
- __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
- IPSTATS_MIB_REASMFAILS);
+ __IP6_INC_STATS(net, idev, IPSTATS_MIB_REASMFAILS);
kfree_skb(skb);
return -1;
}
@@ -381,7 +379,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
* the last and the first frames arrived and all the bits are here.
*/
static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
- struct net_device *dev)
+ struct net_device *dev, struct inet6_dev *idev)
{
struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
struct sk_buff *fp, *head = fq->q.fragments;
@@ -505,9 +503,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
skb_postpush_rcsum(head, skb_network_header(head),
skb_network_header_len(head));
- rcu_read_lock();
- __IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
- rcu_read_unlock();
+ __IP6_INC_STATS(net, idev, IPSTATS_MIB_REASMOKS);
fq->q.fragments = NULL;
fq->q.fragments_tail = NULL;
return 1;
^ permalink raw reply related
* Re: DQL and TCQ_F_CAN_BYPASS destroy performance under virtualizaiton (Was: "Re: net_sched strange in 4.11")
From: Stefan Hajnoczi @ 2017-05-09 15:11 UTC (permalink / raw)
To: Anton Ivanov; +Cc: David S. Miller, netdev, Michael S. Tsirkin, jasowang
In-Reply-To: <27ae4e1c-7c6c-14c2-f3a4-9d0b1265d034@cambridgegreys.com>
[-- Attachment #1: Type: text/plain, Size: 2995 bytes --]
On Tue, May 09, 2017 at 08:46:46AM +0100, Anton Ivanov wrote:
> I have figured it out. Two issues.
>
> 1) skb->xmit_more is hardly ever set under virtualization because the qdisc
> is usually bypassed because of TCQ_F_CAN_BYPASS. Once TCQ_F_CAN_BYPASS is
> set a virtual NIC driver is not likely see skb->xmit_more (this answers my
> "how does this work at all" question).
>
> 2) If that flag is turned off (I patched sched_generic to turn it off in
> pfifo_fast while testing), DQL keeps xmit_more from being set. If the driver
> is not DQL enabled xmit_more is never ever set. If the driver is DQL enabled
> the queue is adjusted to ensure xmit_more stops happening within 10-15 xmit
> cycles.
>
> That is plain *wrong* for virtual NICs - virtio, emulated NICs, etc. There,
> the BIG cost is telling the hypervisor that it needs to "kick" the packets.
> The cost of putting them into the vNIC buffers is negligible. You want
> xmit_more to happen - it makes between 50% and 300% (depending on vNIC
> design) difference. If there is no xmit_more the vNIC will immediately
> "kick" the hypervisor and try to signal that the packet needs to move
> straight away (as for example in virtio_net).
>
> In addition to that, the perceived line rate is proportional to this cost,
> so I am not sure that the current dql math holds. In fact, I think it does
> not - it is trying to adjust something which influences the perceived line
> rate.
>
> So - how do we turn BOTH bypass and DQL adjustment while under
> virtualization and set them to be "always qdisc" + "always xmit_more
> allowed"
>
> A.
>
> P.S. Cc-ing virtio maintainer
CCing Michael Tsirkin and Jason Wang, who are the core virtio and
virtio-net maintainers. (I maintain the vsock driver - it's unrelated
to this discussion.)
>
> A.
>
>
> On 08/05/17 08:15, Anton Ivanov wrote:
> > Hi all,
> >
> > I was revising some of my old work for UML to prepare it for submission
> > and I noticed that skb->xmit_more does not seem to be set any more.
> >
> > I traced the issue as far as net/sched/sched_generic.c
> >
> > try_bulk_dequeue_skb() is never invoked (the drivers I am working on are
> > dql enabled so that is not the problem).
> >
> > More interestingly, if I put a breakpoint and debug output into
> > dequeue_skb() around line 147 - right before the bulk: tag that skb
> > there is always NULL. ???
> >
> > Similarly, debug in pfifo_fast_dequeue shows only NULLs being dequeued.
> > Again - ???
> >
> > First and foremost, I apologize for the silly question, but how can this
> > work at all? I see the skbs showing up at the driver level, why are
> > NULLs being returned at qdisc dequeue and where do the skbs at the
> > driver level come from?
> >
> > Second, where should I look to fix it?
> >
> > A.
> >
>
>
> --
> Anton R. Ivanov
>
> Cambridge Greys Limited, England company No 10273661
> http://www.cambridgegreys.com/
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox