Netdev List
 help / color / mirror / Atom feed
* [PATCH iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths
From: Yulia Kartseva @ 2017-09-25 18:12 UTC (permalink / raw)
  To: netdev; +Cc: shemminger

Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
These are  /31, /63, /95 and /127 prefix lengths.

Example:
# tc filter add dev eth0 protocol ipv6 parent b: prio 2307 u32 match
ip6 dst face:b00f::/31
# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00f/ffffffff at 24


The correct match would be "faceb00e/fffffffe": don't count the last
bit of the 4th byte as the network prefix. With fix:

# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00e/fffffffe at 24

 tc/f_u32.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 5815be9..14b9588 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -385,8 +385,7 @@ static int parse_ip6_addr(int *argc_p, char ***argv_p,

  plen = addr.bitlen;
  for (i = 0; i < plen; i += 32) {
- /* if (((i + 31) & ~0x1F) <= plen) { */
- if (i + 31 <= plen) {
+ if (i + 31 < plen) {
  res = pack_key(sel, addr.data[i / 32],
        0xFFFFFFFF, off + 4 * (i / 32), offmask);
  if (res < 0)

^ permalink raw reply related

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Jiri Benc @ 2017-09-25 18:14 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, davem
In-Reply-To: <1506348969-6233-1-git-send-email-yi.y.yang@intel.com>

On Mon, 25 Sep 2017 22:16:09 +0800, Yi Yang wrote:
> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

The last two lines are swapped. Network header needs to be reset before
mac_len.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;

__be16 inner_proto.

> +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> +		    const struct nshhdr *src_nsh_hdr)
> +{
> +	int err;
> +
> +	err = skb_push_nsh(skb, src_nsh_hdr);
> +	if (err)
> +		return err;
> +
> +	key->eth.type = htons(ETH_P_NSH);

I wonder why you have this assignment here. The key is invalidated,
thus nothing should rely on key->eth.type. However, looking at the code
and ovs_fragment in particular, I'm not sure that's the case. Could you
please explain why it is needed? And why the reverse of it is not
needed in pop_nsh?

> +
> +	/* safe right before invalidate_flow_key */
> +	key->mac_proto = MAC_PROTO_NONE;
> +	invalidate_flow_key(key);
> +	return 0;
> +}
[...]
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nsh_hdr;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));

Whitespace nit: the sizeof should be aligned to skb_network_offset.

> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Please use the nsh_hdr function (I'm sure I already requested that in
the previous review?). It means renaming the nsh_hdr variable.

> @@ -1210,6 +1307,21 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  		case OVS_ACTION_ATTR_POP_ETH:
>  			err = pop_eth(skb, key);
>  			break;
> +
> +		case OVS_ACTION_ATTR_PUSH_NSH: {
> +			u8 buffer[NSH_HDR_MAX_LEN];
> +			struct nshhdr *nsh_hdr = (struct nshhdr *)buffer;
> +			const struct nshhdr *src_nsh_hdr = nsh_hdr;
> +
> +			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr,
> +					    NSH_HDR_MAX_LEN);
> +			err = push_nsh(skb, key, src_nsh_hdr);

Is the src_nsh_hdr variable really necessary? Cannot be nsh_hdr passed
directly to push_nsh, relying on automatic retyping to const?

By the way, nsh_hdr is a poor name for a variable, it clashes with the
nsh_hdr function. For clarity, please rename the variables in the whole
patch to something else.

> +static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +	struct nshhdr *nsh_hdr;
> +	unsigned int nh_ofs = skb_network_offset(skb);
> +	u8 version, length;
> +	int err;
> +
> +	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Again, rename the variable and use the nsh_hdr function.

> +	version = nsh_get_ver(nsh_hdr);
> +	length = nsh_hdr_len(nsh_hdr);
> +
> +	if (version != 0)
> +		return -EINVAL;
> +
> +	err = check_header(skb, nh_ofs + length);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Ditto.

> +struct ovs_key_nsh {
> +	u8 flags;
> +	u8 ttl;
> +	u8 mdtype;
> +	u8 np;
> +	__be32 path_hdr;
> +	__be32 context[NSH_MD1_CONTEXT_SIZE];
> +};

This should be:

struct ovs_key_nsh {
	struct ovs_nsh_key_base base;
	__be32 context[NSH_MD1_CONTEXT_SIZE];
};

to capture the real dependency. Implicitly depending on ovs_key_nsh
starting with the same fields as ovs_nsh_key_base will only lead to bugs
in the future.

> +int nsh_hdr_from_nlattr(const struct nlattr *attr,
> +			struct nshhdr *nsh, size_t size)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	u8 flags = 0;
> +	u8 ttl = 0;
> +	int mdlen = 0;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base =
> +				(struct ovs_nsh_key_base *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			flags = base->flags;
> +			ttl = base->ttl;
> +			nsh->np = base->np;
> +			nsh->mdtype = base->mdtype;
> +			nsh->path_hdr = base->path_hdr;
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD1: {
> +			const struct ovs_nsh_key_md1 *md1 =
> +				(struct ovs_nsh_key_md1 *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			struct nsh_md1_ctx *md1_dst = &nsh->md1;
> +
> +			mdlen = nla_len(a);
> +			memcpy(md1_dst, md1, mdlen);

Why the variable? Just memcpy to &nsh->md1.

> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD2: {
> +			const struct u8 *md2 = nla_data(a);
> +			struct nsh_md2_tlv *md2_dst = &nsh->md2;
> +
> +			mdlen = nla_len(a);
> +			memcpy(md2_dst, md2, mdlen);

Ditto.

> +int nsh_key_from_nlattr(const struct nlattr *attr,
> +			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
> +{
> +	struct nlattr *a;
> +	int rem;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base =
> +				(struct ovs_nsh_key_base *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			const struct ovs_nsh_key_base *base_mask = base + 1;
> +
> +			memcpy(nsh, base, sizeof(*base));
> +			memcpy(nsh, base_mask, sizeof(*base_mask));

The second memcpy should copy to nsh_mask, not nsh.

If you modify struct ovs_key_nsh as suggested above, this becomes a
simple:

nsh->base = *base;
nsh_mask->base = *base_mask;

I'm perfectly fine with memcpy, too, though.

> +static int nsh_key_put_from_nlattr(const struct nlattr *attr,
> +				   struct sw_flow_match *match, bool is_mask,
> +				   bool is_push_nsh, bool log)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	bool has_base = false;
> +	bool has_md1 = false;
> +	bool has_md2 = false;
> +	u8 mdtype = 0;
> +	int mdlen = 0;
> +
> +	if (unlikely(is_push_nsh && is_mask))
> +		return -EINVAL;

Wrap this in WARN_ON. (And note that you don't need unlikely with
WARN_ON.)

> +		case OVS_NSH_KEY_ATTR_MD2:
> +			if (!is_push_nsh) /* Not supported MD type 2 yet */
> +				return -ENOTSUPP;
> +
> +			has_md2 = true;
> +			mdlen = nla_len(a);
> +			if ((mdlen > NSH_CTX_HDRS_MAX_LEN) ||
> +			    (mdlen <= 0)) {
> +				WARN_ON_ONCE(1);

But here, the WARN_ON_ONCE is completely inappropriate. This condition
does not indicate a kernel bug but rather invalid data from the user
space. OVS_NLERR should be here instead.

> +		if (is_push_nsh &&
> +		    (!has_base || (!has_md1 && !has_md2))) {
> +			OVS_NLERR(
> +			    1,
> +			    "push nsh attributes are invalid for type %d.",
> +			    mdtype
> +			);

"Missing nsh base and/or metadata attribute." or something like that
would be a better error message.

> +static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
> +			     struct sk_buff *skb)
> +{
> +	struct nlattr *start;
> +	struct ovs_nsh_key_base base;
> +	struct ovs_nsh_key_md1 md1;
> +
> +	memcpy(&base, nsh, sizeof(base));
> +
> +	if (is_mask || nsh->mdtype == NSH_M_TYPE1)
> +		memcpy(md1.context, nsh->context, sizeof(md1));
> +
> +	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
> +	if (!start)
> +		return -EMSGSIZE;
> +
> +	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(base), &base))

The 'base' variable is not needed, let alone copy to it, just use
&nsh->base here.

> +		goto nla_put_failure;
> +
> +	if (is_mask || nsh->mdtype == NSH_M_TYPE1) {
> +		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1, sizeof(md1), &md1))

Likewise, no need for the copy.

> +	return ((ret != 0) ? false : true);

Too little coffee or too late in the night, right? ;-)

 Jiri

^ permalink raw reply

* Re: [PATCH net-next 7/7] nfp: flower vxlan neighbour keep-alive
From: Or Gerlitz @ 2017-09-25 18:32 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <1506335021-32024-8-git-send-email-simon.horman@netronome.com>

On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> From: John Hurley <john.hurley@netronome.com>
>
> Periodically receive messages containing the destination IPs of tunnels
> that have recently forwarded traffic. Update the neighbour entries 'used'
> value for these IPs next hop.

Are you proactively sending keep alive messages from the driver or the
fw? what's wrong with the probes sent by the kernel NUD subsystem?

In our driver we also update the used value for neighs of offloaded
tunnels, we do it based on flow counters for the offloaded tunnels
which is an evidence for activity. Any reason for you not to apply a
similar practice?

Or.

^ permalink raw reply

* Re: [PATCH net-next 2/7] nfp: compile flower vxlan tunnel metadata match fields
From: Or Gerlitz @ 2017-09-25 18:35 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <1506335021-32024-3-git-send-email-simon.horman@netronome.com>

On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> From: John Hurley <john.hurley@netronome.com>
>
> Compile ovs-tc flower vxlan metadata match fields for offloading. Only

anything in the npf kernel bits has direct relation to ovs? what?

> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
> @@ -52,8 +52,25 @@
>          BIT(FLOW_DISSECTOR_KEY_PORTS) | \
>          BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
>          BIT(FLOW_DISSECTOR_KEY_VLAN) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_KEYID) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) | \

this series takes care of IPv6 tunnels too?

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Daniel Borkmann @ 2017-09-25 18:50 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, mchan
In-Reply-To: <20170925181000.GA60144@C02RW35GFVH8.dhcp.broadcom.net>

On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
[...]
> First, thanks for this detailed description.  It was helpful to read
> along with the patches.
>
> My only concern about this area being generic is that you are now in a
> state where any bpf program must know about all the bpf programs in the
> receive pipeline before it can properly parse what is stored in the
> meta-data and add it to an skb (or perform any other action).
> Especially if each program adds it's own meta-data along the way.
>
> Maybe this isn't a big concern based on the number of users of this
> today, but it just starts to seem like a concern as there are these
> hints being passed between layers that are challenging to track due to a
> lack of a standard format for passing data between.

Btw, we do have similar kind of programmable scratch buffer also today
wrt skb cb[] that you can program from tc side, the perf ring buffer,
which doesn't have any fixed layout for the slots, or a per-cpu map
where you can transfer data between tail calls for example, then tail
calls themselves that need to coordinate, or simply mangling of packets
itself if you will, but more below to your use case ...

> The main reason I bring this up is that Michael and I had discussed and
> designed a way for drivers to communicate between each other that rx
> resources could be freed after a tx completion on an XDP_REDIRECT
> action.  Much like this code, it involved adding an new element to
> struct xdp_md that could point to the important information.  Now that
> there is a generic way to handle this, it would seem nice to be able to
> leverage it, but I'm not sure how reliable this meta-data area would be
> without the ability to mark it in some manner.
>
> For additional background, the minimum amount of data needed in the case
> Michael and I were discussing was really 2 words.  One to serve as a
> pointer to an rx_ring structure and one to have a counter to the rx
> producer entry.  This data could be acessed by the driver processing the
> tx completions and callback to the driver that received the frame off the wire
> to perform any needed processing.  (For those curious this would also require a
> new callback/netdev op to act on this data stored in the XDP buffer.)

What you describe above doesn't seem to be fitting to the use-case of
this set, meaning the area here is fully programmable out of the BPF
program, the infrastructure you're describing is some sort of means of
communication between drivers for the XDP_REDIRECT, and should be
outside of the control of the BPF program to mangle.

You could probably reuse the base infra here and make a part of that
inaccessible for the program with some sort of a fixed layout, but I
haven't seen your code yet to be able to fully judge. Intention here
is to allow for programmability within the BPF prog in a generic way,
such that based on the use-case it can be populated in specific ways
and propagated to the skb w/o having to define a fixed layout and
bloat xdp_buff all the way to an skb while still retaining all the
flexibility.

Thanks,
Daniel

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Jiri Pirko @ 2017-09-25 19:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-3-xiyou.wangcong@gmail.com>

Mon, Sep 25, 2017 at 07:13:51PM CEST, xiyou.wangcong@gmail.com wrote:
>Instead of calling u32_lookup_ht() in a loop to find
>a unused handle, just switch to idr API to allocate
>new handles. u32 filters are special as the handle
>could contain a hash table id and a key id, so we
>need two IDR to allocate each of them.
>
>Cc: Chris Mi <chrism@mellanox.com>
>Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>---

[...]

>@@ -311,19 +313,19 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> 	return u32_lookup_key(ht, handle);
> }
> 
>-static u32 gen_new_htid(struct tc_u_common *tp_c)
>+static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> {
>-	int i = 0x800;
>+	unsigned long idr_index;
>+	int err;
> 
>-	/* hgenerator only used inside rtnl lock it is safe to increment
>+	/* This is only used inside rtnl lock it is safe to increment
> 	 * without read _copy_ update semantics
> 	 */
>-	do {
>-		if (++tp_c->hgenerator == 0x7FF)
>-			tp_c->hgenerator = 1;
>-	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));
>-
>-	return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
>+	err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
>+			    1, 0x7FF, GFP_KERNEL);

Interesting, any idea why this is not 0x7FFFFFFF as well?

I wonder if we could have 0x7FFFFFFF magic defined somewhere.

Otherwise, "patchset" looks good. Thank you for taking care of this!

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next v9] openvswitch: enable NSH support
From: Eric Garver @ 2017-09-25 19:28 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, jbenc, davem
In-Reply-To: <1506348969-6233-1-git-send-email-yi.y.yang@intel.com>

On Mon, Sep 25, 2017 at 10:16:09PM +0800, Yi Yang wrote:
> v8->v9
>  - Fix build error reported by daily intel build
>    because nsh module isn't selected by openvswitch
> 
> v7->v8
>  - Rework nested value and mask for OVS_KEY_ATTR_NSH
>  - Change pop_nsh to adapt to nsh kernel module
>  - Fix many issues per comments from Jiri Benc
> 
> v6->v7
>  - Remove NSH GSO patches in v6 because Jiri Benc
>    reworked it as another patch series and they have
>    been merged.
>  - Change it to adapt to nsh kernel module added by NSH
>    GSO patch series
> 
> v5->v6
>  - Fix the rest comments for v4.
>  - Add NSH GSO support for VxLAN-gpe + NSH and
>    Eth + NSH.
> 
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>    for v4.
> 
> v3->v4
>  - Add new NSH match field ttl
>  - Update NSH header to the latest format
>    which will be final format and won't change
>    per its author's confirmation.
>  - Fix comments for v3.
> 
> v2->v3
>  - Change OVS_KEY_ATTR_NSH to nested key to handle
>    length-fixed attributes and length-variable
>    attriubte more flexibly.
>  - Remove struct ovs_action_push_nsh completely
>  - Add code to handle nested attribute for SET_MASKED
>  - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
>    to transfer NSH header data.
>  - Fix comments and coding style issues by Jiri and Eric
> 
> v1->v2
>  - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
>  - Dynamically allocate struct ovs_action_push_nsh for
>    length-variable metadata.
> 
> OVS master and 2.8 branch has merged NSH userspace
> patch series, this patch is to enable NSH support
> in kernel data path in order that OVS can support
> NSH in compat mode by porting this.
> 
> Signed-off-by: Yi Yang <yi.y.yang@intel.com>
> ---
>  include/net/nsh.h                |   3 +
>  include/uapi/linux/openvswitch.h |  29 ++++
>  net/nsh/nsh.c                    |  53 +++++++
>  net/openvswitch/Kconfig          |   1 +
>  net/openvswitch/actions.c        | 112 ++++++++++++++
>  net/openvswitch/flow.c           |  51 ++++++
>  net/openvswitch/flow.h           |  11 ++
>  net/openvswitch/flow_netlink.c   | 324 ++++++++++++++++++++++++++++++++++++++-
>  net/openvswitch/flow_netlink.h   |   5 +
>  9 files changed, 588 insertions(+), 1 deletion(-)
> 
[...]
> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 58fb827..54334ca 100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,59 @@
>  #include <net/nsh.h>
>  #include <net/tun_proto.h>
>  
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
> +{
> +	struct nshhdr *nsh_hdr;
> +	size_t length = nsh_hdr_len(src_nsh_hdr);
> +	u8 next_proto;
> +
> +	if (skb->mac_len) {
> +		next_proto = TUN_P_ETHERNET;
> +	} else {
> +		next_proto = tun_p_from_eth_p(skb->protocol);
> +		if (!next_proto)
> +			return -EAFNOSUPPORT;
> +	}
> +
> +	/* Add the NSH header */
> +	if (skb_cow_head(skb, length) < 0)
> +		return -ENOMEM;
> +
> +	skb_push(skb, length);
> +	nsh_hdr = (struct nshhdr *)(skb->data);
> +	memcpy(nsh_hdr, src_nsh_hdr, length);
> +	nsh_hdr->np = next_proto;
> +
> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

I think you mean to reset network_header before mac_len.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;
> +
> +	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
> +	if (!inner_proto)
> +		return -EAFNOSUPPORT;
> +
> +	length = nsh_hdr_len(nsh_hdr);
> +	skb_pull(skb, length);

Do you need to verify you can actually pull length bytes? I don't see
any guarantee.

> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

Reset network before mac_len.

> +	skb->protocol = inner_proto;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_pop_nsh);
> +
>  static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
>  				       netdev_features_t features)
>  {
[...]
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556..d026b85 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
[...]
> @@ -602,6 +642,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
>  	return 0;
>  }
>  
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nsh_hdr;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));

This calls pskb_may_pull(), but you're not pulling any data here.

> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
> +
> +	flags = nsh_get_flags(nsh_hdr);
> +	flags = OVS_MASKED(flags, key.flags, mask.flags);
> +	flow_key->nsh.flags = flags;
> +	ttl = nsh_get_ttl(nsh_hdr);
> +	ttl = OVS_MASKED(ttl, key.ttl, mask.ttl);
> +	flow_key->nsh.ttl = ttl;
> +	nsh_set_flags_and_ttl(nsh_hdr, flags, ttl);
> +	nsh_hdr->path_hdr = OVS_MASKED(nsh_hdr->path_hdr, key.path_hdr,
> +				       mask.path_hdr);
> +	flow_key->nsh.path_hdr = nsh_hdr->path_hdr;
> +	switch (nsh_hdr->mdtype) {
> +	case NSH_M_TYPE1:
> +		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
> +			nsh_hdr->md1.context[i] =
> +			    OVS_MASKED(nsh_hdr->md1.context[i], key.context[i],
> +				       mask.context[i]);
> +		}
> +		memcpy(flow_key->nsh.context, nsh_hdr->md1.context,
> +		       sizeof(nsh_hdr->md1.context));
> +		break;
> +	case NSH_M_TYPE2:
> +		memset(flow_key->nsh.context, 0,
> +		       sizeof(flow_key->nsh.context));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /* Must follow skb_ensure_writable() since that can move the skb data. */
>  static void set_tp_port(struct sk_buff *skb, __be16 *port,
>  			__be16 new_port, __sum16 *check)
[...]

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: John Fastabend @ 2017-09-25 19:47 UTC (permalink / raw)
  To: Daniel Borkmann, Andy Gospodarek
  Cc: davem, alexei.starovoitov, peter.waskiewicz.jr, jakub.kicinski,
	netdev, mchan
In-Reply-To: <59C94FF4.8070900@iogearbox.net>

On 09/25/2017 11:50 AM, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
>> First, thanks for this detailed description.  It was helpful to read
>> along with the patches.
>>
>> My only concern about this area being generic is that you are now in a
>> state where any bpf program must know about all the bpf programs in the
>> receive pipeline before it can properly parse what is stored in the
>> meta-data and add it to an skb (or perform any other action).
>> Especially if each program adds it's own meta-data along the way.
>>
>> Maybe this isn't a big concern based on the number of users of this
>> today, but it just starts to seem like a concern as there are these
>> hints being passed between layers that are challenging to track due to a
>> lack of a standard format for passing data between.
> 
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
> 
>> The main reason I bring this up is that Michael and I had discussed and
>> designed a way for drivers to communicate between each other that rx
>> resources could be freed after a tx completion on an XDP_REDIRECT
>> action.  Much like this code, it involved adding an new element to
>> struct xdp_md that could point to the important information.  Now that
>> there is a generic way to handle this, it would seem nice to be able to
>> leverage it, but I'm not sure how reliable this meta-data area would be
>> without the ability to mark it in some manner.
>>
>> For additional background, the minimum amount of data needed in the case
>> Michael and I were discussing was really 2 words.  One to serve as a
>> pointer to an rx_ring structure and one to have a counter to the rx
>> producer entry.  This data could be acessed by the driver processing the
>> tx completions and callback to the driver that received the frame off
>> the wire
>> to perform any needed processing.  (For those curious this would also
>> require a
>> new callback/netdev op to act on this data stored in the XDP buffer.)
> 
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.
> 
> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.
> 
> Thanks,
> Daniel

Hi Andy,

I'm guessing this data needs to be passed from the input dev to the
output dev based on your description. If the driver data
is pushed after the BPF program is run but before the xdp_do_flush_map
call no other BPF programs can be run on that xdp_buff. It
should be safe at this point to use the metadata region directly
from the driver. We would just need to add a few helpers for the
drivers to use for this maybe, xdp_metadata_write_drv,
xdp_meadata_read_drv. I think this would work for your use case?
The data structure would have to be agreed upon by all the drivers
but would not be UAPI because it would only be exposed in the
driver. So we would be free to change/update it as needed.

Thanks,
John

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Miller @ 2017-09-25 20:11 UTC (permalink / raw)
  To: dsahern; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <ce328dfd-0b54-883a-36a8-91ec344f86ad@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 25 Sep 2017 10:14:23 -0600

> I made a simple program this morning and ran it under perf.

If possible please submit this for selftests.

Thank you.

^ permalink raw reply

* Re: [PATCH] r8152: add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-25 20:17 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Grant Grundler, Hayes Wang, David S . Miller, LKML, linux-usb,
	netdev
In-Reply-To: <1506351074.7827.13.camel@suse.com>

[grrhmail...sorry! resending as plain text]

Hallo Oliver!

On Mon, Sep 25, 2017 at 7:51 AM, Oliver Neukum <oneukum@suse.com> wrote:
> Am Freitag, den 22.09.2017, 12:06 -0700 schrieb Grant Grundler:
> > This Linksys dongle by default comes up in cdc_ether mode.
> > This patch allows r8152 to claim the device:
> >    Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Hi,
>
> have you tested this in case cdc_ether is for some reason already
> loaded?

I did not consider testing this case since it's not possible on a
normal chromeos system (the entire root file system is signed for
normal users and get's rebooted after an update).  I could test this
in developer mode of course.

Did you expect both driver probe routines to claim the device and
wreak havoc with the device?

> The patch seems to enable r8152 but does not disable cdc_ether.

Correct. r8152 happens to claim the device before cdc_ether does - I
thought because cdc_ether is a class driver and only gets picked up
after vendor specific drivers are probed.  Is that correct?

I didn't realize cdc_ether has a blacklist to make sure
RTL8152|RTL8153 devices are not picked up by cdc_ether. Would you
prefer I add this device to the blacklist in the same patch?

cheers,
grant

>
>         Regards
>                 Oliver
>

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-25 20:18 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <3f824b0e-65f9-c69c-5421-2c5f6b349b09@redhat.com>

On 09/22/2017 12:03 AM, Jason Wang wrote:
> 
> 
> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>> patch goes even further which disables rx polling during processing tx.
>>> Please try it to see if it has any difference.
>> Unfortunately, this patch doesn't seem to have made a difference.  I
>> tried runs with both this patch and the previous patch applied, as well
>> as only this patch applied for comparison (numbers from vhost thread of
>> sending VM):
>>
>> 4.12    4.13     patch1   patch2   patch1+2
>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>
>> In each case, the regression in throughput was still present.
> 
> This probably means some other cases of the wakeups were missed. Could
> you please record the callers of __wake_up_sync_key()?
> 

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key
(for both sender and server vhost threads) shows the following stack trace:

     vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
<-sock_def_readable
     vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter

>>
>>> And two questions:
>>> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
>> Verified that the second set of guests are not actually required, I can
>> see the regression with only 2 VMs.
>>
>>> - Can enable batching in the tap of sending VM improve the performance
>>> (ethtool -C $tap rx-frames 64)
>> I tried this, but it did not help (actually seemed to make things a
>> little worse)
>>
> 
>  I still can't see a reason that can lead more wakeups, will take more
> time to look at this issue and keep you posted.
> 
> Thanks
> 

^ permalink raw reply

* Re: [PATCH v2 1/4] net: af_packet: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-1-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:02 +0530

> Use setup_timer function instead of initializing timer with the
> function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 2/4] net: nfc: hci: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-2-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:03 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 3/4] net: nfc: hci: llc_shdlc: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-3-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:04 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 4/4] net: nfc: llcp_core: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-4-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:05 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
From: Paolo Abeni @ 2017-09-25 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform this tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follow:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packet to the reuseport socket with the same id of the ingress
CPU - we have 1 to 1 mapping between the ingress receive queue and the
destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020                3663                21
3               4352                5179                19
4               5318                6194                16
5               6258                7583                21
6               7376                8558                16

[ipv6]
2               2446                3949                61
3               3099                5092                64
4               3698                6611                78
5               4382                7852                79
6               5116                8851                73

Sone notes:

- figures obtained with: 

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottle-neck in
ip6_pol_route(), I see a lot of contention on a dst refcount, so
without early demux the performances do not scale well there.

- For maximum performances BH and user space sink need to run on
difference CPUs - yes we have some more cacheline misses and a little
contention on the receive queue spin lock, but a lot less icache misses
and more CPU cycles available, the overall tput is a lot higher than
binding on the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket
From: Thomas Gleixner @ 2017-09-25 20:42 UTC (permalink / raw)
  To: Vallish Vaidyeshwara
  Cc: Eric Dumazet, Eduardo Valentin, David Miller, dwmw2, shuah,
	richardcochran, xiyou.wangcong, netdev, linux-kernel, anchalag,
	dwmw
In-Reply-To: <20170920224816.GA73561@amazon.com>

On Wed, 20 Sep 2017, Vallish Vaidyeshwara wrote:
> On Sat, Sep 16, 2017 at 11:47:56AM +0200, Thomas Gleixner wrote:
> > > So if we need to replace all 'legacy' timers to high resolution timer,
> > > because some application was _relying_ on jiffies being kind of precise,
> > > maybe it is better to revert the change done on legacy timers.
> > 
> > Which would be a major step back in terms of timer performance and system
> > disturbance caused by massive recascading operations.
> > 
> > > Or continue the migration and make them use high res internally.
> > > 
> > > select() and poll() are the standard way to have precise timeouts,
> > > it is silly we have to maintain a timeout handling in the datagram fast
> > > path.
> > 
> > A few years ago we switched select/poll over to use hrtimers because the
> > wheel timers were too inaccurate for some operations, so it feels
> > consequent to switch the timeout in the datagram rcv path over as well. I
> > agree that the whole timeout magic there feels silly, but unfortunately
> > it's a documented property of sockets.
> > 
> 
> Thanks for your comments. This patch has been NACK'ed by David Miller. Is
> there any other approach to solve this problem with out application code
> being recompiled?

We have only three options here:

   1) Do a massive revert of the timer wheel changes and lose all the
      benefits of that rework.

   2) Make that timer list -> hrtimer change in the datagram code

   3) Ignore it

#1 Would be pretty ironic as networking would take the biggest penalty of
   the revert.

#2 Is IMO the proper solution as it cures a user space visible regression,
   though the patch itself could be made way simpler

#3 Shrug

Dave, Eric?

Thanks,

	tglx

^ permalink raw reply

* [PATCH net-next] net: ipv6: send NS for DAD when link operationally up
From: Mike Manning @ 2017-09-25 21:01 UTC (permalink / raw)
  To: netdev

The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.

Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/ipv6/addrconf.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78..dffbf3b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -303,10 +303,10 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.disable_policy		= 0,
 };
 
-/* Check if a valid qdisc is available */
-static inline bool addrconf_qdisc_ok(const struct net_device *dev)
+/* Check if link is ready: is it up and is a valid qdisc available */
+static inline bool addrconf_link_ready(const struct net_device *dev)
 {
-	return !qdisc_tx_is_noop(dev);
+	return netif_oper_up(dev) && !qdisc_tx_is_noop(dev);
 }
 
 static void addrconf_del_rs_timer(struct inet6_dev *idev)
@@ -451,7 +451,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 
 	ndev->token = in6addr_any;
 
-	if (netif_running(dev) && addrconf_qdisc_ok(dev))
+	if (netif_running(dev) && addrconf_link_ready(dev))
 		ndev->if_flags |= IF_READY;
 
 	ipv6_mc_init_dev(ndev);
@@ -3393,7 +3393,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 			/* restore routes for permanent addresses */
 			addrconf_permanent_addr(dev);
 
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is not ready yet. */
 				pr_info("ADDRCONF(NETDEV_UP): %s: link is not ready\n",
 					dev->name);
@@ -3408,7 +3408,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				run_pending = 1;
 			}
 		} else if (event == NETDEV_CHANGE) {
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is still not ready. */
 				break;
 			}
-- 
2.1.4

^ permalink raw reply related

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 21:16 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-1-xiyou.wangcong@gmail.com>

On 09/25/2017 07:13 PM, Cong Wang wrote:
> Instead of calling cls_bpf_get() in a loop to find
> a unused handle, just switch to idr API to allocate
> new handles.
>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Chris Mi <chrism@mellanox.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
[...]
> @@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		}
>   	}
>
> -	if (handle == 0)
> -		prog->handle = cls_bpf_grab_new_handle(tp, head);
> -	else
> +	if (handle == 0) {
> +		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +				    1, 0x7FFFFFFF, GFP_KERNEL);
> +		if (ret)
> +			goto errout;
> +		prog->handle = idr_index;
> +	} else {
> +		if (!oldprog) {
> +			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +					    handle, handle + 1, GFP_KERNEL);
> +			if (ret)
> +				goto errout;
> +		}
>   		prog->handle = handle;
> -	if (prog->handle == 0) {
> -		ret = -EINVAL;
> -		goto errout;
>   	}
>
>   	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
>   	if (ret < 0)
> -		goto errout;
> +		goto errout_idr;
>
>   	ret = cls_bpf_offload(tp, prog, oldprog);
>   	if (ret) {
> +		if (!oldprog)
> +			idr_remove_ext(&head->handle_idr, prog->handle);

Shouldn't we also call idr_remove_ext() when there was an
oldprog, but we didn't care about reusing the same handle,
so it was handle == 0 initially?

There's this condition in the code before above idr allocations,
I think also in other classifiers:

         if (oldprog) {
                 if (handle && oldprog->handle != handle) {
                         ret = -EINVAL;
                         goto errout;
                 }
         }

>   		__cls_bpf_delete_prog(prog);
>   		return ret;
>   	}
> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>
>   	if (oldprog) {
> +		idr_replace_ext(&head->handle_idr, prog, handle);

And here, we should probably use prog->handle for the above
mentioned case as well, no?

Would be great if all this (and e.g. the fact that we use idr itself)
could optionally be hidden behind some handle generator api given
we could reuse that api also for cls_basic and cls_u32. Could also
be followed-up perhaps.

>   		list_replace_rcu(&oldprog->link, &prog->link);
>   		tcf_unbind_filter(tp, &oldprog->res);
>   		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
> @@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   	*arg = prog;
>   	return 0;
>
> +errout_idr:
> +	if (!oldprog)
> +		idr_remove_ext(&head->handle_idr, prog->handle);

(Likewise as the failing cls_bpf_offload().)

>   errout:
>   	tcf_exts_destroy(&prog->exts);
>   	kfree(prog);
>

^ permalink raw reply

* WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
From: Richard Weinberger @ 2017-09-25 21:23 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linux-kernel

Hi!

While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to trigger 
this splat:

[  297.629773] WARNING: kernel stack frame pointer at ffff880156a5fea0 in 
bash:2103 has bad value 00007ffec7d87e50
[  297.629777] unwind stack type:0 next_sp:          (null) mask:0x6 
graph_idx:0
[  297.629783] ffff88015b207ae0: ffff88015b207b68 (0xffff88015b207b68)
[  297.629790] ffff88015b207ae8: ffffffffb163c00e (__save_stack_trace+0x6e/
0xd0)
[  297.629792] ffff88015b207af0: 0000000000000000 ...
[  297.629795] ffff88015b207af8: ffff880156a58000 (0xffff880156a58000)
[  297.629799] ffff88015b207b00: ffff880156a60000 (0xffff880156a60000)
[  297.629800] ffff88015b207b08: 0000000000000000 ...
[  297.629803] ffff88015b207b10: 0000000000000006 (0x6)
[  297.629806] ffff88015b207b18: ffff880151b02700 (0xffff880151b02700)
[  297.629809] ffff88015b207b20: 0000010100000000 (0x10100000000)
[  297.629812] ffff88015b207b28: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.629815] ffff88015b207b30: ffff88015b207ae0 (0xffff88015b207ae0)
[  297.629818] ffff88015b207b38: ffffffffc0050282 (0xffffffffc0050282)
[  297.629819] ffff88015b207b40: 0000000000000000 ...
[  297.629822] ffff88015b207b48: 0000000001000000 (0x1000000)
[  297.629825] ffff88015b207b50: ffff880157b98280 (0xffff880157b98280)
[  297.629828] ffff88015b207b58: ffff880157b98380 (0xffff880157b98380)
[  297.629831] ffff88015b207b60: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.629834] ffff88015b207b68: ffff88015b207b78 (0xffff88015b207b78)
[  297.629838] ffff88015b207b70: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629841] ffff88015b207b78: ffff88015b207da8 (0xffff88015b207da8)
[  297.629847] ffff88015b207b80: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629850] ffff88015b207b88: 000000400000000c (0x400000000c)
[  297.629852] ffff88015b207b90: ffff88015b207ba0 (0xffff88015b207ba0)
[  297.629855] ffff88015b207b98: ffff880100000000 (0xffff880100000000)
[  297.629859] ffff88015b207ba0: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629864] ffff88015b207ba8: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629868] ffff88015b207bb0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.629873] ffff88015b207bb8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.629879] ffff88015b207bc0: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.629886] ffff88015b207bc8: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.629892] ffff88015b207bd0: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.629897] ffff88015b207bd8: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.629902] ffff88015b207be0: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.629907] ffff88015b207be8: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.629913] ffff88015b207bf0: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.629916] ffff88015b207bf8: ffffffffc0050282 (0xffffffffc0050282)
[  297.629918] ffff88015b207c00: 0000000000000000 ...
[  297.629921] ffff88015b207c08: ffff88015a77e24c (0xffff88015a77e24c)
[  297.629924] ffff88015b207c10: ffff88015b207c38 (0xffff88015b207c38)
[  297.629927] ffff88015b207c18: ffff88015b207c38 (0xffff88015b207c38)
[  297.629929] ffff88015b207c20: 0000000000000086 (0x86)
[  297.629932] ffff88015b207c28: ffff88015a77db00 (0xffff88015a77db00)
[  297.629935] ffff88015b207c30: 1ffff1002b640f91 (0x1ffff1002b640f91)
[  297.629938] ffff88015b207c38: ffff88015b207d10 (0xffff88015b207d10)
[  297.629945] ffff88015b207c40: ffffffffb16c9f60 (try_to_wake_up+0xb0/0x710)
[  297.629947] ffff88015b207c48: 0000000000000000 ...
[  297.629952] ffff88015b207c50: ffffffffb2dfd3c0 (machine_ops+0x40/0x40)
[  297.629954] ffff88015b207c58: ffff88015a77df94 (0xffff88015a77df94)
[  297.629957] ffff88015b207c60: 0000000000023540 (0x23540)
[  297.629960] ffff88015b207c68: ffff88015b215c38 (0xffff88015b215c38)
[  297.629963] ffff88015b207c70: ffff88015b200000 (0xffff88015b200000)
[  297.629965] ffff88015b207c78: 0000000000000086 (0x86)
[  297.629968] ffff88015b207c80: 0000000100000000 (0x100000000)
[  297.629971] ffff88015b207c88: 0000000041b58ab3 (0x41b58ab3)
[  297.629975] ffff88015b207c90: ffffffffb2d919f2 (.LC2+0x6e0e/0x83b5)
[  297.629981] ffff88015b207c98: ffffffffb16c9eb0 (migrate_swap_stop
+0x2e0/0x2e0)
[  297.629986] ffff88015b207ca0: ffffffffb16d0f73 (account_entity_dequeue
+0x73/0x110)
[  297.629989] ffff88015b207ca8: 0000000000100000 (0x100000)
[  297.629992] ffff88015b207cb0: ffff88015b2235a0 (0xffff88015b2235a0)
[  297.629994] ffff88015b207cb8: ffff88015061e280 (0xffff88015061e280)
[  297.629997] ffff88015b207cc0: ffff88015b207ce8 (0xffff88015b207ce8)
[  297.630003] ffff88015b207cc8: ffffffffb16c87ed (sched_avg_update+0x2d/0x90)
[  297.630005] ffff88015b207cd0: 0000000000000005 (0x5)
[  297.630008] ffff88015b207cd8: ffff88015b223570 (0xffff88015b223570)
[  297.630010] ffff88015b207ce0: 00000000000000dd (0xdd)
[  297.630013] ffff88015b207ce8: ffff88015a017ea0 (0xffff88015a017ea0)
[  297.630021] ffff88015b207cf0: ffffffffb30b7128 (rcu_sched_state
+0x928/0xaa0)
[  297.630024] ffff88015b207cf8: ffff880151b02700 (0xffff880151b02700)
[  297.630026] ffff88015b207d00: 0000000000000001 (0x1)
[  297.630031] ffff88015b207d08: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630034] ffff88015b207d10: ffff88015b207d20 (0xffff88015b207d20)
[  297.630040] ffff88015b207d18: ffffffffb16ca5d0 (wake_up_process+0x10/0x20)
[  297.630043] ffff88015b207d20: ffff88015b207d48 (0xffff88015b207d48)
[  297.630045] ffff88015b207d28: ffff88015b207d48 (0xffff88015b207d48)
[  297.630048] ffff88015b207d30: 0000000000000202 (0x202)
[  297.630053] ffff88015b207d38: ffffffffb30b7120 (rcu_sched_state
+0x920/0xaa0)
[  297.630056] ffff88015b207d40: 0000000000000202 (0x202)
[  297.630059] ffff88015b207d48: ffff88015b207d68 (0xffff88015b207d68)
[  297.630063] ffff88015b207d50: ffffffffb16ee225 (swake_up+0x25/0x30)
[  297.630069] ffff88015b207d58: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630072] ffff88015b207d60: ffff88015a77db00 (0xffff88015a77db00)
[  297.630074] ffff88015b207d68: ffff88015b207d90 (0xffff88015b207d90)
[  297.630079] ffff88015b207d70: ffffffffb1720016 (rcu_gp_kthread_wake
+0x56/0x60)
[  297.630082] ffff88015b207d78: 0000000000000002 (0x2)
[  297.630087] ffff88015b207d80: ffffffffb30b7138 (rcu_sched_state
+0x938/0xaa0)
[  297.630092] ffff88015b207d88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630095] ffff88015b207d90: ffff88015b207e18 (0xffff88015b207e18)
[  297.630099] ffff88015b207d98: ffffffffb1720521 (rcu_report_qs_rnp
+0x2f1/0x310)
[  297.630102] ffff88015b207da0: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.630105] ffff88015b207da8: ffff88015b207dd0 (0xffff88015b207dd0)
[  297.630110] ffff88015b207db0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.630113] ffff88015b207db8: ffff880157b98280 (0xffff880157b98280)
[  297.630116] ffff88015b207dc0: ffffea00055ee600 (0xffffea00055ee600)
[  297.630121] ffff88015b207dc8: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630124] ffff88015b207dd0: ffff88015b207e00 (0xffff88015b207e00)
[  297.630128] ffff88015b207dd8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.630131] ffff88015b207de0: ffff880157b98280 (0xffff880157b98280)
[  297.630135] ffff88015b207de8: ffffffffb18b7e60 (get_max_files+0x10/0x10)
[  297.630141] ffff88015b207df0: ffffffffb30b72a0 (rcu_sched_state
+0xaa0/0xaa0)
[  297.630143] ffff88015b207df8: 000000000000000f (0xf)
[  297.630146] ffff88015b207e00: ffff88015b207e18 (0xffff88015b207e18)
[  297.630150] ffff88015b207e08: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630153] ffff88015b207e10: ffff880157b98280 (0xffff880157b98280)
[  297.630156] ffff88015b207e18: ffff88015b207f30 (0xffff88015b207f30)
[  297.630161] ffff88015b207e20: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.630164] ffff88015b207e28: ffff88015b21b000 (0xffff88015b21b000)
[  297.630167] ffff88015b207e30: ffff88015b21b070 (0xffff88015b21b070)
[  297.630170] ffff88015b207e38: 1ffff1002b640fd5 (0x1ffff1002b640fd5)
[  297.630173] ffff88015b207e40: ffff880151b02700 (0xffff880151b02700)
[  297.630176] ffff88015b207e48: ffff88015b224200 (0xffff88015b224200)
[  297.630178] ffff88015b207e50: ffff88015b224280 (0xffff88015b224280)
[  297.630181] ffff88015b207e58: ffff88015b2242b0 (0xffff88015b2242b0)
[  297.630184] ffff88015b207e60: ffff88015b207f08 (0xffff88015b207f08)
[  297.630187] ffff88015b207e68: ffff880151b0274c (0xffff880151b0274c)
[  297.630190] ffff88015b207e70: ffff880151b02700 (0xffff880151b02700)
[  297.630195] ffff88015b207e78: ffffffffb30b7258 (rcu_sched_state
+0xa58/0xaa0)
[  297.630198] ffff88015b207e80: ffff880157b98288 (0xffff880157b98288)
[  297.630203] ffff88015b207e88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630206] ffff88015b207e90: ffff88015b224238 (0xffff88015b224238)
[  297.630209] ffff88015b207e98: ffff88015b207ec8 (0xffff88015b207ec8)
[  297.630211] ffff88015b207ea0: 000000000000000a (0xa)
[  297.630214] ffff88015b207ea8: 0000000041b58ab3 (0x41b58ab3)
[  297.630218] ffff88015b207eb0: ffffffffb2d944f5 (.LC0+0x155c/0xa3a6)
[  297.630223] ffff88015b207eb8: ffffffffb1725530 (note_gp_changes+0xe0/0xe0)
[  297.630226] ffff88015b207ec0: ffff88015b215740 (0xffff88015b215740)
[  297.630229] ffff88015b207ec8: ffff880157b983c0 (0xffff880157b983c0)
[  297.630231] ffff88015b207ed0: ffff88014ac19eb0 (0xffff88014ac19eb0)
[  297.630234] ffff88015b207ed8: ffffffffffffffff (0xffffffffffffffff)
[  297.630236] ffff88015b207ee0: 0000000000000000 ...
[  297.630239] ffff88015b207ee8: 0000004552dda1c0 (0x4552dda1c0)
[  297.630240] ffff88015b207ef0: 0000000000000000 ...
[  297.630243] ffff88015b207ef8: ffff88015b207f20 (0xffff88015b207f20)
[  297.630249] ffff88015b207f00: ffffffffb174a0a8 (tick_program_event
+0x48/0x80)
[  297.630252] ffff88015b207f08: 0000000000000009 (0x9)
[  297.630259] ffff88015b207f10: ffffffffb3009148 (softirq_vec+0x48/0x80)
[  297.630261] ffff88015b207f18: 0000000000000009 (0x9)
[  297.630263] ffff88015b207f20: 0000000000000008 (0x8)
[  297.630265] ffff88015b207f28: 0000000000000009 (0x9)
[  297.630268] ffff88015b207f30: ffff88015b207fa8 (0xffff88015b207fa8)
[  297.630273] ffff88015b207f38: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.630276] ffff88015b207f40: 0000000a00404100 (0xa00404100)
[  297.630279] ffff88015b207f48: ffff880151b02700 (0xffff880151b02700)
[  297.630282] ffff88015b207f50: 00000000fffff730 (0xfffff730)
[  297.630284] ffff88015b207f58: 0000000000000009 (0x9)
[  297.630286] ffff88015b207f60: 0000000000000040 (0x40)
[  297.630289] ffff88015b207f68: 000001005b21c294 (0x1005b21c294)
[  297.630294] ffff88015b207f70: ffffffffb3009110 (softirq_vec+0x10/0x80)
[  297.630297] ffff88015b207f78: 0000008000000008 (0x8000000008)
[  297.630300] ffff88015b207f80: ffff88015a77ce00 (0xffff88015a77ce00)
[  297.630303] ffff88015b207f88: ffff88015b215840 (0xffff88015b215840)
[  297.630304] ffff88015b207f90: 0000000000000000 ...
[  297.630307] ffff88015b207f98: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630311] ffff88015b207fa0: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630314] ffff88015b207fa8: ffff88015b207fc0 (0xffff88015b207fc0)
[  297.630318] ffff88015b207fb0: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.630321] ffff88015b207fb8: ffff88015b215740 (0xffff88015b215740)
[  297.630324] ffff88015b207fc0: ffff88015b207fe8 (0xffff88015b207fe8)
[  297.630329] ffff88015b207fc8: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.630332] ffff88015b207fd0: ffff88015104d500 (0xffff88015104d500)
[  297.630335] ffff88015b207fd8: ffff88015b215840 (0xffff88015b215840)
[  297.630338] ffff88015b207fe0: 0000000000000246 (0x246)
[  297.630341] ffff88015b207fe8: ffff880156a5fdc9 (0xffff880156a5fdc9)
[  297.630345] ffff88015b207ff0: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.630348] ffff88015b207ff8: ffff880156a5fdc8 (0xffff880156a5fdc8)
[  297.630352] ffff880156a5fdc8: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630355] ffff880156a5fdd0: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630357] ffff880156a5fdd8: 0000000000000246 (0x246)
[  297.630360] ffff880156a5fde0: ffff88015b215840 (0xffff88015b215840)
[  297.630363] ffff880156a5fde8: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.630366] ffff880156a5fdf0: ffff88015104d500 (0xffff88015104d500)
[  297.630369] ffff880156a5fdf8: fffff52000140c08 (0xfffff52000140c08)
[  297.630372] ffff880156a5fe00: ffffc90000a0603f (0xffffc90000a0603f)
[  297.630375] ffff880156a5fe08: fffff52000140c07 (0xfffff52000140c07)
[  297.630378] ffff880156a5fe10: fffff52000140c08 (0xfffff52000140c08)
[  297.630379] ffff880156a5fe18: 0000000000000000 ...
[  297.630385] ffff880156a5fe20: ffffffffb178d9eb (opt_pre_handler+0x6b/0x80)
[  297.630388] ffff880156a5fe28: dffffc0000000000 (0xdffffc0000000000)
[  297.630391] ffff880156a5fe30: dffffc0000000000 (0xdffffc0000000000)
[  297.630393] ffff880156a5fe38: 0000000000000246 (0x246)
[  297.630396] ffff880156a5fe40: ffffffffffffff10 (0xffffffffffffff10)
[  297.630401] ffff880156a5fe48: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.630404] ffff880156a5fe50: 0000000000000010 (0x10)
[  297.630406] ffff880156a5fe58: 0000000000000246 (0x246)
[  297.630409] ffff880156a5fe60: ffff880156a5fe78 (0xffff880156a5fe78)
[  297.630412] ffff880156a5fe68: 0000000000000018 (0x18)
[  297.630414] ffff880156a5fe70: 0000000000000246 (0x246)
[  297.630417] ffff880156a5fe78: 00000000026aed08 (0x26aed08)
[  297.630419] ffff880156a5fe80: 0000000000000005 (0x5)
[  297.630421] ffff880156a5fe88: 0000000000000003 (0x3)
[  297.630423] ffff880156a5fe90: 0000000000000000 ...
[  297.630425] ffff880156a5fe98: 00000000025d1568 (0x25d1568)
[  297.630428] ffff880156a5fea0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630431] ffff880156a5fea8: ffffffffc0050282 (0xffffffffc0050282)
[  297.630433] ffff880156a5feb0: 00000000025d1568 (0x25d1568)
[  297.630435] ffff880156a5feb8: 0000000000000000 ...
[  297.630437] ffff880156a5fec0: 0000000000000003 (0x3)
[  297.630440] ffff880156a5fec8: 0000000000000005 (0x5)
[  297.630442] ffff880156a5fed0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630445] ffff880156a5fed8: 00000000026aed08 (0x26aed08)
[  297.630448] ffff880156a5fee0: ffff880151b02700 (0xffff880151b02700)
[  297.630450] ffff880156a5fee8: 0000000002675e00 (0x2675e00)
[  297.630453] ffff880156a5fef0: 0000000000000001 (0x1)
[  297.630455] ffff880156a5fef8: 0000000000000002 (0x2)
[  297.630457] ffff880156a5ff00: 0000000000000002 (0x2)
[  297.630460] ffff880156a5ff08: 0000000002675e00 (0x2675e00)
[  297.630462] ffff880156a5ff10: 0000000000000180 (0x180)
[  297.630464] ffff880156a5ff18: 0000000000000000 ...
[  297.630466] ffff880156a5ff20: 000000000272a008 (0x272a008)
[  297.630469] ffff880156a5ff28: ffffffffffffffff (0xffffffffffffffff)
[  297.630473] ffff880156a5ff30: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630475] ffff880156a5ff38: 0000000000000010 (0x10)
[  297.630478] ffff880156a5ff40: 0000000000000293 (0x293)
[  297.630481] ffff880156a5ff48: ffff880156a5ff50 (0xffff880156a5ff50)
[  297.630485] ffff880156a5ff50: ffffffffb1665770 (copy_oldmem_page+0x90/0x90)
[  297.630488] ffff880156a5ff58: 00000000025d1b28 (0x25d1b28)
[  297.630489] ffff880156a5ff60: 0000000000000000 ...
[  297.630492] ffff880156a5ff68: 0000000000000003 (0x3)
[  297.630494] ffff880156a5ff70: 0000000000000005 (0x5)
[  297.630497] ffff880156a5ff78: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630499] ffff880156a5ff80: 00000000026aed08 (0x26aed08)
[  297.630502] ffff880156a5ff88: 0000000000000246 (0x246)
[  297.630504] ffff880156a5ff90: 0000000002675e00 (0x2675e00)
[  297.630506] ffff880156a5ff98: 0000000000000001 (0x1)
[  297.630509] ffff880156a5ffa0: 0000000000000002 (0x2)
[  297.630511] ffff880156a5ffa8: ffffffffffffffda (0xffffffffffffffda)
[  297.630514] ffff880156a5ffb0: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630517] ffff880156a5ffb8: 0000000000000180 (0x180)
[  297.630518] ffff880156a5ffc0: 0000000000000000 ...
[  297.630521] ffff880156a5ffc8: 000000000272a008 (0x272a008)
[  297.630523] ffff880156a5ffd0: 0000000000000002 (0x2)
[  297.630526] ffff880156a5ffd8: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630528] ffff880156a5ffe0: 0000000000000033 (0x33)
[  297.630530] ffff880156a5ffe8: 0000000000000246 (0x246)
[  297.630533] ffff880156a5fff0: 00007ffec7d87db8 (0x7ffec7d87db8)
[  297.630535] ffff880156a5fff8: 000000000000002b (0x2b)

opensnoop(pythong) itself blocks too:

root@test:~# cat /proc/2075/stack
[<ffffffffb79a0a07>] ring_buffer_wait+0x167/0x2e0
[<ffffffffb79a34e7>] wait_on_pipe+0x77/0x80
[<ffffffffb79aa7a1>] tracing_wait_pipe.isra.69+0x51/0xf0
[<ffffffffb79abdf9>] tracing_read_pipe+0x1c9/0x500
[<ffffffffb7ab5e62>] __vfs_read+0xd2/0x370
[<ffffffffb7ab61b7>] vfs_read+0xb7/0x1a0
[<ffffffffb7ab6bd0>] SyS_read+0xa0/0x120
[<ffffffffb8843c37>] entry_SYSCALL_64_fastpath+0x1a/0xa5
[<ffffffffffffffff>] 0xffffffffffffffff

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y

^ permalink raw reply

* [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Andrew Lunn @ 2017-09-25 21:32 UTC (permalink / raw)
  To: David Miller; +Cc: Vivien Didelot, netdev, Andrew Lunn

Ports with the same VLAN must all be in the same bridge. However the
CPU and DSA ports need to be in multiple VLANs spread over multiple
bridges. So exclude them when performing this test.

Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c6678aa9b4ef..674dab71d71c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1100,6 +1100,10 @@ static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch *ds, int port,
 	};
 	int i, err;
 
+	/* DSA and CPU ports have to be members of multiple vlans */
+	if (dsa_is_dsa_port(ds, port) || dsa_is_cpu_port(ds, port))
+		return 0;
+
 	if (!vid_begin)
 		return -EOPNOTSUPP;
 
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH net-next v2] bpf: Optimize lpm trie delete
From: David Miller @ 2017-09-25 21:38 UTC (permalink / raw)
  To: kraigatgoog; +Cc: daniel, ast, daniel, netdev
In-Reply-To: <20170921224329.101928-1-kraigatgoog@gmail.com>

From: Craig Gallek <kraigatgoog@gmail.com>
Date: Thu, 21 Sep 2017 18:43:29 -0400

> From: Craig Gallek <kraig@google.com>
> 
> Before the delete operator was added, this datastructure maintained
> an invariant that intermediate nodes were only present when necessary
> to build the tree.  This patch updates the delete operation to reinstate
> that invariant by removing unnecessary intermediate nodes after a node is
> removed and thus keeping the tree structure at a minimal size.
> 
> Suggested-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Craig Gallek <kraig@google.com>

Applied, thank you.

^ permalink raw reply

* Re: [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Vivien Didelot @ 2017-09-25 21:34 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506375140-2853-1-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> Ports with the same VLAN must all be in the same bridge. However the
> CPU and DSA ports need to be in multiple VLANs spread over multiple
> bridges. So exclude them when performing this test.
>
> Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

^ permalink raw reply

* Re: [PATCH net-next] bpf/verifier: improve disassembly of BPF_END instructions
From: Daniel Borkmann @ 2017-09-25 21:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Y Song
  Cc: Edward Cree, David Miller, netdev, Jiong Wang, Jakub Kicinski
In-Reply-To: <20170924055016.w6x5tj6kjxjbocpl@ast-mbp>

On 09/24/2017 07:50 AM, Alexei Starovoitov wrote:
> On Fri, Sep 22, 2017 at 09:49:10PM -0700, Y Song wrote:
>> On Fri, Sep 22, 2017 at 9:23 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 22/09/17 16:16, Alexei Starovoitov wrote:
>>>> looks like we're converging on
>>>> "be16/be32/be64/le16/le32/le64 #register" for BPF_END.
>>>> I guess it can live with that. I would prefer more C like syntax
>>>> to match the rest, but llvm parsing point is a strong one.
>>> Yep, agreed.  I'll post a v2 once we've settled BPF_NEG.
>>>> For BPG_NEG I prefer to do it in C syntax like interpreter does:
>>>>          ALU_NEG:
>>>>                  DST = (u32) -DST;
>>>>          ALU64_NEG:
>>>>                  DST = -DST;
>>>> Yonghong, does it mean that asmparser will equally suffer?
>>> Correction to my earlier statements: verifier will currently disassemble
>>>   neg as:
>>> (87) r0 neg 0
>>> (84) (u32) r0 neg (u32) 0
>>>   because it pretends 'neg' is a compound-assignment operator like +=.
>>> The analogy with be16 and friends would be to use
>>>      neg64 r0
>>>      neg32 r0
>>>   whereas the analogy with everything else would be
>>>      r0 = -r0
>>>      r0 = (u32) -r0
>>>   as Alexei says.
>>> I'm happy to go with Alexei's version if it doesn't cause problems for llvm.
>>
>> I got some time to do some prototyping in llvm and it looks like that
>> I am able to
>> resolve the issue and we are able to use more C-like syntax. That is:
>> for bswap:
>>       r1 = (be16) (u16) r1
>>       or
>>       r1 = (be16) r1
>>       or
>>       r1 = be16 r1
>> for neg:
>>       r0 = -r0
>>       (for 32bit support, llvm may output "w0 = -w0" in the future. But
>> since it is not
>>        enabled yet, you can continue to output "r0 = (u32) -r0".)
>>
>> Not sure which syntax is best for bswap. The "r1 = (be16) (u16) r1" is most
>> explicit in its intention.
>>
>> Attaching my llvm patch as well and cc'ing Jiong and Jakub so they can see my
>> implementation and the relative discussion here. (In this patch, I did
>> not implement
>> bswap for little endian yet.) Maybe they can provide additional comments.
>
> This is awesome. In such case I'd like to swing back to the C syntax for bpf_end :)
> Any of these
>    r1 = (be16) (u16) r1
>    or
>    r1 = (be16) r1
>    or
>    r1 = be16 r1
> are better than just
>    be16 r1
> I like 1st the most, since it's explicit in terms of what happens with upper bits,
> but 2nd is also ok. 3rd is not quite C-like.

But above cast to be16 also doesn't seem quite C-like in terms
of what we're actually doing... 3rd option would be my personal
preference even if it doesn't look C-like, but otoh we also have
'call' etc which is neither.

^ permalink raw reply

* Re: [PATCH net 0/2] l2tp: fix some races in session deletion
From: David Miller @ 2017-09-25 21:45 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin, sd
In-Reply-To: <cover.1506086081.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 22 Sep 2017 15:39:22 +0200

> L2TP provides several interfaces for deleting sessions. Using two of
> them concurrently can lead to use-after-free bugs.
> 
> Patch #2 uses a flag to prevent double removal of L2TP sessions.
> Patch #1 fixes a bug found in the way. Fixing this bug is also
> necessary for patch #2 to handle all cases.
> 
> This issue is similar to the tunnel deletion bug being worked on by
> Sabrina: https://patchwork.ozlabs.org/patch/814173/

Series applied and queued up for -stable, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox