Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-25 20:18 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <3f824b0e-65f9-c69c-5421-2c5f6b349b09@redhat.com>

On 09/22/2017 12:03 AM, Jason Wang wrote:
> 
> 
> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>> patch goes even further which disables rx polling during processing tx.
>>> Please try it to see if it has any difference.
>> Unfortunately, this patch doesn't seem to have made a difference.  I
>> tried runs with both this patch and the previous patch applied, as well
>> as only this patch applied for comparison (numbers from vhost thread of
>> sending VM):
>>
>> 4.12    4.13     patch1   patch2   patch1+2
>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>
>> In each case, the regression in throughput was still present.
> 
> This probably means some other cases of the wakeups were missed. Could
> you please record the callers of __wake_up_sync_key()?
> 

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key
(for both sender and server vhost threads) shows the following stack trace:

     vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
<-sock_def_readable
     vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter

>>
>>> And two questions:
>>> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
>> Verified that the second set of guests are not actually required, I can
>> see the regression with only 2 VMs.
>>
>>> - Can enable batching in the tap of sending VM improve the performance
>>> (ethtool -C $tap rx-frames 64)
>> I tried this, but it did not help (actually seemed to make things a
>> little worse)
>>
> 
>  I still can't see a reason that can lead more wakeups, will take more
> time to look at this issue and keep you posted.
> 
> Thanks
> 

^ permalink raw reply

* Re: [PATCH] r8152: add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-25 20:17 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Grant Grundler, Hayes Wang, David S . Miller, LKML, linux-usb,
	netdev
In-Reply-To: <1506351074.7827.13.camel@suse.com>

[grrhmail...sorry! resending as plain text]

Hallo Oliver!

On Mon, Sep 25, 2017 at 7:51 AM, Oliver Neukum <oneukum@suse.com> wrote:
> Am Freitag, den 22.09.2017, 12:06 -0700 schrieb Grant Grundler:
> > This Linksys dongle by default comes up in cdc_ether mode.
> > This patch allows r8152 to claim the device:
> >    Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Hi,
>
> have you tested this in case cdc_ether is for some reason already
> loaded?

I did not consider testing this case since it's not possible on a
normal chromeos system (the entire root file system is signed for
normal users and get's rebooted after an update).  I could test this
in developer mode of course.

Did you expect both driver probe routines to claim the device and
wreak havoc with the device?

> The patch seems to enable r8152 but does not disable cdc_ether.

Correct. r8152 happens to claim the device before cdc_ether does - I
thought because cdc_ether is a class driver and only gets picked up
after vendor specific drivers are probed.  Is that correct?

I didn't realize cdc_ether has a blacklist to make sure
RTL8152|RTL8153 devices are not picked up by cdc_ether. Would you
prefer I add this device to the blacklist in the same patch?

cheers,
grant

>
>         Regards
>                 Oliver
>

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Miller @ 2017-09-25 20:11 UTC (permalink / raw)
  To: dsahern; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <ce328dfd-0b54-883a-36a8-91ec344f86ad@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 25 Sep 2017 10:14:23 -0600

> I made a simple program this morning and ran it under perf.

If possible please submit this for selftests.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: John Fastabend @ 2017-09-25 19:47 UTC (permalink / raw)
  To: Daniel Borkmann, Andy Gospodarek
  Cc: davem, alexei.starovoitov, peter.waskiewicz.jr, jakub.kicinski,
	netdev, mchan
In-Reply-To: <59C94FF4.8070900@iogearbox.net>

On 09/25/2017 11:50 AM, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
>> First, thanks for this detailed description.  It was helpful to read
>> along with the patches.
>>
>> My only concern about this area being generic is that you are now in a
>> state where any bpf program must know about all the bpf programs in the
>> receive pipeline before it can properly parse what is stored in the
>> meta-data and add it to an skb (or perform any other action).
>> Especially if each program adds it's own meta-data along the way.
>>
>> Maybe this isn't a big concern based on the number of users of this
>> today, but it just starts to seem like a concern as there are these
>> hints being passed between layers that are challenging to track due to a
>> lack of a standard format for passing data between.
> 
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
> 
>> The main reason I bring this up is that Michael and I had discussed and
>> designed a way for drivers to communicate between each other that rx
>> resources could be freed after a tx completion on an XDP_REDIRECT
>> action.  Much like this code, it involved adding an new element to
>> struct xdp_md that could point to the important information.  Now that
>> there is a generic way to handle this, it would seem nice to be able to
>> leverage it, but I'm not sure how reliable this meta-data area would be
>> without the ability to mark it in some manner.
>>
>> For additional background, the minimum amount of data needed in the case
>> Michael and I were discussing was really 2 words.  One to serve as a
>> pointer to an rx_ring structure and one to have a counter to the rx
>> producer entry.  This data could be acessed by the driver processing the
>> tx completions and callback to the driver that received the frame off
>> the wire
>> to perform any needed processing.  (For those curious this would also
>> require a
>> new callback/netdev op to act on this data stored in the XDP buffer.)
> 
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.
> 
> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.
> 
> Thanks,
> Daniel

Hi Andy,

I'm guessing this data needs to be passed from the input dev to the
output dev based on your description. If the driver data
is pushed after the BPF program is run but before the xdp_do_flush_map
call no other BPF programs can be run on that xdp_buff. It
should be safe at this point to use the metadata region directly
from the driver. We would just need to add a few helpers for the
drivers to use for this maybe, xdp_metadata_write_drv,
xdp_meadata_read_drv. I think this would work for your use case?
The data structure would have to be agreed upon by all the drivers
but would not be UAPI because it would only be exposed in the
driver. So we would be free to change/update it as needed.

Thanks,
John

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next v9] openvswitch: enable NSH support
From: Eric Garver @ 2017-09-25 19:28 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, jbenc, davem
In-Reply-To: <1506348969-6233-1-git-send-email-yi.y.yang@intel.com>

On Mon, Sep 25, 2017 at 10:16:09PM +0800, Yi Yang wrote:
> v8->v9
>  - Fix build error reported by daily intel build
>    because nsh module isn't selected by openvswitch
> 
> v7->v8
>  - Rework nested value and mask for OVS_KEY_ATTR_NSH
>  - Change pop_nsh to adapt to nsh kernel module
>  - Fix many issues per comments from Jiri Benc
> 
> v6->v7
>  - Remove NSH GSO patches in v6 because Jiri Benc
>    reworked it as another patch series and they have
>    been merged.
>  - Change it to adapt to nsh kernel module added by NSH
>    GSO patch series
> 
> v5->v6
>  - Fix the rest comments for v4.
>  - Add NSH GSO support for VxLAN-gpe + NSH and
>    Eth + NSH.
> 
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>    for v4.
> 
> v3->v4
>  - Add new NSH match field ttl
>  - Update NSH header to the latest format
>    which will be final format and won't change
>    per its author's confirmation.
>  - Fix comments for v3.
> 
> v2->v3
>  - Change OVS_KEY_ATTR_NSH to nested key to handle
>    length-fixed attributes and length-variable
>    attriubte more flexibly.
>  - Remove struct ovs_action_push_nsh completely
>  - Add code to handle nested attribute for SET_MASKED
>  - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
>    to transfer NSH header data.
>  - Fix comments and coding style issues by Jiri and Eric
> 
> v1->v2
>  - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
>  - Dynamically allocate struct ovs_action_push_nsh for
>    length-variable metadata.
> 
> OVS master and 2.8 branch has merged NSH userspace
> patch series, this patch is to enable NSH support
> in kernel data path in order that OVS can support
> NSH in compat mode by porting this.
> 
> Signed-off-by: Yi Yang <yi.y.yang@intel.com>
> ---
>  include/net/nsh.h                |   3 +
>  include/uapi/linux/openvswitch.h |  29 ++++
>  net/nsh/nsh.c                    |  53 +++++++
>  net/openvswitch/Kconfig          |   1 +
>  net/openvswitch/actions.c        | 112 ++++++++++++++
>  net/openvswitch/flow.c           |  51 ++++++
>  net/openvswitch/flow.h           |  11 ++
>  net/openvswitch/flow_netlink.c   | 324 ++++++++++++++++++++++++++++++++++++++-
>  net/openvswitch/flow_netlink.h   |   5 +
>  9 files changed, 588 insertions(+), 1 deletion(-)
> 
[...]
> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 58fb827..54334ca 100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,59 @@
>  #include <net/nsh.h>
>  #include <net/tun_proto.h>
>  
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
> +{
> +	struct nshhdr *nsh_hdr;
> +	size_t length = nsh_hdr_len(src_nsh_hdr);
> +	u8 next_proto;
> +
> +	if (skb->mac_len) {
> +		next_proto = TUN_P_ETHERNET;
> +	} else {
> +		next_proto = tun_p_from_eth_p(skb->protocol);
> +		if (!next_proto)
> +			return -EAFNOSUPPORT;
> +	}
> +
> +	/* Add the NSH header */
> +	if (skb_cow_head(skb, length) < 0)
> +		return -ENOMEM;
> +
> +	skb_push(skb, length);
> +	nsh_hdr = (struct nshhdr *)(skb->data);
> +	memcpy(nsh_hdr, src_nsh_hdr, length);
> +	nsh_hdr->np = next_proto;
> +
> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

I think you mean to reset network_header before mac_len.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;
> +
> +	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
> +	if (!inner_proto)
> +		return -EAFNOSUPPORT;
> +
> +	length = nsh_hdr_len(nsh_hdr);
> +	skb_pull(skb, length);

Do you need to verify you can actually pull length bytes? I don't see
any guarantee.

> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

Reset network before mac_len.

> +	skb->protocol = inner_proto;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_pop_nsh);
> +
>  static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
>  				       netdev_features_t features)
>  {
[...]
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556..d026b85 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
[...]
> @@ -602,6 +642,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
>  	return 0;
>  }
>  
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nsh_hdr;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));

This calls pskb_may_pull(), but you're not pulling any data here.

> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
> +
> +	flags = nsh_get_flags(nsh_hdr);
> +	flags = OVS_MASKED(flags, key.flags, mask.flags);
> +	flow_key->nsh.flags = flags;
> +	ttl = nsh_get_ttl(nsh_hdr);
> +	ttl = OVS_MASKED(ttl, key.ttl, mask.ttl);
> +	flow_key->nsh.ttl = ttl;
> +	nsh_set_flags_and_ttl(nsh_hdr, flags, ttl);
> +	nsh_hdr->path_hdr = OVS_MASKED(nsh_hdr->path_hdr, key.path_hdr,
> +				       mask.path_hdr);
> +	flow_key->nsh.path_hdr = nsh_hdr->path_hdr;
> +	switch (nsh_hdr->mdtype) {
> +	case NSH_M_TYPE1:
> +		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
> +			nsh_hdr->md1.context[i] =
> +			    OVS_MASKED(nsh_hdr->md1.context[i], key.context[i],
> +				       mask.context[i]);
> +		}
> +		memcpy(flow_key->nsh.context, nsh_hdr->md1.context,
> +		       sizeof(nsh_hdr->md1.context));
> +		break;
> +	case NSH_M_TYPE2:
> +		memset(flow_key->nsh.context, 0,
> +		       sizeof(flow_key->nsh.context));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /* Must follow skb_ensure_writable() since that can move the skb data. */
>  static void set_tp_port(struct sk_buff *skb, __be16 *port,
>  			__be16 new_port, __sum16 *check)
[...]

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Jiri Pirko @ 2017-09-25 19:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-3-xiyou.wangcong@gmail.com>

Mon, Sep 25, 2017 at 07:13:51PM CEST, xiyou.wangcong@gmail.com wrote:
>Instead of calling u32_lookup_ht() in a loop to find
>a unused handle, just switch to idr API to allocate
>new handles. u32 filters are special as the handle
>could contain a hash table id and a key id, so we
>need two IDR to allocate each of them.
>
>Cc: Chris Mi <chrism@mellanox.com>
>Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>---

[...]

>@@ -311,19 +313,19 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> 	return u32_lookup_key(ht, handle);
> }
> 
>-static u32 gen_new_htid(struct tc_u_common *tp_c)
>+static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> {
>-	int i = 0x800;
>+	unsigned long idr_index;
>+	int err;
> 
>-	/* hgenerator only used inside rtnl lock it is safe to increment
>+	/* This is only used inside rtnl lock it is safe to increment
> 	 * without read _copy_ update semantics
> 	 */
>-	do {
>-		if (++tp_c->hgenerator == 0x7FF)
>-			tp_c->hgenerator = 1;
>-	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));
>-
>-	return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
>+	err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
>+			    1, 0x7FF, GFP_KERNEL);

Interesting, any idea why this is not 0x7FFFFFFF as well?

I wonder if we could have 0x7FFFFFFF magic defined somewhere.

Otherwise, "patchset" looks good. Thank you for taking care of this!

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Daniel Borkmann @ 2017-09-25 18:50 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, mchan
In-Reply-To: <20170925181000.GA60144@C02RW35GFVH8.dhcp.broadcom.net>

On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
[...]
> First, thanks for this detailed description.  It was helpful to read
> along with the patches.
>
> My only concern about this area being generic is that you are now in a
> state where any bpf program must know about all the bpf programs in the
> receive pipeline before it can properly parse what is stored in the
> meta-data and add it to an skb (or perform any other action).
> Especially if each program adds it's own meta-data along the way.
>
> Maybe this isn't a big concern based on the number of users of this
> today, but it just starts to seem like a concern as there are these
> hints being passed between layers that are challenging to track due to a
> lack of a standard format for passing data between.

Btw, we do have similar kind of programmable scratch buffer also today
wrt skb cb[] that you can program from tc side, the perf ring buffer,
which doesn't have any fixed layout for the slots, or a per-cpu map
where you can transfer data between tail calls for example, then tail
calls themselves that need to coordinate, or simply mangling of packets
itself if you will, but more below to your use case ...

> The main reason I bring this up is that Michael and I had discussed and
> designed a way for drivers to communicate between each other that rx
> resources could be freed after a tx completion on an XDP_REDIRECT
> action.  Much like this code, it involved adding an new element to
> struct xdp_md that could point to the important information.  Now that
> there is a generic way to handle this, it would seem nice to be able to
> leverage it, but I'm not sure how reliable this meta-data area would be
> without the ability to mark it in some manner.
>
> For additional background, the minimum amount of data needed in the case
> Michael and I were discussing was really 2 words.  One to serve as a
> pointer to an rx_ring structure and one to have a counter to the rx
> producer entry.  This data could be acessed by the driver processing the
> tx completions and callback to the driver that received the frame off the wire
> to perform any needed processing.  (For those curious this would also require a
> new callback/netdev op to act on this data stored in the XDP buffer.)

What you describe above doesn't seem to be fitting to the use-case of
this set, meaning the area here is fully programmable out of the BPF
program, the infrastructure you're describing is some sort of means of
communication between drivers for the XDP_REDIRECT, and should be
outside of the control of the BPF program to mangle.

You could probably reuse the base infra here and make a part of that
inaccessible for the program with some sort of a fixed layout, but I
haven't seen your code yet to be able to fully judge. Intention here
is to allow for programmability within the BPF prog in a generic way,
such that based on the use-case it can be populated in specific ways
and propagated to the skb w/o having to define a fixed layout and
bloat xdp_buff all the way to an skb while still retaining all the
flexibility.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next 2/7] nfp: compile flower vxlan tunnel metadata match fields
From: Or Gerlitz @ 2017-09-25 18:35 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <1506335021-32024-3-git-send-email-simon.horman@netronome.com>

On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> From: John Hurley <john.hurley@netronome.com>
>
> Compile ovs-tc flower vxlan metadata match fields for offloading. Only

anything in the npf kernel bits has direct relation to ovs? what?

> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
> @@ -52,8 +52,25 @@
>          BIT(FLOW_DISSECTOR_KEY_PORTS) | \
>          BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
>          BIT(FLOW_DISSECTOR_KEY_VLAN) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_KEYID) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) | \
> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) | \

this series takes care of IPv6 tunnels too?

^ permalink raw reply

* Re: [PATCH net-next 7/7] nfp: flower vxlan neighbour keep-alive
From: Or Gerlitz @ 2017-09-25 18:32 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <1506335021-32024-8-git-send-email-simon.horman@netronome.com>

On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> From: John Hurley <john.hurley@netronome.com>
>
> Periodically receive messages containing the destination IPs of tunnels
> that have recently forwarded traffic. Update the neighbour entries 'used'
> value for these IPs next hop.

Are you proactively sending keep alive messages from the driver or the
fw? what's wrong with the probes sent by the kernel NUD subsystem?

In our driver we also update the used value for neighs of offloaded
tunnels, we do it based on flow counters for the offloaded tunnels
which is an evidence for activity. Any reason for you not to apply a
similar practice?

Or.

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Jiri Benc @ 2017-09-25 18:14 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, davem
In-Reply-To: <1506348969-6233-1-git-send-email-yi.y.yang@intel.com>

On Mon, 25 Sep 2017 22:16:09 +0800, Yi Yang wrote:
> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

The last two lines are swapped. Network header needs to be reset before
mac_len.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;

__be16 inner_proto.

> +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> +		    const struct nshhdr *src_nsh_hdr)
> +{
> +	int err;
> +
> +	err = skb_push_nsh(skb, src_nsh_hdr);
> +	if (err)
> +		return err;
> +
> +	key->eth.type = htons(ETH_P_NSH);

I wonder why you have this assignment here. The key is invalidated,
thus nothing should rely on key->eth.type. However, looking at the code
and ovs_fragment in particular, I'm not sure that's the case. Could you
please explain why it is needed? And why the reverse of it is not
needed in pop_nsh?

> +
> +	/* safe right before invalidate_flow_key */
> +	key->mac_proto = MAC_PROTO_NONE;
> +	invalidate_flow_key(key);
> +	return 0;
> +}
[...]
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nsh_hdr;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));

Whitespace nit: the sizeof should be aligned to skb_network_offset.

> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Please use the nsh_hdr function (I'm sure I already requested that in
the previous review?). It means renaming the nsh_hdr variable.

> @@ -1210,6 +1307,21 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  		case OVS_ACTION_ATTR_POP_ETH:
>  			err = pop_eth(skb, key);
>  			break;
> +
> +		case OVS_ACTION_ATTR_PUSH_NSH: {
> +			u8 buffer[NSH_HDR_MAX_LEN];
> +			struct nshhdr *nsh_hdr = (struct nshhdr *)buffer;
> +			const struct nshhdr *src_nsh_hdr = nsh_hdr;
> +
> +			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr,
> +					    NSH_HDR_MAX_LEN);
> +			err = push_nsh(skb, key, src_nsh_hdr);

Is the src_nsh_hdr variable really necessary? Cannot be nsh_hdr passed
directly to push_nsh, relying on automatic retyping to const?

By the way, nsh_hdr is a poor name for a variable, it clashes with the
nsh_hdr function. For clarity, please rename the variables in the whole
patch to something else.

> +static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +	struct nshhdr *nsh_hdr;
> +	unsigned int nh_ofs = skb_network_offset(skb);
> +	u8 version, length;
> +	int err;
> +
> +	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Again, rename the variable and use the nsh_hdr function.

> +	version = nsh_get_ver(nsh_hdr);
> +	length = nsh_hdr_len(nsh_hdr);
> +
> +	if (version != 0)
> +		return -EINVAL;
> +
> +	err = check_header(skb, nh_ofs + length);
> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);

Ditto.

> +struct ovs_key_nsh {
> +	u8 flags;
> +	u8 ttl;
> +	u8 mdtype;
> +	u8 np;
> +	__be32 path_hdr;
> +	__be32 context[NSH_MD1_CONTEXT_SIZE];
> +};

This should be:

struct ovs_key_nsh {
	struct ovs_nsh_key_base base;
	__be32 context[NSH_MD1_CONTEXT_SIZE];
};

to capture the real dependency. Implicitly depending on ovs_key_nsh
starting with the same fields as ovs_nsh_key_base will only lead to bugs
in the future.

> +int nsh_hdr_from_nlattr(const struct nlattr *attr,
> +			struct nshhdr *nsh, size_t size)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	u8 flags = 0;
> +	u8 ttl = 0;
> +	int mdlen = 0;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base =
> +				(struct ovs_nsh_key_base *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			flags = base->flags;
> +			ttl = base->ttl;
> +			nsh->np = base->np;
> +			nsh->mdtype = base->mdtype;
> +			nsh->path_hdr = base->path_hdr;
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD1: {
> +			const struct ovs_nsh_key_md1 *md1 =
> +				(struct ovs_nsh_key_md1 *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			struct nsh_md1_ctx *md1_dst = &nsh->md1;
> +
> +			mdlen = nla_len(a);
> +			memcpy(md1_dst, md1, mdlen);

Why the variable? Just memcpy to &nsh->md1.

> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD2: {
> +			const struct u8 *md2 = nla_data(a);
> +			struct nsh_md2_tlv *md2_dst = &nsh->md2;
> +
> +			mdlen = nla_len(a);
> +			memcpy(md2_dst, md2, mdlen);

Ditto.

> +int nsh_key_from_nlattr(const struct nlattr *attr,
> +			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
> +{
> +	struct nlattr *a;
> +	int rem;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base =
> +				(struct ovs_nsh_key_base *)nla_data(a);

It's not necessary to retype void * explicitly. Just assign it.

> +			const struct ovs_nsh_key_base *base_mask = base + 1;
> +
> +			memcpy(nsh, base, sizeof(*base));
> +			memcpy(nsh, base_mask, sizeof(*base_mask));

The second memcpy should copy to nsh_mask, not nsh.

If you modify struct ovs_key_nsh as suggested above, this becomes a
simple:

nsh->base = *base;
nsh_mask->base = *base_mask;

I'm perfectly fine with memcpy, too, though.

> +static int nsh_key_put_from_nlattr(const struct nlattr *attr,
> +				   struct sw_flow_match *match, bool is_mask,
> +				   bool is_push_nsh, bool log)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	bool has_base = false;
> +	bool has_md1 = false;
> +	bool has_md2 = false;
> +	u8 mdtype = 0;
> +	int mdlen = 0;
> +
> +	if (unlikely(is_push_nsh && is_mask))
> +		return -EINVAL;

Wrap this in WARN_ON. (And note that you don't need unlikely with
WARN_ON.)

> +		case OVS_NSH_KEY_ATTR_MD2:
> +			if (!is_push_nsh) /* Not supported MD type 2 yet */
> +				return -ENOTSUPP;
> +
> +			has_md2 = true;
> +			mdlen = nla_len(a);
> +			if ((mdlen > NSH_CTX_HDRS_MAX_LEN) ||
> +			    (mdlen <= 0)) {
> +				WARN_ON_ONCE(1);

But here, the WARN_ON_ONCE is completely inappropriate. This condition
does not indicate a kernel bug but rather invalid data from the user
space. OVS_NLERR should be here instead.

> +		if (is_push_nsh &&
> +		    (!has_base || (!has_md1 && !has_md2))) {
> +			OVS_NLERR(
> +			    1,
> +			    "push nsh attributes are invalid for type %d.",
> +			    mdtype
> +			);

"Missing nsh base and/or metadata attribute." or something like that
would be a better error message.

> +static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
> +			     struct sk_buff *skb)
> +{
> +	struct nlattr *start;
> +	struct ovs_nsh_key_base base;
> +	struct ovs_nsh_key_md1 md1;
> +
> +	memcpy(&base, nsh, sizeof(base));
> +
> +	if (is_mask || nsh->mdtype == NSH_M_TYPE1)
> +		memcpy(md1.context, nsh->context, sizeof(md1));
> +
> +	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
> +	if (!start)
> +		return -EMSGSIZE;
> +
> +	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(base), &base))

The 'base' variable is not needed, let alone copy to it, just use
&nsh->base here.

> +		goto nla_put_failure;
> +
> +	if (is_mask || nsh->mdtype == NSH_M_TYPE1) {
> +		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1, sizeof(md1), &md1))

Likewise, no need for the copy.

> +	return ((ret != 0) ? false : true);

Too little coffee or too late in the night, right? ;-)

 Jiri

^ permalink raw reply

* [PATCH iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths
From: Yulia Kartseva @ 2017-09-25 18:12 UTC (permalink / raw)
  To: netdev; +Cc: shemminger

Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
These are  /31, /63, /95 and /127 prefix lengths.

Example:
# tc filter add dev eth0 protocol ipv6 parent b: prio 2307 u32 match
ip6 dst face:b00f::/31
# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00f/ffffffff at 24


The correct match would be "faceb00e/fffffffe": don't count the last
bit of the 4th byte as the network prefix. With fix:

# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00e/fffffffe at 24

 tc/f_u32.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 5815be9..14b9588 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -385,8 +385,7 @@ static int parse_ip6_addr(int *argc_p, char ***argv_p,

  plen = addr.bitlen;
  for (i = 0; i < plen; i += 32) {
- /* if (((i + 31) & ~0x1F) <= plen) { */
- if (i + 31 <= plen) {
+ if (i + 31 < plen) {
  res = pack_key(sel, addr.data[i / 32],
        0xFFFFFFFF, off + 4 * (i / 32), offmask);
  if (res < 0)

^ permalink raw reply related

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Andy Gospodarek @ 2017-09-25 18:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, mchan
In-Reply-To: <458f9c13ab58abb1a15627906d03c33c42b02a7c.1506297988.git.daniel@iogearbox.net>

On Mon, Sep 25, 2017 at 02:25:51AM +0200, Daniel Borkmann wrote:
> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
> 
> xdp->data_meta is optional and programs are not required to use it. The
> rationale is that when we process the packet in XDP (e.g. as DoS filter),
> we can push further meta data along with it for the XDP_PASS case, and
> give the guarantee that a clsact ingress BPF program on the same device
> can pick this up for further post-processing. Since we work with skb
> there, we can also set skb->mark, skb->priority or other skb meta data
> out of BPF, thus having this scratch space generic and programmable
> allows for more flexibility than defining a direct 1:1 transfer of
> potentially new XDP members into skb (it's also more efficient as we
> don't need to initialize/handle each of such new members).  The facility
> also works together with GRO aggregation. The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.
> 
> The verifier treats xdp->data_meta/xdp->data the same way as we treat
> xdp->data/xdp->data_end pointer comparisons. The requirement for doing
> the compare against xdp->data is that it hasn't been modified from it's
> original address we got from ctx access. It may have a range marking
> already from prior successful xdp->data/xdp->data_end pointer comparisons
> though.

First, thanks for this detailed description.  It was helpful to read
along with the patches.

My only concern about this area being generic is that you are now in a
state where any bpf program must know about all the bpf programs in the
receive pipeline before it can properly parse what is stored in the
meta-data and add it to an skb (or perform any other action).
Especially if each program adds it's own meta-data along the way.

Maybe this isn't a big concern based on the number of users of this
today, but it just starts to seem like a concern as there are these
hints being passed between layers that are challenging to track due to a
lack of a standard format for passing data between.

The main reason I bring this up is that Michael and I had discussed and
designed a way for drivers to communicate between each other that rx
resources could be freed after a tx completion on an XDP_REDIRECT
action.  Much like this code, it involved adding an new element to
struct xdp_md that could point to the important information.  Now that
there is a generic way to handle this, it would seem nice to be able to
leverage it, but I'm not sure how reliable this meta-data area would be
without the ability to mark it in some manner.

For additional background, the minimum amount of data needed in the case
Michael and I were discussing was really 2 words.  One to serve as a
pointer to an rx_ring structure and one to have a counter to the rx
producer entry.  This data could be acessed by the driver processing the
tx completions and callback to the driver that received the frame off the wire
to perform any needed processing.  (For those curious this would also require a
new callback/netdev op to act on this data stored in the XDP buffer.)

IIUC, I could use this meta_data area to store this information, but would it
also be useful to create some type field/marker that could also be stored in
the meta_data to indicate what type of information is there?  I hate to propose
such a thing as it may add unneeded complexity, but I just wanted to make sure
to say something before it was too late as there may be more users of this
right away.  :-)

> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   1 +
>  .../net/ethernet/netronome/nfp/nfp_net_common.c    |   1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c         |   1 +
>  drivers/net/tun.c                                  |   1 +
>  drivers/net/virtio_net.c                           |   2 +
>  include/linux/bpf.h                                |   1 +
>  include/linux/filter.h                             |  21 +++-
>  include/linux/skbuff.h                             |  68 +++++++++++-
>  include/uapi/linux/bpf.h                           |  13 ++-
>  kernel/bpf/verifier.c                              | 114 ++++++++++++++++-----
>  net/bpf/test_run.c                                 |   1 +
>  net/core/dev.c                                     |  31 +++++-
>  net/core/filter.c                                  |  77 +++++++++++++-
>  net/core/skbuff.c                                  |   2 +
>  19 files changed, 297 insertions(+), 42 deletions(-)
[...]

^ permalink raw reply

* [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()
From: Wei Wang @ 2017-09-25 17:35 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Martin KaFai Lau, Wei Wang

From: Wei Wang <weiwan@google.com>

fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
make sense because it is possible that the route passed in is already
deleted by some other thread and rt->dst.obsolete is set to
DST_OBSOLETE_DEAD.
So this commit deletes this WARN_ON() and also remove the
"#ifdef RT6_DEBUG >= 2" condition so that if the route is already
obsolete, we return right at the beginning of fib6_del().

Syzkaller hit this WARN_ON() in the following call trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:fib6_del+0x947/0xca0 net/ipv6/ip6_fib.c:1477
RSP: 0018:ffff8801db2074d8 EFLAGS: 00010206
RAX: ffff8801d1500080 RBX: ffff8801d01638c0 RCX: 0000000000000000
RDX: 0000000000000100 RSI: ffff8801db207650 RDI: ffff8801d0163924
RBP: ffff8801db2075f0 R08: ffffffff86df5f98 R09: 0000000000000002
R10: ffff8801db2074b8 R11: 1ffff1003a2a026b R12: dffffc0000000000
R13: ffff8801db207650 R14: ffff8801a0748180 R15: 1ffff1003b640ea5
 __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2136
 ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2149
 ip6_link_failure+0x244/0x380 net/ipv6/route.c:1359
 dst_link_failure include/net/dst.h:454 [inline]
 ndisc_error_report+0xae/0x180 net/ipv6/ndisc.c:682
 neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
 neigh_timer_handler+0x883/0xca0 net/core/neighbour.c:969
 call_timer_fn+0x233/0x830 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x7fd/0xb90 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x2f5/0xba3 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:638 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:1044
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:702
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x56/0x70 kernel/locking/spinlock.c:199
RSP: 0018:ffff8801d0407040 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
RAX: dffffc0000000000 RBX: ffff8801db225780 RCX: 0000000000000000
RDX: 1ffffffff0b59433 RSI: 0000000000000001 RDI: ffffffff85aca198
RBP: ffff8801d0407048 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6820400
R13: 1ffff1003a080e11 R14: ffff8801d1500080 R15: ffff8801d1500080
 </IRQ>
 finish_lock_switch kernel/sched/sched.h:1334 [inline]
 finish_task_switch+0x1d3/0x740 kernel/sched/core.c:2638
 context_switch kernel/sched/core.c:2774 [inline]
 __schedule+0x8f0/0x2070 kernel/sched/core.c:3332
 schedule+0x108/0x440 kernel/sched/core.c:3391
 schedule_hrtimeout_range_clock+0x23e/0x810 kernel/time/hrtimer.c:1708
 schedule_hrtimeout_range+0x2a/0x40 kernel/time/hrtimer.c:1753
 poll_schedule_timeout+0x10f/0x1f0 fs/select.c:242
 do_select+0x11ea/0x1710 fs/select.c:581
 core_sys_select+0x480/0x960 fs/select.c:655
 do_pselect fs/select.c:732 [inline]
 SYSC_pselect6 fs/select.c:773 [inline]
 SyS_pselect6+0x54a/0x650 fs/select.c:758
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x45f181
RSP: 002b:00007f91306e1db0 EFLAGS: 00000246 ORIG_RAX: 000000000000010e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045f181
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000086 R08: 00007f91306e1db0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdd9621670
R13: 00007f91306e29c0 R14: 00007f9130eac040 R15: 0000000000000003

Note: there is no Fixes tag because this bug was introduced long ago.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_fib.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e5308d7cbd75..693bcd7ef6d2 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
 	struct net *net = info->nl_net;
 	struct rt6_info **rtp;
 
-#if RT6_DEBUG >= 2
-	if (rt->dst.obsolete > 0) {
-		WARN_ON(fn);
-		return -ENOENT;
-	}
-#endif
-	if (!fn || rt == net->ipv6.ip6_null_entry)
+	if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
 		return -ENOENT;
 
 	WARN_ON(!(fn->fn_flags & RTN_RTINFO));
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Chris Mi, Jamal Hadi Salim

Instead of calling u32_lookup_ht() in a loop to find
a unused handle, just switch to idr API to allocate
new handles. u32 filters are special as the handle
could contain a hash table id and a key id, so we
need two IDR to allocate each of them.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_u32.c | 108 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 67 insertions(+), 41 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 10b8d851fc6b..094d224411a9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -46,6 +46,7 @@
 #include <net/act_api.h>
 #include <net/pkt_cls.h>
 #include <linux/netdevice.h>
+#include <linux/idr.h>
 
 struct tc_u_knode {
 	struct tc_u_knode __rcu	*next;
@@ -82,6 +83,7 @@ struct tc_u_hnode {
 	struct tc_u_common	*tp_c;
 	int			refcnt;
 	unsigned int		divisor;
+	struct idr		handle_idr;
 	struct rcu_head		rcu;
 	/* The 'ht' field MUST be the last field in structure to allow for
 	 * more entries allocated at end of structure.
@@ -93,7 +95,7 @@ struct tc_u_common {
 	struct tc_u_hnode __rcu	*hlist;
 	struct Qdisc		*q;
 	int			refcnt;
-	u32			hgenerator;
+	struct idr		handle_idr;
 	struct hlist_node	hnode;
 	struct rcu_head		rcu;
 };
@@ -311,19 +313,19 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
 	return u32_lookup_key(ht, handle);
 }
 
-static u32 gen_new_htid(struct tc_u_common *tp_c)
+static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
 {
-	int i = 0x800;
+	unsigned long idr_index;
+	int err;
 
-	/* hgenerator only used inside rtnl lock it is safe to increment
+	/* This is only used inside rtnl lock it is safe to increment
 	 * without read _copy_ update semantics
 	 */
-	do {
-		if (++tp_c->hgenerator == 0x7FF)
-			tp_c->hgenerator = 1;
-	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));
-
-	return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
+	err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
+			    1, 0x7FF, GFP_KERNEL);
+	if (err)
+		return 0;
+	return (u32)(idr_index | 0x800) << 20;
 }
 
 static struct hlist_head *tc_u_common_hash;
@@ -366,8 +368,9 @@ static int u32_init(struct tcf_proto *tp)
 		return -ENOBUFS;
 
 	root_ht->refcnt++;
-	root_ht->handle = tp_c ? gen_new_htid(tp_c) : 0x80000000;
+	root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
 	root_ht->prio = tp->prio;
+	idr_init(&root_ht->handle_idr);
 
 	if (tp_c == NULL) {
 		tp_c = kzalloc(sizeof(*tp_c), GFP_KERNEL);
@@ -377,6 +380,7 @@ static int u32_init(struct tcf_proto *tp)
 		}
 		tp_c->q = tp->q;
 		INIT_HLIST_NODE(&tp_c->hnode);
+		idr_init(&tp_c->handle_idr);
 
 		h = tc_u_hash(tp);
 		hlist_add_head(&tp_c->hnode, &tc_u_common_hash[h]);
@@ -565,6 +569,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 					 rtnl_dereference(n->next));
 			tcf_unbind_filter(tp, &n->res);
 			u32_remove_hw_knode(tp, n->handle);
+			idr_remove_ext(&ht->handle_idr, n->handle);
 			call_rcu(&n->rcu, u32_delete_key_freepf_rcu);
 		}
 	}
@@ -586,6 +591,8 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 	     hn = &phn->next, phn = rtnl_dereference(*hn)) {
 		if (phn == ht) {
 			u32_clear_hw_hnode(tp, ht);
+			idr_destroy(&ht->handle_idr);
+			idr_remove_ext(&tp_c->handle_idr, ht->handle);
 			RCU_INIT_POINTER(*hn, ht->next);
 			kfree_rcu(ht, rcu);
 			return 0;
@@ -633,6 +640,7 @@ static void u32_destroy(struct tcf_proto *tp)
 			kfree_rcu(ht, rcu);
 		}
 
+		idr_destroy(&tp_c->handle_idr);
 		kfree(tp_c);
 	}
 
@@ -701,27 +709,21 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last)
 	return ret;
 }
 
-#define NR_U32_NODE (1<<12)
-static u32 gen_new_kid(struct tc_u_hnode *ht, u32 handle)
+static u32 gen_new_kid(struct tc_u_hnode *ht, u32 htid)
 {
-	struct tc_u_knode *n;
-	unsigned long i;
-	unsigned long *bitmap = kzalloc(BITS_TO_LONGS(NR_U32_NODE) * sizeof(unsigned long),
-					GFP_KERNEL);
-	if (!bitmap)
-		return handle | 0xFFF;
-
-	for (n = rtnl_dereference(ht->ht[TC_U32_HASH(handle)]);
-	     n;
-	     n = rtnl_dereference(n->next))
-		set_bit(TC_U32_NODE(n->handle), bitmap);
-
-	i = find_next_zero_bit(bitmap, NR_U32_NODE, 0x800);
-	if (i >= NR_U32_NODE)
-		i = find_next_zero_bit(bitmap, NR_U32_NODE, 1);
+	unsigned long idr_index;
+	u32 start = htid | 0x800;
+	u32 max = htid | 0xFFF;
+	u32 min = htid;
+
+	if (idr_alloc_ext(&ht->handle_idr, NULL, &idr_index,
+			  start, max + 1, GFP_KERNEL)) {
+		if (idr_alloc_ext(&ht->handle_idr, NULL, &idr_index,
+				  min + 1, max + 1, GFP_KERNEL))
+			return max;
+	}
 
-	kfree(bitmap);
-	return handle | (i >= NR_U32_NODE ? 0xFFF : i);
+	return (u32)idr_index;
 }
 
 static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = {
@@ -806,6 +808,7 @@ static void u32_replace_knode(struct tcf_proto *tp, struct tc_u_common *tp_c,
 		if (pins->handle == n->handle)
 			break;
 
+	idr_replace_ext(&ht->handle_idr, n, n->handle);
 	RCU_INIT_POINTER(n->next, pins->next);
 	rcu_assign_pointer(*ins, n);
 }
@@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 			return -EINVAL;
 		if (TC_U32_KEY(handle))
 			return -EINVAL;
-		if (handle == 0) {
-			handle = gen_new_htid(tp->data);
-			if (handle == 0)
-				return -ENOMEM;
-		}
 		ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
 		if (ht == NULL)
 			return -ENOBUFS;
+		if (handle == 0) {
+			handle = gen_new_htid(tp->data, ht);
+			if (handle == 0) {
+				kfree(ht);
+				return -ENOMEM;
+			}
+		} else {
+			err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
+					    handle, handle + 1, GFP_KERNEL);
+			if (err) {
+				kfree(ht);
+				return err;
+			}
+		}
 		ht->tp_c = tp_c;
 		ht->refcnt = 1;
 		ht->divisor = divisor;
 		ht->handle = handle;
 		ht->prio = tp->prio;
+		idr_init(&ht->handle_idr);
 
 		err = u32_replace_hw_hnode(tp, ht, flags);
 		if (err) {
+			idr_remove_ext(&tp_c->handle_idr, handle);
 			kfree(ht);
 			return err;
 		}
@@ -986,24 +1000,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 		if (TC_U32_HTID(handle) && TC_U32_HTID(handle^htid))
 			return -EINVAL;
 		handle = htid | TC_U32_NODE(handle);
+		err = idr_alloc_ext(&ht->handle_idr, NULL, NULL,
+				    handle, handle + 1,
+				    GFP_KERNEL);
+		if (err)
+			return err;
 	} else
 		handle = gen_new_kid(ht, htid);
 
-	if (tb[TCA_U32_SEL] == NULL)
-		return -EINVAL;
+	if (tb[TCA_U32_SEL] == NULL) {
+		err = -EINVAL;
+		goto erridr;
+	}
 
 	s = nla_data(tb[TCA_U32_SEL]);
 
 	n = kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL);
-	if (n == NULL)
-		return -ENOBUFS;
+	if (n == NULL) {
+		err = -ENOBUFS;
+		goto erridr;
+	}
 
 #ifdef CONFIG_CLS_U32_PERF
 	size = sizeof(struct tc_u32_pcnt) + s->nkeys * sizeof(u64);
 	n->pf = __alloc_percpu(size, __alignof__(struct tc_u32_pcnt));
 	if (!n->pf) {
-		kfree(n);
-		return -ENOBUFS;
+		err = -ENOBUFS;
+		goto errfree;
 	}
 #endif
 
@@ -1066,9 +1089,12 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 errout:
 	tcf_exts_destroy(&n->exts);
 #ifdef CONFIG_CLS_U32_PERF
+errfree:
 	free_percpu(n->pf);
 #endif
 	kfree(n);
+erridr:
+	idr_remove_ext(&ht->handle_idr, handle);
 	return err;
 }
 
-- 
2.13.0

^ permalink raw reply related

* [Patch net-next v2] net_sched: use idr to allocate basic filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Chris Mi, Jamal Hadi Salim

Instead of calling basic_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_basic.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index d89ebafd2239..cfeb6f158566 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -17,13 +17,14 @@
 #include <linux/errno.h>
 #include <linux/rtnetlink.h>
 #include <linux/skbuff.h>
+#include <linux/idr.h>
 #include <net/netlink.h>
 #include <net/act_api.h>
 #include <net/pkt_cls.h>
 
 struct basic_head {
-	u32			hgenerator;
 	struct list_head	flist;
+	struct idr		handle_idr;
 	struct rcu_head		rcu;
 };
 
@@ -78,6 +79,7 @@ static int basic_init(struct tcf_proto *tp)
 	if (head == NULL)
 		return -ENOBUFS;
 	INIT_LIST_HEAD(&head->flist);
+	idr_init(&head->handle_idr);
 	rcu_assign_pointer(tp->root, head);
 	return 0;
 }
@@ -99,8 +101,10 @@ static void basic_destroy(struct tcf_proto *tp)
 	list_for_each_entry_safe(f, n, &head->flist, link) {
 		list_del_rcu(&f->link);
 		tcf_unbind_filter(tp, &f->res);
+		idr_remove_ext(&head->handle_idr, f->handle);
 		call_rcu(&f->rcu, basic_delete_filter);
 	}
+	idr_destroy(&head->handle_idr);
 	kfree_rcu(head, rcu);
 }
 
@@ -111,6 +115,7 @@ static int basic_delete(struct tcf_proto *tp, void *arg, bool *last)
 
 	list_del_rcu(&f->link);
 	tcf_unbind_filter(tp, &f->res);
+	idr_remove_ext(&head->handle_idr, f->handle);
 	call_rcu(&f->rcu, basic_delete_filter);
 	*last = list_empty(&head->flist);
 	return 0;
@@ -154,6 +159,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	struct nlattr *tb[TCA_BASIC_MAX + 1];
 	struct basic_filter *fold = (struct basic_filter *) *arg;
 	struct basic_filter *fnew;
+	unsigned long idr_index;
 
 	if (tca[TCA_OPTIONS] == NULL)
 		return -EINVAL;
@@ -179,30 +185,31 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	err = -EINVAL;
 	if (handle) {
 		fnew->handle = handle;
-	} else if (fold) {
-		fnew->handle = fold->handle;
+		if (!fold) {
+			err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
+					    handle, handle + 1, GFP_KERNEL);
+			if (err)
+				goto errout;
+		}
 	} else {
-		unsigned int i = 0x80000000;
-		do {
-			if (++head->hgenerator == 0x7FFFFFFF)
-				head->hgenerator = 1;
-		} while (--i > 0 && basic_get(tp, head->hgenerator));
-
-		if (i <= 0) {
-			pr_err("Insufficient number of handles\n");
+		err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
+				    1, 0x7FFFFFFF, GFP_KERNEL);
+		if (err)
 			goto errout;
-		}
-
-		fnew->handle = head->hgenerator;
+		fnew->handle = idr_index;
 	}
 
 	err = basic_set_parms(net, tp, fnew, base, tb, tca[TCA_RATE], ovr);
-	if (err < 0)
+	if (err < 0) {
+		if (!fold)
+			idr_remove_ext(&head->handle_idr, fnew->handle);
 		goto errout;
+	}
 
 	*arg = fnew;
 
 	if (fold) {
+		idr_replace_ext(&head->handle_idr, fnew, fnew->handle);
 		list_replace_rcu(&fold->link, &fnew->link);
 		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, basic_delete_filter);
-- 
2.13.0

^ permalink raw reply related

* [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Daniel Borkmann, Chris Mi, Jamal Hadi Salim

Instead of calling cls_bpf_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_bpf.c | 57 ++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 29 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 520c5027646a..e19c3c3c27cc 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/idr.h>
 
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
@@ -32,7 +33,7 @@ MODULE_DESCRIPTION("TC BPF based classifier");
 
 struct cls_bpf_head {
 	struct list_head plist;
-	u32 hgen;
+	struct idr handle_idr;
 	struct rcu_head rcu;
 };
 
@@ -238,6 +239,7 @@ static int cls_bpf_init(struct tcf_proto *tp)
 		return -ENOBUFS;
 
 	INIT_LIST_HEAD_RCU(&head->plist);
+	idr_init(&head->handle_idr);
 	rcu_assign_pointer(tp->root, head);
 
 	return 0;
@@ -264,6 +266,9 @@ static void cls_bpf_delete_prog_rcu(struct rcu_head *rcu)
 
 static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
+	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+
+	idr_remove_ext(&head->handle_idr, prog->handle);
 	cls_bpf_stop_offload(tp, prog);
 	list_del_rcu(&prog->link);
 	tcf_unbind_filter(tp, &prog->res);
@@ -287,6 +292,7 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 	list_for_each_entry_safe(prog, tmp, &head->plist, link)
 		__cls_bpf_delete(tp, prog);
 
+	idr_destroy(&head->handle_idr);
 	kfree_rcu(head, rcu);
 }
 
@@ -421,27 +427,6 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp,
 	return 0;
 }
 
-static u32 cls_bpf_grab_new_handle(struct tcf_proto *tp,
-				   struct cls_bpf_head *head)
-{
-	unsigned int i = 0x80000000;
-	u32 handle;
-
-	do {
-		if (++head->hgen == 0x7FFFFFFF)
-			head->hgen = 1;
-	} while (--i > 0 && cls_bpf_get(tp, head->hgen));
-
-	if (unlikely(i == 0)) {
-		pr_err("Insufficient number of handles\n");
-		handle = 0;
-	} else {
-		handle = head->hgen;
-	}
-
-	return handle;
-}
-
 static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 			  struct tcf_proto *tp, unsigned long base,
 			  u32 handle, struct nlattr **tca,
@@ -451,6 +436,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_bpf_prog *oldprog = *arg;
 	struct nlattr *tb[TCA_BPF_MAX + 1];
 	struct cls_bpf_prog *prog;
+	unsigned long idr_index;
 	int ret;
 
 	if (tca[TCA_OPTIONS] == NULL)
@@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 		}
 	}
 
-	if (handle == 0)
-		prog->handle = cls_bpf_grab_new_handle(tp, head);
-	else
+	if (handle == 0) {
+		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
+				    1, 0x7FFFFFFF, GFP_KERNEL);
+		if (ret)
+			goto errout;
+		prog->handle = idr_index;
+	} else {
+		if (!oldprog) {
+			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
+					    handle, handle + 1, GFP_KERNEL);
+			if (ret)
+				goto errout;
+		}
 		prog->handle = handle;
-	if (prog->handle == 0) {
-		ret = -EINVAL;
-		goto errout;
 	}
 
 	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
 	if (ret < 0)
-		goto errout;
+		goto errout_idr;
 
 	ret = cls_bpf_offload(tp, prog, oldprog);
 	if (ret) {
+		if (!oldprog)
+			idr_remove_ext(&head->handle_idr, prog->handle);
 		__cls_bpf_delete_prog(prog);
 		return ret;
 	}
@@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
 	if (oldprog) {
+		idr_replace_ext(&head->handle_idr, prog, handle);
 		list_replace_rcu(&oldprog->link, &prog->link);
 		tcf_unbind_filter(tp, &oldprog->res);
 		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
@@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	*arg = prog;
 	return 0;
 
+errout_idr:
+	if (!oldprog)
+		idr_remove_ext(&head->handle_idr, prog->handle);
 errout:
 	tcf_exts_destroy(&prog->exts);
 	kfree(prog);
-- 
2.13.0

^ permalink raw reply related

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Simon Horman @ 2017-09-25 17:04 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <CAJ3xEMgfRuD2Y-P2eTKmXpciiopA8js4zbO_nNOA6J62sn--mw@mail.gmail.com>

On Mon, Sep 25, 2017 at 06:25:03PM +0300, Or Gerlitz wrote:
> On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > From: Simon Horman <simon.horman@netronome.com>
> >
> > John says:
> >
> > This patch set allows offloading of TC flower match and set tunnel fields
> > to the NFP. The initial focus is on VXLAN traffic. Due to the current
> > state of the NFP firmware, only VXLAN traffic on well known port 4789 is
> > handled. The match and action fields must explicity set this value to be
> > supported. Tunnel end point information is also offloaded to the NFP for
> > both encapsulation and decapsulation. The NFP expects 3 separate data sets
> > to be supplied.
> 
> > For decapsulation, 2 separate lists exist; a list of MAC addresses
> > referenced by an index comprised of the port number, and a list of IP
> > addresses. These IP addresses are not connected to a MAC or port.
> 
> Do these IP addresses exist on the host kernel SW stack? can the same
> set of TC rules be fully functional and generate the same traffic
> pattern when set to run in SW (skip_hw)?

Hi Or,

I asked John (now CCed) about this and his response was:

The MAC addresses are extracted from the netdevs already loaded in the
kernel and are monitored for any changes. The IP addresses are slightly
different in that they are extracted from the rules themselves. We make the
assumption that, if a packet is decapsulated at the end point and a match
is attempted on the IP address, that this IP address should be recognised
in the kernel. That being the case, the same traffic pattern should be
witnessed if the skip_hw flag is applied.

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: Eric Dumazet @ 2017-09-25 16:52 UTC (permalink / raw)
  To: David Ahern; +Cc: David Miller, netdev, Stephen Hemminger
In-Reply-To: <ce328dfd-0b54-883a-36a8-91ec344f86ad@gmail.com>

On Mon, 2017-09-25 at 10:14 -0600, David Ahern wrote:

> Thanks for the test.
> 
> I made a simple program this morning and ran it under perf. With the
> above suggestion the rb_erase has a high cost because it always deletes
> the root node. Your method 1 has a high cost on rb_first which is
> expected given its definition and it is run on each removal. Both
> options increase in time with the number of entries in the tree.
> 
> Your method 2 is fairly constant from 10,000 entries to 10M entries
> which makes sense: a one time cost at finding rb_first and then always
> removing a bottom node so rb_erase is light.
> 
> As for the change:
> 
> Acked-by: David Ahern <dsahern@gmail.com>

Thanks a lot for double checking !

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 16:49 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Daniel Borkmann, Chris Mi,
	Jamal Hadi Salim
In-Reply-To: <CAM_iQpVOnzqAs6sT8=n+n2zFUNd_K4evph3=68NL1CtBN=AAmg@mail.gmail.com>

On Mon, Sep 25, 2017 at 9:46 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sat, Sep 23, 2017 at 5:15 PM, David Miller <davem@davemloft.net> wrote:
>> If I read the code properly, the existing code allows IDs from '1' to
>> and including '0x7ffffffe', whereas the new code allows up to and
>> including '0x7ffffffff'.
>>
>> Whether intentional or not this is a change in behavior.  If it's OK,
>> it should be at least documented either in a comment or in the
>> commit log itself.
>>
>> Similarly for your other scheduler IDR changes.
>
> Good catch! Obviously off-by-one. I agree we should keep
> the old behavior no matter the reason.
>
> I will update the patches.
>

By the way, same problem for commit fe2502e49b58606580c77.
:)

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 16:46 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Daniel Borkmann, Chris Mi,
	Jamal Hadi Salim
In-Reply-To: <20170923.171534.1037317649165173409.davem@davemloft.net>

On Sat, Sep 23, 2017 at 5:15 PM, David Miller <davem@davemloft.net> wrote:
> If I read the code properly, the existing code allows IDs from '1' to
> and including '0x7ffffffe', whereas the new code allows up to and
> including '0x7ffffffff'.
>
> Whether intentional or not this is a change in behavior.  If it's OK,
> it should be at least documented either in a comment or in the
> commit log itself.
>
> Similarly for your other scheduler IDR changes.

Good catch! Obviously off-by-one. I agree we should keep
the old behavior no matter the reason.

I will update the patches.

Thanks.

^ permalink raw reply

* Re: [PATCH,v3,net-next 2/2] tun: enable napi_gro_frags() for TUN/TAP driver
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-09-25 16:27 UTC (permalink / raw)
  To: Petar Penkov
  Cc: linux-netdev, Eric Dumazet, Willem Bruijn, David Miller,
	Petar Bozhidarov Penkov
In-Reply-To: <20170922204915.7889-3-peterpenkov96@gmail.com>

On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
>
> Furthermore, packets follow the layout of the iovec_iter that was
> received. The first iovec is the linear data, and every one after the
> first is a fragment. If there are more fragments than the max number,
> drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
> dissector code and to verify that the header resides in the linear data.
>
> The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
> This is imposed because this mode is intended for testing via tools like
> syzkaller and packetdrill, and the increased flexibility it provides can
> introduce security vulnerabilities. This flag is accepted only if the
> device is in TAP mode and has the IFF_NAPI flag set as well. This is
> done because both of these are explicit requirements for correct
> operation in this mode.
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>

Thank you Petar.

Acked-by: Mahesh Bandewar <maheshb@google,com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
>  drivers/net/tun.c           | 134 ++++++++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 129 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index f16407242b18..9880b3bc8fa5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,7 @@
>  #include <linux/skb_array.h>
>  #include <linux/bpf.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/mutex.h>
>
>  #include <linux/uaccess.h>
>
> @@ -121,7 +122,8 @@ do {                                                                \
>  #define TUN_VNET_BE     0x40000000
>
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> -                     IFF_MULTI_QUEUE | IFF_NAPI)
> +                     IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> +
>  #define GOODCOPY_LEN 128
>
>  #define FLT_EXACT_COUNT 8
> @@ -173,6 +175,7 @@ struct tun_file {
>                 unsigned int ifindex;
>         };
>         struct napi_struct napi;
> +       struct mutex napi_mutex;        /* Protects access to the above napi */
>         struct list_head next;
>         struct tun_struct *detached;
>         struct skb_array tx_array;
> @@ -277,6 +280,7 @@ static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile,
>                 netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
>                                NAPI_POLL_WEIGHT);
>                 napi_enable(&tfile->napi);
> +               mutex_init(&tfile->napi_mutex);
>         }
>  }
>
> @@ -292,6 +296,11 @@ static void tun_napi_del(struct tun_struct *tun, struct tun_file *tfile)
>                 netif_napi_del(&tfile->napi);
>  }
>
> +static bool tun_napi_frags_enabled(const struct tun_struct *tun)
> +{
> +       return READ_ONCE(tun->flags) & IFF_NAPI_FRAGS;
> +}
> +
>  #ifdef CONFIG_TUN_VNET_CROSS_LE
>  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
>  {
> @@ -1036,7 +1045,8 @@ static void tun_poll_controller(struct net_device *dev)
>          * supports polling, which enables bridge devices in virt setups to
>          * still use netconsole
>          * If NAPI is enabled, however, we need to schedule polling for all
> -        * queues.
> +        * queues unless we are using napi_gro_frags(), which we call in
> +        * process context and not in NAPI context.
>          */
>         struct tun_struct *tun = netdev_priv(dev);
>
> @@ -1044,6 +1054,9 @@ static void tun_poll_controller(struct net_device *dev)
>                 struct tun_file *tfile;
>                 int i;
>
> +               if (tun_napi_frags_enabled(tun))
> +                       return;
> +
>                 rcu_read_lock();
>                 for (i = 0; i < tun->numqueues; i++) {
>                         tfile = rcu_dereference(tun->tfiles[i]);
> @@ -1266,6 +1279,64 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>         return mask;
>  }
>
> +static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
> +                                           size_t len,
> +                                           const struct iov_iter *it)
> +{
> +       struct sk_buff *skb;
> +       size_t linear;
> +       int err;
> +       int i;
> +
> +       if (it->nr_segs > MAX_SKB_FRAGS + 1)
> +               return ERR_PTR(-ENOMEM);
> +
> +       local_bh_disable();
> +       skb = napi_get_frags(&tfile->napi);
> +       local_bh_enable();
> +       if (!skb)
> +               return ERR_PTR(-ENOMEM);
> +
> +       linear = iov_iter_single_seg_count(it);
> +       err = __skb_grow(skb, linear);
> +       if (err)
> +               goto free;
> +
> +       skb->len = len;
> +       skb->data_len = len - linear;
> +       skb->truesize += skb->data_len;
> +
> +       for (i = 1; i < it->nr_segs; i++) {
> +               size_t fragsz = it->iov[i].iov_len;
> +               unsigned long offset;
> +               struct page *page;
> +               void *data;
> +
> +               if (fragsz == 0 || fragsz > PAGE_SIZE) {
> +                       err = -EINVAL;
> +                       goto free;
> +               }
> +
> +               local_bh_disable();
> +               data = napi_alloc_frag(fragsz);
> +               local_bh_enable();
> +               if (!data) {
> +                       err = -ENOMEM;
> +                       goto free;
> +               }
> +
> +               page = virt_to_head_page(data);
> +               offset = data - page_address(page);
> +               skb_fill_page_desc(skb, i - 1, page, offset, fragsz);
> +       }
> +
> +       return skb;
> +free:
> +       /* frees skb and all frags allocated with napi_alloc_frag() */
> +       napi_free_frags(&tfile->napi);
> +       return ERR_PTR(err);
> +}
> +
>  /* prepad is the amount to reserve at front.  len is length after that.
>   * linear is a hint as to how much to copy (usually headers). */
>  static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
> @@ -1478,6 +1549,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>         int err;
>         u32 rxhash;
>         int skb_xdp = 1;
> +       bool frags = tun_napi_frags_enabled(tun);
>
>         if (!(tun->dev->flags & IFF_UP))
>                 return -EIO;
> @@ -1535,7 +1607,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                         zerocopy = true;
>         }
>
> -       if (tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
> +       if (!frags && tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
>                 /* For the packet that is not easy to be processed
>                  * (e.g gso or jumbo packet), we will do it at after
>                  * skb was created with generic XDP routine.
> @@ -1556,10 +1628,24 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                                 linear = tun16_to_cpu(tun, gso.hdr_len);
>                 }
>
> -               skb = tun_alloc_skb(tfile, align, copylen, linear, noblock);
> +               if (frags) {
> +                       mutex_lock(&tfile->napi_mutex);
> +                       skb = tun_napi_alloc_frags(tfile, copylen, from);
> +                       /* tun_napi_alloc_frags() enforces a layout for the skb.
> +                        * If zerocopy is enabled, then this layout will be
> +                        * overwritten by zerocopy_sg_from_iter().
> +                        */
> +                       zerocopy = false;
> +               } else {
> +                       skb = tun_alloc_skb(tfile, align, copylen, linear,
> +                                           noblock);
> +               }
> +
>                 if (IS_ERR(skb)) {
>                         if (PTR_ERR(skb) != -EAGAIN)
>                                 this_cpu_inc(tun->pcpu_stats->rx_dropped);
> +                       if (frags)
> +                               mutex_unlock(&tfile->napi_mutex);
>                         return PTR_ERR(skb);
>                 }
>
> @@ -1571,6 +1657,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                 if (err) {
>                         this_cpu_inc(tun->pcpu_stats->rx_dropped);
>                         kfree_skb(skb);
> +                       if (frags) {
> +                               tfile->napi.skb = NULL;
> +                               mutex_unlock(&tfile->napi_mutex);
> +                       }
> +
>                         return -EFAULT;
>                 }
>         }
> @@ -1578,6 +1669,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>         if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
>                 this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
>                 kfree_skb(skb);
> +               if (frags) {
> +                       tfile->napi.skb = NULL;
> +                       mutex_unlock(&tfile->napi_mutex);
> +               }
> +
>                 return -EINVAL;
>         }
>
> @@ -1603,7 +1699,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                 skb->dev = tun->dev;
>                 break;
>         case IFF_TAP:
> -               skb->protocol = eth_type_trans(skb, tun->dev);
> +               if (!frags)
> +                       skb->protocol = eth_type_trans(skb, tun->dev);
>                 break;
>         }
>
> @@ -1638,7 +1735,23 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>
>         rxhash = __skb_get_hash_symmetric(skb);
>
> -       if (tun->flags & IFF_NAPI) {
> +       if (frags) {
> +               /* Exercise flow dissector code path. */
> +               u32 headlen = eth_get_headlen(skb->data, skb_headlen(skb));
> +
> +               if (headlen > skb_headlen(skb) || headlen < ETH_HLEN) {
> +                       this_cpu_inc(tun->pcpu_stats->rx_dropped);
> +                       napi_free_frags(&tfile->napi);
> +                       mutex_unlock(&tfile->napi_mutex);
> +                       WARN_ON(1);
> +                       return -ENOMEM;
> +               }
> +
> +               local_bh_disable();
> +               napi_gro_frags(&tfile->napi);
> +               local_bh_enable();
> +               mutex_unlock(&tfile->napi_mutex);
> +       } else if (tun->flags & IFF_NAPI) {
>                 struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
>                 int queue_len;
>
> @@ -2061,6 +2174,15 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>         if (tfile->detached)
>                 return -EINVAL;
>
> +       if ((ifr->ifr_flags & IFF_NAPI_FRAGS)) {
> +               if (!capable(CAP_NET_ADMIN))
> +                       return -EPERM;
> +
> +               if (!(ifr->ifr_flags & IFF_NAPI) ||
> +                   (ifr->ifr_flags & TUN_TYPE_MASK) != IFF_TAP)
> +                       return -EINVAL;
> +       }
> +
>         dev = __dev_get_by_name(net, ifr->ifr_name);
>         if (dev) {
>                 if (ifr->ifr_flags & IFF_TUN_EXCL)
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 30b6184884eb..365ade5685c9 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -61,6 +61,7 @@
>  #define IFF_TUN                0x0001
>  #define IFF_TAP                0x0002
>  #define IFF_NAPI       0x0010
> +#define IFF_NAPI_FRAGS 0x0020
>  #define IFF_NO_PI      0x1000
>  /* This flag has no real effect */
>  #define IFF_ONE_QUEUE  0x2000
> --
> 2.11.0
>

^ permalink raw reply

* Re: [PATCH,v3,net-next 1/2] tun: enable NAPI for TUN/TAP driver
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-09-25 16:23 UTC (permalink / raw)
  To: Petar Penkov
  Cc: linux-netdev, Eric Dumazet, Willem Bruijn, David Miller,
	Petar Bozhidarov Penkov
In-Reply-To: <20170922204915.7889-2-peterpenkov96@gmail.com>

On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
> changes and operation is not affected if the flag is disabled.  SKBs
> are constructed upon packet arrival and are queued to be processed
> later.
>
> The new path was evaluated with a benchmark with the following setup:
> Open two tap devices and a receiver thread that reads in a loop for
> each device. Start one sender thread and pin all threads to different
> CPUs. Send 1M minimum UDP packets to each device and measure sending
> time for each of the sending methods:
>         napi_gro_receive():     4.90s
>         netif_rx_ni():          4.90s
>         netif_receive_skb():    7.20s
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>

Thank you Petar.

Acked-by: Mahesh Bandewar <maheshb@google.com>

> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
>  drivers/net/tun.c           | 133 +++++++++++++++++++++++++++++++++++++++-----
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 119 insertions(+), 15 deletions(-)
>

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Ahern @ 2017-09-25 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Stephen Hemminger
In-Reply-To: <1506317240.6617.5.camel@edumazet-glaptop3.roam.corp.google.com>

On 9/24/17 11:27 PM, Eric Dumazet wrote:
> On Sun, 2017-09-24 at 20:05 -0600, David Ahern wrote:
>> On 9/24/17 7:57 PM, David Ahern wrote:
> 
>>> Hi Eric:
>>>
>>> I'm guessing the cost is in the rb_first and rb_next computations. Did
>>> you consider something like this:
>>>
>>>         struct rb_root *root
>>>         struct rb_node **p = &root->rb_node;
>>>
>>>         while (*p != NULL) {
>>>                 struct foobar *fb;
>>>
>>>                 fb = container_of(*p, struct foobar, rb_node);
>>>                 // fb processing
>> 		  rb_erase(&nh->rb_node, root);
>>
>>>                 p = &root->rb_node;
>>>         }
>>>
>>
>> Oops, dropped the rb_erase in my consolidating the code to this snippet.
> 
> Hi David
> 
> This gives about same numbers than method_1
> 
> I tried with 10^7 skbs in the tree :
> 
> Your suggestion takes 66ns per skb, while the one I chose takes 37ns per
> skb.

Thanks for the test.

I made a simple program this morning and ran it under perf. With the
above suggestion the rb_erase has a high cost because it always deletes
the root node. Your method 1 has a high cost on rb_first which is
expected given its definition and it is run on each removal. Both
options increase in time with the number of entries in the tree.

Your method 2 is fairly constant from 10,000 entries to 10M entries
which makes sense: a one time cost at finding rb_first and then always
removing a bottom node so rb_erase is light.

As for the change:

Acked-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [PATCH net-next] inetpeer: speed up inetpeer_invalidate_tree()
From: Eric Dumazet @ 2017-09-25 16:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

From: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inetpeer.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e7eb590c86ce2b33654c17c61619de74ff07bfd1..6e5626cc366c150b13fd44a8e36169c2fb54476d 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -284,14 +284,17 @@ EXPORT_SYMBOL(inet_peer_xrlim_allow);
 
 void inetpeer_invalidate_tree(struct inet_peer_base *base)
 {
-	struct inet_peer *p, *n;
+	struct rb_node *p = rb_first(&base->rb_root);
 
-	rbtree_postorder_for_each_entry_safe(p, n, &base->rb_root, rb_node) {
-		inet_putpeer(p);
+	while (p) {
+		struct inet_peer *peer = rb_entry(p, struct inet_peer, rb_node);
+
+		p = rb_next(p);
+		rb_erase(&peer->rb_node, &base->rb_root);
+		inet_putpeer(peer);
 		cond_resched();
 	}
 
-	base->rb_root = RB_ROOT;
 	base->total = 0;
 }
 EXPORT_SYMBOL(inetpeer_invalidate_tree);

^ permalink raw reply related

* Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Yonghong Song @ 2017-09-25 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov, David Miller, peterz
  Cc: rostedt, daniel, netdev, kernel-team
In-Reply-To: <b9753171-8e70-92f2-9852-8d3a33f2c07c@fb.com>



On 9/21/17 8:15 AM, Alexei Starovoitov wrote:
> On 9/20/17 4:07 PM, David Miller wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Wed, 20 Sep 2017 19:26:51 +0200
>>
>>> Dave, could we have this in a topic tree of sorts, because I have a
>>> pending series to rework all the timekeeping and it might be nice to not
>>> have sfr run into all sorts of conflicts.
>>
>> If you want to merge it into your tree that's fine.
>>
>> But it means that any further development done on top of this
>> work will need to go via you as well.
> 
> can we merge this set of patches into both net-next and tip ?
> We did such tricks in the past to avoid merge conflicts and
> everything went fine during merge window.

Hi, Peter,

Any suggestion for merging this patch? Alexei's idea sounds
reasonable?

Thanks,

Yonghong

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox