Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:50 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-3-jiri@resnulli.us>

On 28/09/17 20:34, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> In order to allow the ipmr module to do partial multicast forwarding
> according to the device parent ID, add the device parent ID field to the
> VIF struct. This way, the forwarding path can use the parent ID field
> without invoking switchdev calls, which requires the RTNL lock.
> 
> When a new VIF is added, set the device parent ID field in it by invoking
> the switchdev_port_attr_get call.
> 
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/mroute.h | 2 ++
>  net/ipv4/ipmr.c        | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
> index b072a84..a46577f 100644
> --- a/include/linux/mroute.h
> +++ b/include/linux/mroute.h
> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>  
>  struct vif_device {
>  	struct net_device 	*dev;			/* Device we are using */
> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
> +	bool		dev_parent_id_valid;
>  	unsigned long	bytes_in,bytes_out;
>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 292a8e8..4566c54 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -67,6 +67,7 @@
>  #include <net/fib_rules.h>
>  #include <linux/netconf.h>
>  #include <net/nexthop.h>
> +#include <net/switchdev.h>
>  
>  struct ipmr_rule {
>  	struct fib_rule		common;
> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  		   struct vifctl *vifc, int mrtsock)
>  {
>  	int vifi = vifc->vifc_vifi;
> +	struct switchdev_attr attr = {
> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
> +	};
>  	struct vif_device *v = &mrt->vif_table[vifi];
>  	struct net_device *dev;
>  	struct in_device *in_dev;
> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  
>  	/* Fill in the VIF structures */
>  
> +	attr.orig_dev = dev;
> +	if (!switchdev_port_attr_get(dev, &attr)) {
> +		v->dev_parent_id_valid = true;
> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> +	}
>  	v->rate_limit = vifc->vifc_rate_limit;
>  	v->local = vifc->vifc_lcl_addr.s_addr;
>  	v->remote = vifc->vifc_rmt_addr.s_addr;
> 

One more thing - what happens on vif delete, then add with the same vif index of another
device that doesn't have a parent id ? I think the vif will be stuck with its parent_id
when it gets set.

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jason Wang @ 2017-09-29  9:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: jakub.kicinski, Michael S. Tsirkin, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Alexei Starovoitov,
	Andy Gospodarek
In-Reply-To: <150660343811.2808.7680200486950101509.stgit@firesoul>



On 2017年09月28日 20:57, Jesper Dangaard Brouer wrote:
> +};
> +
> +/* Convert xdp_buff to xdp_pkt */
> +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> +{
> +	struct xdp_pkt *xdp_pkt;
> +	int headroom;
> +
> +	/* Assure headroom is available for storing info */
> +	headroom = xdp->data - xdp->data_hard_start;
> +	if (headroom < sizeof(*xdp_pkt))
> +		return NULL;

Hi Jesper:

Do you consider this as a trick or a long term solution? Is it better to 
store XDP in a circular buffer? (I'm asking since I meet similar issue 
when doing xdp_xmit for tun).

> +
> +	/* Store info in top of packet */
> +	xdp_pkt = xdp->data_hard_start;
> +
> +	xdp_pkt->data = xdp->data;
> +	xdp_pkt->len  = xdp->data_end - xdp->data;
> +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> +

Is wmb() needed here?

> +	return xdp_pkt;
> +}

Thanks

^ permalink raw reply

* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:45 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <7618b2e8-e7f9-d2c6-b13c-53aef0f50de0@cumulusnetworks.com>

On 29/09/17 12:29, Nikolay Aleksandrov wrote:
> On 28/09/17 20:34, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>>
>> In order to allow the ipmr module to do partial multicast forwarding
>> according to the device parent ID, add the device parent ID field to the
>> VIF struct. This way, the forwarding path can use the parent ID field
>> without invoking switchdev calls, which requires the RTNL lock.
>>
>> When a new VIF is added, set the device parent ID field in it by invoking
>> the switchdev_port_attr_get call.
>>
>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/mroute.h | 2 ++
>>  net/ipv4/ipmr.c        | 9 +++++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index b072a84..a46577f 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>>  
>>  struct vif_device {
>>  	struct net_device 	*dev;			/* Device we are using */
>> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
>> +	bool		dev_parent_id_valid;
>>  	unsigned long	bytes_in,bytes_out;
>>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 292a8e8..4566c54 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -67,6 +67,7 @@
>>  #include <net/fib_rules.h>
>>  #include <linux/netconf.h>
>>  #include <net/nexthop.h>
>> +#include <net/switchdev.h>
>>  
>>  struct ipmr_rule {
>>  	struct fib_rule		common;
>> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  		   struct vifctl *vifc, int mrtsock)
>>  {
>>  	int vifi = vifc->vifc_vifi;
>> +	struct switchdev_attr attr = {
>> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>> +	};
>>  	struct vif_device *v = &mrt->vif_table[vifi];
>>  	struct net_device *dev;
>>  	struct in_device *in_dev;
>> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  
>>  	/* Fill in the VIF structures */
>>  
>> +	attr.orig_dev = dev;
>> +	if (!switchdev_port_attr_get(dev, &attr)) {
>> +		v->dev_parent_id_valid = true;
>> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> 
> Hmm, shouldn't you set dev_parent_id.id_len too ? It would seem netdev_phys_item_id_same()
> uses it in the comparison and without the len it would always look like they're the same
> because memcmp will simply return 0 with count = 0.

Also maybe we can use the non-zero id_len as a signal that it was set and drop the dev_parent_id_valid
field altogether, it would seem there's no valid reason to have id_len == 0 and yet expect a valid
parent_id.

> 
>> +	}
>>  	v->rate_limit = vifc->vifc_rate_limit;
>>  	v->local = vifc->vifc_lcl_addr.s_addr;
>>  	v->remote = vifc->vifc_rmt_addr.s_addr;
>>
> 

^ permalink raw reply

* Re: [PATCH net-next RFC 0/9] net: dsa: PTP timestamping for mv88e6xxx
From: Richard Cochran @ 2017-09-29  9:43 UTC (permalink / raw)
  To: Brandon Streiff
  Cc: netdev, linux-kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot, Erik Hons
In-Reply-To: <1506612341-18061-1-git-send-email-brandon.streiff@ni.com>

Brandon,

On Thu, Sep 28, 2017 at 10:25:32AM -0500, Brandon Streiff wrote:
> - Patch #2: We expose the switch time as a PTP clock but don't support
>   adjustment (max_adj=0).

The driver should implement a cyclecounter/timecounter.

> Our platform adjusted a systemwide oscillator
>   from userspace, so we didn't need adjustment at this layer, but other
>   PTP clock drivers support this and we probably should too.

We don't currently have any way to support this kind of HW or anything
like an external VCO.  I would like to find a way to do this, but that
is a different kettle of fish as it might require changes in the PHC
subsystem.  For this driver, I think we should get it merged using the
cyclecounter/timecounter (as that will benefit lots of users) and
worry about the external oscillator later.
 
> Feedback is appreciated.

I happy to see this series.  I just finished porting an out-of-tree
PHC driver for the Marvell mv88e635x, and I want to mainline it, but I
also have a few uglies.

Unfortunately I am in the middle of a move right now, and so my review
of this series might have to wait a bit.  However, I am looking
forward to comparing notes, and then getting this support in.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next 0/3] support changing steering policies in tuntap
From: Jason Wang @ 2017-09-29  9:41 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, Network Development, LKML
In-Reply-To: <CAF=yD-KuDZK0-YUfYde=dk_6M6fgj_opuiK2VTfCKDM_=MqDcw@mail.gmail.com>



On 2017年09月29日 00:09, Willem de Bruijn wrote:
> On Thu, Sep 28, 2017 at 3:23 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017年09月28日 07:25, Willem de Bruijn wrote:
>>>>> In the future, both simple and sophisticated policy like RSS or other
>>>>> guest
>>>>> driven steering policies could be done on top.
>>>> IMHO there should be a more practical example before adding all this
>>>> indirection. And it would be nice to understand why this queue selection
>>>> needs to be tun specific.
>>> I was thinking the same and this reminds me of the various strategies
>>> implemented in packet fanout. tun_cpu_select_queue is analogous to
>>> fanout_demux_cpu though it is tun-specific in that it requires
>>> tun->numqueues.
>>
>> Right, the main idea is to introduce a way to change flow steering policy
>> for tun. I think fanout policy could be implemented through the API
>> introduced in this series. (Current flow caches based automatic steering
>> method is tun specific).
>>
>>> Fanout accrued various strategies until it gained an eBPF variant. Just
>>> supporting BPF is probably sufficient here, too.
>>
>> Technically yes, but for tun, it also serve for virt. We probably still need
>> some hard coded policy which could be changed by guest until we can accept
>> an BPF program from guest I think?
> When would a guest choose the policy? As long as this is under control
> of a host user, possibly unprivileged, allowing BPF here is moot, as any
> user can run socket filter BPF already. Programming from the guest is
> indeed different. I don't fully understand that use case.

The problem is userspace (qemu) know little about what kind of workloads 
will be done by guest, so we need guest controllable method here since 
it knows the best steering policy. Rethink about this, instead of 
passing eBPF from guest, qemu can have some pre-defined sets of polices. 
I will change the cpu id based to eBPF based in V2.

Thanks

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Hannes Frederic Sowa @ 2017-09-29  9:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, netdev,
	jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, edumazet
In-Reply-To: <1506677857.2478.5.camel@redhat.com>

Paolo Abeni <pabeni@redhat.com> writes:

> On Fri, 2017-09-29 at 09:56 +0200, Hannes Frederic Sowa wrote:
>> [adding Paolo, Eric]
>> 
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> > On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
>> 
>> [...]
>> 
>> > > +	wake_up_process(rcpu->kthread);
>> > 
>> > In general the whole thing looks like 'threaded NAPI' that Hannes was
>> > proposing some time back. I liked it back then and I like it now.
>> > I don't remember what were the objections back then.
>> > Something scheduler related?
>> > Adding Hannes.
>
> Beyond the added scheduling complexity, the threaded NAPI
> implementation proposed some time ago also possibly introduced OoO
> packet delivery, because the NAPI threads were left unbound to any CPU.

Right, yes, but that can be resolved. The problem was just in that
particular patch.

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Paolo Abeni @ 2017-09-29  9:37 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, netdev, jakub.kicinski,
	Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Andy Gospodarek, edumazet
In-Reply-To: <8737760wg5.fsf@stressinduktion.org>

On Fri, 2017-09-29 at 09:56 +0200, Hannes Frederic Sowa wrote:
> [adding Paolo, Eric]
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> 
> [...]
> 
> > > +	wake_up_process(rcpu->kthread);
> > 
> > In general the whole thing looks like 'threaded NAPI' that Hannes was
> > proposing some time back. I liked it back then and I like it now.
> > I don't remember what were the objections back then.
> > Something scheduler related?
> > Adding Hannes.

Beyond the added scheduling complexity, the threaded NAPI
implementation proposed some time ago also possibly introduced OoO
packet delivery, because the NAPI threads were left unbound to any CPU.

Cheers,

Paolo

^ permalink raw reply

* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:29 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-3-jiri@resnulli.us>

On 28/09/17 20:34, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> In order to allow the ipmr module to do partial multicast forwarding
> according to the device parent ID, add the device parent ID field to the
> VIF struct. This way, the forwarding path can use the parent ID field
> without invoking switchdev calls, which requires the RTNL lock.
> 
> When a new VIF is added, set the device parent ID field in it by invoking
> the switchdev_port_attr_get call.
> 
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/mroute.h | 2 ++
>  net/ipv4/ipmr.c        | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
> index b072a84..a46577f 100644
> --- a/include/linux/mroute.h
> +++ b/include/linux/mroute.h
> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>  
>  struct vif_device {
>  	struct net_device 	*dev;			/* Device we are using */
> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
> +	bool		dev_parent_id_valid;
>  	unsigned long	bytes_in,bytes_out;
>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 292a8e8..4566c54 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -67,6 +67,7 @@
>  #include <net/fib_rules.h>
>  #include <linux/netconf.h>
>  #include <net/nexthop.h>
> +#include <net/switchdev.h>
>  
>  struct ipmr_rule {
>  	struct fib_rule		common;
> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  		   struct vifctl *vifc, int mrtsock)
>  {
>  	int vifi = vifc->vifc_vifi;
> +	struct switchdev_attr attr = {
> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
> +	};
>  	struct vif_device *v = &mrt->vif_table[vifi];
>  	struct net_device *dev;
>  	struct in_device *in_dev;
> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  
>  	/* Fill in the VIF structures */
>  
> +	attr.orig_dev = dev;
> +	if (!switchdev_port_attr_get(dev, &attr)) {
> +		v->dev_parent_id_valid = true;
> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);

Hmm, shouldn't you set dev_parent_id.id_len too ? It would seem netdev_phys_item_id_same()
uses it in the comparison and without the len it would always look like they're the same
because memcmp will simply return 0 with count = 0.

> +	}
>  	v->rate_limit = vifc->vifc_rate_limit;
>  	v->local = vifc->vifc_lcl_addr.s_addr;
>  	v->remote = vifc->vifc_rmt_addr.s_addr;
> 

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net v2] i40e: Fix limit imprecise of the number of MAC/VLAN that can be added for VFs
From: wangyunjian @ 2017-09-29  9:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jeff Kirsher, Sergei Shtylyov, Netdev, caihe,
	intel-wired-lan
In-Reply-To: <CAKgT0UcsnLUWU8qedNrv+bhbgJ6SC9zWyc7oR9gObcYo2atN5g@mail.gmail.com>



> -----Original Message-----
> From: Alexander Duyck [mailto:alexander.duyck@gmail.com]
> Sent: Thursday, September 28, 2017 11:44 PM
> To: wangyunjian <wangyunjian@huawei.com>
> Cc: David Miller <davem@davemloft.net>; Jeff Kirsher
> <jeffrey.t.kirsher@intel.com>; Sergei Shtylyov
> <sergei.shtylyov@cogentembedded.com>; Netdev
> <netdev@vger.kernel.org>; caihe <caihe@huawei.com>; intel-wired-lan
> <intel-wired-lan@lists.osuosl.org>
> Subject: Re: [Intel-wired-lan] [PATCH net v2] i40e: Fix limit imprecise of the
> number of MAC/VLAN that can be added for VFs
> 
> On Wed, Sep 27, 2017 at 7:01 PM, w00273186 <wangyunjian@huawei.com>
> wrote:
> > From: Yunjian Wang <wangyunjian@huawei.com>
> >
> > Now it doesn't limit the number of MAC/VLAN strictly. When there is more
> > elements in the virtchnl MAC/VLAN list, it can still add successfully.
> 
> You could still add but should you. I'm not clear from this patch
> description what this is supposed to be addressing. If you enable the
> "trust" flag for a VF via the "ip link set dev <iface> vf <vfnum>
> trust on" it can make use of any resources on the device, but without
> that there is an upper limit that is supposed to be enforced to
> prevent the VF from making use of an excessive amount of resources.
> That is what is being enforced by the code you are moving out of the
> way below.

I don't enable the "trust" flag for a VF. But this script can successfully add
MACs more than I40E_VC_MAX_MAC_ADDR_PER_VF(12) in VM. It has
same problem with VLAN.

Test script:

for((i=10;i<50;i++))
do
    ipmaddr add 01:00:5e:01:02:$i  dev eth0
done

for ((i=1;i<40;i++))
do
    ip link add link eth0 name eth0.$i type vlan id $i
done

> 
> > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 27 +++++++++++++--
> -------
> >  1 file changed, 16 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > index 4d1e670..285b96a 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > @@ -2065,11 +2065,6 @@ static inline int i40e_check_vf_permission(struct
> i40e_vf *vf, u8 *macaddr)
> >                 dev_err(&pf->pdev->dev,
> >                         "VF attempting to override administratively set MAC address,
> reload the VF driver to resume normal operation\n");
> >                 ret = -EPERM;
> > -       } else if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
> > -                  !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > -               dev_err(&pf->pdev->dev,
> > -                       "VF is not trusted, switch the VF to trusted to add more
> functionality\n");
> > -               ret = -EPERM;
> >         }
> >         return ret;
> >  }
> > @@ -2128,6 +2123,15 @@ static int i40e_vc_add_mac_addr_msg(struct
> i40e_vf *vf, u8 *msg, u16 msglen)
> >                 } else {
> >                         vf->num_mac++;
> >                 }
> > +
> > +               if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
> > +                   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > +                       dev_err(&pf->pdev->dev,
> > +                               "VF is not trusted, switch the VF to trusted to add more
> functionality\n");
> > +                       ret = -EPERM;
> > +                       spin_unlock_bh(&vsi->mac_filter_hash_lock);
> > +                       goto error_param;
> > +               }
> >         }
> >         spin_unlock_bh(&vsi->mac_filter_hash_lock);
> >
> 
> This doesn't make any sense. You are doing the checks after you have
> already added the MAC. The only part you aren't doing is sending the
> message to the VF indicating that the request was successful.
> 
> > @@ -2221,12 +2225,6 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf
> *vf, u8 *msg, u16 msglen)
> >         i40e_status aq_ret = 0;
> >         int i;
> >
> > -       if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
> > -           !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > -               dev_err(&pf->pdev->dev,
> > -                       "VF is not trusted, switch the VF to trusted to add more VLAN
> addresses\n");
> > -               goto error_param;
> > -       }
> >         if (!test_bit(I40E_VF_STATE_ACTIVE, &vf->vf_states) ||
> >             !i40e_vc_isvalid_vsi_id(vf, vsi_id)) {
> >                 aq_ret = I40E_ERR_PARAM;
> > @@ -2269,6 +2267,13 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf
> *vf, u8 *msg, u16 msglen)
> >                         dev_err(&pf->pdev->dev,
> >                                 "Unable to add VLAN filter %d for VF %d, error %d\n",
> >                                 vfl->vlan_id[i], vf->vf_id, ret);
> > +               if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
> > +                   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > +                       dev_err(&pf->pdev->dev,
> > +                               "VF is not trusted, switch the VF to trusted to add more
> VLAN addresses\n");
> > +                       aq_ret = -EPERM;
> > +                       goto error_param;
> > +               }
> >         }
> >
> >  error_param:
> 
> Same here. You are doing this after the call to i40e_vsi_add_vlan. The
> code makes no sense here. This bit of code is supposed to be
> preventing a VF from abusing resources if the VF is not privelaged.

^ permalink raw reply

* RE: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: David Laight @ 2017-09-29  9:14 UTC (permalink / raw)
  To: 'Andrew Lunn', Tristram.Ha@microchip.com
  Cc: muvarov@gmail.com, pavel@ucw.cz, nathan.leigh.conrad@gmail.com,
	vivien.didelot@savoirfairelinux.com, f.fainelli@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Woojung.Huh@microchip.com
In-Reply-To: <20170928193416.GH14940@lunn.ch>

From: Andrew Lunn
> Sent: 28 September 2017 20:34
...
> > There are 34 counters.  In normal case using generic bus I/O or PCI to read them
> > is very quick, but the switch is mostly accessed using SPI, or even I2C.  As the SPI
> > access is very slow.
> 
> How slow is it? The Marvell switches all use MDIO. It is probably a
> bit faster than I2C, but it is a lot slower than MMIO or PCI.
> 
> ethtool -S lan0 takes about 25ms.

Is the SPI access software bit-banged?
Doing that with software delays isn't friendly to the rest of the system.
(Hardware guys please note...)

One possibility is to rate-limit the stats reading.
Then an application cannot completely 'hog' the SPI bandwidth.

	David

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Jesper Dangaard Brouer @ 2017-09-29  9:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, hannes, brouer
In-Reply-To: <20170929032146.vs5v454wjs4niu4k@ast-mbp>

On Thu, 28 Sep 2017 20:21:47 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> > The 'cpumap' is primary used as a backend map for XDP BPF helper
> > call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > 
> > This patch implement the main part of the map.  It is not connected to
> > the XDP redirect system yet, and no SKB allocation are done yet.
> > 
> > The main concern in this patch is to ensure the datapath can run
> > without any locking.  This adds complexity to the setup and tear-down
> > procedure, which assumptions are extra carefully documented in the
> > code comments.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/linux/bpf_types.h      |    1 
> >  include/uapi/linux/bpf.h       |    1 
> >  kernel/bpf/Makefile            |    1 
> >  kernel/bpf/cpumap.c            |  547 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/bpf/syscall.c           |    8 +
> >  tools/include/uapi/linux/bpf.h |    1 
> >  6 files changed, 558 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/bpf/cpumap.c
> > 
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 6f1a567667b8..814c1081a4a9 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
> >  #ifdef CONFIG_STREAM_PARSER
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
> >  #endif
> > +BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
> >  #endif
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e43491ac4823..f14e15702533 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -111,6 +111,7 @@ enum bpf_map_type {
> >  	BPF_MAP_TYPE_HASH_OF_MAPS,
> >  	BPF_MAP_TYPE_DEVMAP,
> >  	BPF_MAP_TYPE_SOCKMAP,
> > +	BPF_MAP_TYPE_CPUMAP,
> >  };
> >  
> >  enum bpf_prog_type {
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 897daa005b23..dba0bd33a43c 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> >  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> >  ifeq ($(CONFIG_NET),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> > +obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
> >  ifeq ($(CONFIG_STREAM_PARSER),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
> >  endif
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > new file mode 100644
> > index 000000000000..f0948af82e65
> > --- /dev/null
> > +++ b/kernel/bpf/cpumap.c
> > @@ -0,0 +1,547 @@
> > +/* bpf/cpumap.c
> > + *
> > + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> > + * Released under terms in GPL version 2.  See COPYING.
> > + */
> > +
> > +/* The 'cpumap' is primary used as a backend map for XDP BPF helper
> > + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > + *
> > + * Unlike devmap which redirect XDP frames out another NIC device,
> > + * this map type redirect raw XDP frames to another CPU.  The remote
> > + * CPU will do SKB-allocation and call the normal network stack.
> > + *
> > + * This is a scalability and isolation mechanism, that allow
> > + * separating the early driver network XDP layer, from the rest of the
> > + * netstack, and assigning dedicated CPUs for this stage.  This
> > + * basically allows for 10G wirespeed pre-filtering via bpf.
> > + */
> > +#include <linux/bpf.h>
> > +#include <linux/filter.h>
> > +#include <linux/ptr_ring.h>
> > +
> > +#include <linux/sched.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/kthread.h>
> > +
> > +/*
> > + * General idea: XDP packets getting XDP redirected to another CPU,
> > + * will maximum be stored/queued for one driver ->poll() call.  It is
> > + * guaranteed that setting flush bit and flush operation happen on
> > + * same CPU.  Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
> > + * which queue in bpf_cpu_map_entry contains packets.
> > + */
> > +
> > +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> > +struct xdp_bulk_queue {
> > +	void *q[CPU_MAP_BULK_SIZE];
> > +	unsigned int count;
> > +};
> > +
> > +/* Struct for every remote "destination" CPU in map */
> > +struct bpf_cpu_map_entry {
> > +	u32 cpu;    /* kthread CPU and map index */
> > +	int map_id; /* Back reference to map */
> > +	u32 qsize;  /* Redundant queue size for map lookup */
> > +
> > +	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> > +	struct xdp_bulk_queue __percpu *bulkq;
> > +
> > +	/* Queue with potential multi-producers, and single-consumer kthread */
> > +	struct ptr_ring *queue;
> > +	struct task_struct *kthread;
> > +	struct work_struct kthread_stop_wq;
> > +
> > +	atomic_t refcnt; /* Control when this struct can be free'ed */
> > +	struct rcu_head rcu;
> > +};
> > +
> > +struct bpf_cpu_map {
> > +	struct bpf_map map;
> > +	/* Below members specific for map type */
> > +	struct bpf_cpu_map_entry **cpu_map;
> > +	unsigned long __percpu *flush_needed;
> > +};
> > +
> > +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> > +			     struct xdp_bulk_queue *bq);
> > +
> > +static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
> > +{
> > +	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
> > +}
> > +
> > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_cpu_map *cmap;
> > +	u64 cost;
> > +	int err;
> > +
> > +	/* check sanity of attributes */
> > +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> > +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> > +	if (!cmap)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* mandatory map attributes */
> > +	cmap->map.map_type = attr->map_type;
> > +	cmap->map.key_size = attr->key_size;
> > +	cmap->map.value_size = attr->value_size;
> > +	cmap->map.max_entries = attr->max_entries;
> > +	cmap->map.map_flags = attr->map_flags;
> > +	cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> > +
> > +	/* make sure page count doesn't overflow */
> > +	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> > +	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> > +	if (cost >= U32_MAX - PAGE_SIZE)
> > +		goto free_cmap;
> > +	cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	/* if map size is larger than memlock limit, reject it early */
> > +	err = bpf_map_precharge_memlock(cmap->map.pages);
> > +	if (err)
> > +		goto free_cmap;
> > +
> > +	/* A per cpu bitfield with a bit per possible CPU in map  */
> > +	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> > +					    __alignof__(unsigned long));
> > +	if (!cmap->flush_needed)
> > +		goto free_cmap;
> > +
> > +	/* Alloc array for possible remote "destination" CPUs */
> > +	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> > +					   sizeof(struct bpf_cpu_map_entry *),
> > +					   cmap->map.numa_node);
> > +	if (!cmap->cpu_map)
> > +		goto free_cmap;
> > +
> > +	return &cmap->map;
> > +free_cmap:
> > +	free_percpu(cmap->flush_needed);
> > +	kfree(cmap);
> > +	return ERR_PTR(-ENOMEM);
> > +}
> > +
> > +void __cpu_map_queue_destructor(void *ptr)
> > +{
> > +	/* For now, just catch this as an error */
> > +	if (!ptr)
> > +		return;
> > +	pr_err("ERROR: %s() cpu_map queue was not empty\n", __func__);
> > +	page_frag_free(ptr);
> > +}
> > +
> > +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> > +		/* The queue should be empty at this point */
> > +		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> > +		kfree(rcpu->queue);
> > +		kfree(rcpu);
> > +	}
> > +}
> > +
> > +static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	atomic_inc(&rcpu->refcnt);
> > +}
> > +
> > +/* called from workqueue, to workaround syscall using preempt_disable */
> > +static void cpu_map_kthread_stop(struct work_struct *work)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu;
> > +
> > +	rcpu = container_of(work, struct bpf_cpu_map_entry, kthread_stop_wq);
> > +	synchronize_rcu(); /* wait for flush in __cpu_map_entry_free() */
> > +	kthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */
> > +}
> > +
> > +static int cpu_map_kthread_run(void *data)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu = data;
> > +
> > +	set_current_state(TASK_INTERRUPTIBLE);
> > +	while (!kthread_should_stop()) {
> > +		struct xdp_pkt *xdp_pkt;
> > +
> > +		schedule();
> > +		/* Do work */
> > +		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > +			/* For now just "refcnt-free" */
> > +			page_frag_free(xdp_pkt);
> > +		}
> > +		__set_current_state(TASK_INTERRUPTIBLE);
> > +	}
> > +	put_cpu_map_entry(rcpu);
> > +
> > +	__set_current_state(TASK_RUNNING);
> > +	return 0;
> > +}
> > +
> > +struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id)
> > +{
> > +	gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
> > +	struct bpf_cpu_map_entry *rcpu;
> > +	int numa, err;
> > +
> > +	/* Have map->numa_node, but choose node of redirect target CPU */
> > +	numa = cpu_to_node(cpu);
> > +
> > +	rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
> > +	if (!rcpu)
> > +		return NULL;
> > +
> > +	/* Alloc percpu bulkq */
> > +	rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
> > +					 sizeof(void *), gfp);
> > +	if (!rcpu->bulkq)
> > +		goto fail;
> > +
> > +	/* Alloc queue */
> > +	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
> > +	if (!rcpu->queue)
> > +		goto fail;
> > +
> > +	err = ptr_ring_init(rcpu->queue, qsize, gfp);
> > +	if (err)
> > +		goto fail;
> > +	rcpu->qsize = qsize;
> > +
> > +	/* Setup kthread */
> > +	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
> > +					       "cpumap/%d/map:%d", cpu, map_id);
> > +	if (IS_ERR(rcpu->kthread))
> > +		goto fail;
> > +
> > +	/* Make sure kthread runs on a single CPU */
> > +	kthread_bind(rcpu->kthread, cpu);  
> 
> is there a check that max_entries <= num_possible_cpu ? I couldn't
> find it. otherwise it will be binding to impossible cpu?

Good point! -- I'll find an appropriate place to add such a limit.


> > +	wake_up_process(rcpu->kthread);  
> 
> In general the whole thing looks like 'threaded NAPI' that Hannes was
> proposing some time back. I liked it back then and I like it now.
> I don't remember what were the objections back then.
> Something scheduler related?
> Adding Hannes.

It is related to the threaded NAPI' idea[1], and I did choose kthreads
because this was used by this patch[1].
(Link to Hannes & Paolo's patch:[1] http://patchwork.ozlabs.org/patch/620657/)

It's less-intrusive, as it's only activated specifically when activating
bpf+XDP+cpumap.  Plus, it's not taking over the calling of napi->poll,
it is "just" making to "cost" of calling napi->poll significantly
smaller, as it moves invoking the network stack to another kthread. And
the choice is done on a per packet level (you don't get more
flexibility than that).

> Still curious about the questions I asked in the other thread
> on what's causing it to be so much better than RPS

Answered in that thread.  It is simply that the RPS-RX CPU have to do
too much work (like memory allocations).  Plus it uses more expensive
IPI calls, where I use wake_up_process() which doesn't do a IPI if it
can see that the remote thread is already running.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next 0/5] bpf: Extend bpf_{prog,map}_info
From: David Miller @ 2017-09-29  5:17 UTC (permalink / raw)
  To: kafai; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20170927213756.1254938-1-kafai@fb.com>

From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 27 Sep 2017 14:37:51 -0700

> This patch series adds more fields to bpf_prog_info and bpf_map_info.
> Please see individual patch for details.

Great to see progress in the area of eBPF introspection.

Series applied, thanks.

^ permalink raw reply

* [PATCH net v1 1/1] tipc: use only positive error codes in messages
From: Parthasarathy Bhuvaragan @ 2017-09-29  8:02 UTC (permalink / raw)
  To: davem
  Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue,
	parthasarathy.bhuvaragan

In commit e3a77561e7d32 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, its inserted
into the 4 bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

In this commit, we set only positive error code.

Fixes: e3a77561e7d32 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
---
 net/tipc/msg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 6ef379f004ac..121e59a1d0e7 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -551,7 +551,7 @@ bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err)
 		return false;
 	if (msg_errcode(msg))
 		return false;
-	*err = -TIPC_ERR_NO_NAME;
+	*err = TIPC_ERR_NO_NAME;
 	if (skb_linearize(skb))
 		return false;
 	msg = buf_msg(skb);
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v4 net-next 0/8] flow_dissector: Protocol specific flow dissector offload
From: Hannes Frederic Sowa @ 2017-09-29  7:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, rohit
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> This patch set adds a new offload type to perform flow dissection for
> specific protocols (either by EtherType or by IP protocol). This is
> primary useful to crack open UDP encapsulations (like VXLAN, GUE) for
> the purposes of parsing the encapsulated packet.
>
> Items in this patch set:
> - Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
>   on the code in the GRE dissect function and the special handling in
>   GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
>   goto proto_again is done)
> - Add infrastructure for protocol specific flow dissection offload
> - Add infrastructure to perform UDP flow dissection. Uses same model of
>   GRO where a flow_dissect callback can be associated with a UDP
>   socket
> - Use the infrastructure to support flow dissection of VXLAN and GUE
>
> Tested:
>
> Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
> that inner packet was being properly dissected.

I have the feeling that this patch series changes the behavior of flower
and thus causes uAPI problems.

flower seems to use the flow dissector results for parsing the inner
packets. In case of vxlan in vxlan encapsulation, which seems to become
more common (sigh!) you let part of the flow specification match on the
most inner header, while the flower ingress filter might want to match
inside the first encapsulation only.

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Hannes Frederic Sowa @ 2017-09-29  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, netdev, jakub.kicinski,
	Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Andy Gospodarek, pabeni,
	edumazet
In-Reply-To: <20170929032146.vs5v454wjs4niu4k@ast-mbp>

[adding Paolo, Eric]

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:

[...]

>> +	wake_up_process(rcpu->kthread);
>
> In general the whole thing looks like 'threaded NAPI' that Hannes was
> proposing some time back. I liked it back then and I like it now.
> I don't remember what were the objections back then.
> Something scheduler related?
> Adding Hannes.

Yes.

The main objection from Eric at that time was that user space now starts
to compete with the threaded NAPI threads depending on process
priorities, which are under control of user space. Softirq always runs
first to end. Networking could starve because a process with higher
priority is runnable. At that time Eric found a way to fix the
particular problem, which resulted in commit 4cd13c21b207e80d. Pinning
and other control is also possible from user space, causing more complex
tuning set ups and problems will be harder to debug.

In particular after Eric's patch threaded NAPI proofed itself to be not
useful anymore, because his patch successfully deferred work to the
ksoftirqd more reliable thus allowing the UDP rx queue to get drained by
user space.

> Still curious about the questions I asked in the other thread
> on what's causing it to be so much better than RPS

My guess is that RPS uses expensive IPI to notify the remote
softirq. The batching size on RPS depends on how many packets could get
worked on during one softirq invocation on the source CPU until we wake
up remote CPU(s!), if they are not constantly running.

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jesper Dangaard Brouer @ 2017-09-29  7:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <59CD83DD.4060603@iogearbox.net>

On Fri, 29 Sep 2017 01:21:01 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> [...]
> > +/* Convert xdp_buff to xdp_pkt */
> > +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> > +{
> > +	struct xdp_pkt *xdp_pkt;
> > +	int headroom;
> > +
> > +	/* Assure headroom is available for storing info */
> > +	headroom = xdp->data - xdp->data_hard_start;
> > +	if (headroom < sizeof(*xdp_pkt))
> > +		return NULL;
> > +
> > +	/* Store info in top of packet */
> > +	xdp_pkt = xdp->data_hard_start;  
> 
> (You'd also need to handle data_meta here if set, and for below
> cpu_map_build_skb(), e.g. headroom is data_meta-data_hard_start.)

I'll look into this.  The data_meta patchset was in-flight while I
rebased this.

> > +	xdp_pkt->data = xdp->data;
> > +	xdp_pkt->len  = xdp->data_end - xdp->data;
> > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> > +
> > +	return xdp_pkt;
> > +}
> > +
> > +static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> > +					 struct xdp_pkt *xdp_pkt)
> > +{
> > +	unsigned int frame_size;
> > +	void *pkt_data_start;
> > +	struct sk_buff *skb;
> > +
> > +	/* build_skb need to place skb_shared_info after SKB end, and
> > +	 * also want to know the memory "truesize".  Thus, need to  
> [...]
> >   static int cpu_map_kthread_run(void *data)
> >   {
> > +	const unsigned long busy_poll_jiffies = usecs_to_jiffies(2000);
> > +	unsigned long time_limit = jiffies + busy_poll_jiffies;
> >   	struct bpf_cpu_map_entry *rcpu = data;
> > +	unsigned int empty_cnt = 0;
> >
> >   	set_current_state(TASK_INTERRUPTIBLE);
> >   	while (!kthread_should_stop()) {
> > +		unsigned int processed = 0, drops = 0;
> >   		struct xdp_pkt *xdp_pkt;
> >
> > -		schedule();
> > -		/* Do work */
> > -		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > -			/* For now just "refcnt-free" */
> > -			page_frag_free(xdp_pkt);
> > +		/* Release CPU reschedule checks */
> > +		if ((time_after_eq(jiffies, time_limit) || empty_cnt > 25) &&
> > +		    __ptr_ring_empty(rcpu->queue)) {
> > +			empty_cnt++;
> > +			schedule();
> > +			time_limit = jiffies + busy_poll_jiffies;
> > +			WARN_ON(smp_processor_id() != rcpu->cpu);
> > +		} else {
> > +			cond_resched();
> >   		}
> > +
> > +		/* Process packets in rcpu->queue */
> > +		local_bh_disable();
> > +		/*
> > +		 * The bpf_cpu_map_entry is single consumer, with this
> > +		 * kthread CPU pinned. Lockless access to ptr_ring
> > +		 * consume side valid as no-resize allowed of queue.
> > +		 */
> > +		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
> > +			struct sk_buff *skb;
> > +			int ret;
> > +
> > +			/* Allow busy polling again */
> > +			empty_cnt = 0;
> > +
> > +			skb = cpu_map_build_skb(rcpu, xdp_pkt);
> > +			if (!skb) {
> > +				page_frag_free(xdp_pkt);
> > +				continue;
> > +			}
> > +
> > +			/* Inject into network stack */
> > +			ret = netif_receive_skb(skb);  
> 
> Have you looked into whether it's feasible to reuse GRO
> engine here as well?

This is the first step. I'll work on adding the GRO-engine later. And
it should be feasible.  There are plenty of optimizations in this area
that can do done later ;-)

> 
> > +			if (ret == NET_RX_DROP)
> > +				drops++;
> > +
> > +			/* Limit BH-disable period */
> > +			if (++processed == 8)
> > +				break;
> > +		}
> > +		local_bh_enable();
> > +
> >   		__set_current_state(TASK_INTERRUPTIBLE);
> >   	}
> >   	put_cpu_map_entry(rcpu);  
> [...]



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:27 UTC (permalink / raw)
  To: Yang, Yi
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <20170929071553.GA19053-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

> > The optimization Yi refers to only affects the slow path translation.
> >
> > OVS 2.8 does not immediately trigger an immediate recirculation after translating
> > encap(nsh,...). There is no need to do so as the flow key of the resulting packet
> > can be determined from the encap() action and its properties. Translation
> > continues with the rewritten flow key and subsequent OpenFlow actions will
> > typically set the new fields in the new NSH header. The push_nsh datapath action
> > (including all NSH header fields) is only generated at the next commit, e.g. for
> > output, cloning, recirculation, encap/decap or another destructive change of
> > the flow key.
> >
> > The implementation of push_nsh in the user-space datapath does not update
> > the miniflow (key) of the packet, only the packet data and some metadata.
> > If the packet needs to be looked up again the slow path triggers recirculation
> > to re-parse the packet. There should be no need for the datapath push_nsh
> > action to try to update the flow key.
> 
> Thanks Jan for clarification, it can still work after removing that
> line, our flows didn't match it after push_nsh, it is output to
> VxLAN-gpe port after push_nsh, I'm not sure if we can match dl_type and NSH
> fields if we don't output and don't recirculate.

No worries, a packet cannot be matched again in the datapath unless it is 
recirculated. And recirculation today always implies re-parsing. 

In the future we want to look into possibilities to optimize performance of 
recirculation, for example by skipping the parsing stage if it is unnecessary.
For that we may need to invalidate the flow key in packet metadata when
the packet is modified without corresponding update of the key itself. But that
is music of the future.

/Jan

^ permalink raw reply

* (unknown), 
From: kelley @ 2017-09-29  7:26 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: 40098069241.zip --]
[-- Type: application/zip, Size: 7206 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-29  7:15 UTC (permalink / raw)
  To: Jan Scheurich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <CFF8EF42F1132E4CBE2BF0AB6C21C58D7881A337-hqolJogE5njKJFWPz4pdheaU1rCVNFv4@public.gmane.org>

On Fri, Sep 29, 2017 at 07:10:52AM +0000, Jan Scheurich wrote:
> > From: Yang, Yi [mailto:yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
> > Sent: Friday, 29 September, 2017 08:41
> > To: Pravin Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org>
> > Cc: Jiri Benc <jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org; e@erig.me; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org; Jan Scheurich
> > <jan.scheurich-IzeFyvvaP7pWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> > 
> > On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > > >> > in pop_nsh.
> > > >>
> > > >> This seems to be a very different approach than what we currently have.
> > > >> Looking at the code, the requirement after "destructive" actions such
> > > >> as pushing or popping headers is to recirculate.
> > > >
> > > > This is optimization proposed by Jan Scheurich, recurculating after push_nsh
> > > > will impact on performance, recurculating after pop_nsh is unavoidable, So
> > > > also cc jan.scheurich-IzeFyvvaP7oU04JRNCRQjg@public.gmane.org
> > > >
> > > > Actucally all the keys before push_nsh are still there after push_nsh,
> > > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > > >
> > >
> > >
> > > We should keep existing model for this patch. Later you can submit
> > > optimization patch with specific use cases and performance
> > > improvement. So that we can evaluate code complexity and benefits.
> > 
> > Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> > 
> > 	key->eth.type = htons(ETH_P_NSH);
> 
> The optimization Yi refers to only affects the slow path translation. 
> 
> OVS 2.8 does not immediately trigger an immediate recirculation after translating 
> encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
> can be determined from the encap() action and its properties. Translation 
> continues with the rewritten flow key and subsequent OpenFlow actions will 
> typically set the new fields in the new NSH header. The push_nsh datapath action 
> (including all NSH header fields) is only generated at the next commit, e.g. for 
> output, cloning, recirculation, encap/decap or another destructive change of 
> the flow key.
> 
> The implementation of push_nsh in the user-space datapath does not update
> the miniflow (key) of the packet, only the packet data and some metadata. 
> If the packet needs to be looked up again the slow path triggers recirculation
> to re-parse the packet. There should be no need for the datapath push_nsh 
> action to try to update the flow key.

Thanks Jan for clarification, it can still work after removing that
line, our flows didn't match it after push_nsh, it is output to
VxLAN-gpe port after push_nsh, I'm not sure if we can match dl_type and NSH
fields if we don't output and don't recirculate.

> 
> BR, Jan

^ permalink raw reply

* RE: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:10 UTC (permalink / raw)
  To: Yang, Yi, Pravin Shelar
  Cc: Jiri Benc, netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net
In-Reply-To: <20170929064058.GA16145@localhost.localdomain>

> From: Yang, Yi [mailto:yi.y.yang@intel.com]
> Sent: Friday, 29 September, 2017 08:41
> To: Pravin Shelar <pshelar@ovn.org>
> Cc: Jiri Benc <jbenc@redhat.com>; netdev@vger.kernel.org; dev@openvswitch.org; e@erig.me; davem@davemloft.net; Jan Scheurich
> <jan.scheurich@ericsson.com>
> Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> 
> On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang@intel.com> wrote:
> > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > >> > in pop_nsh.
> > >>
> > >> This seems to be a very different approach than what we currently have.
> > >> Looking at the code, the requirement after "destructive" actions such
> > >> as pushing or popping headers is to recirculate.
> > >
> > > This is optimization proposed by Jan Scheurich, recurculating after push_nsh
> > > will impact on performance, recurculating after pop_nsh is unavoidable, So
> > > also cc jan.scheurich@ericsson.com.
> > >
> > > Actucally all the keys before push_nsh are still there after push_nsh,
> > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > >
> >
> >
> > We should keep existing model for this patch. Later you can submit
> > optimization patch with specific use cases and performance
> > improvement. So that we can evaluate code complexity and benefits.
> 
> Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> 
> 	key->eth.type = htons(ETH_P_NSH);

The optimization Yi refers to only affects the slow path translation. 

OVS 2.8 does not immediately trigger an immediate recirculation after translating 
encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
can be determined from the encap() action and its properties. Translation 
continues with the rewritten flow key and subsequent OpenFlow actions will 
typically set the new fields in the new NSH header. The push_nsh datapath action 
(including all NSH header fields) is only generated at the next commit, e.g. for 
output, cloning, recirculation, encap/decap or another destructive change of 
the flow key.

The implementation of push_nsh in the user-space datapath does not update
the miniflow (key) of the packet, only the packet data and some metadata. 
If the packet needs to be looked up again the slow path triggers recirculation
to re-parse the packet. There should be no need for the datapath push_nsh 
action to try to update the flow key.

BR, Jan

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Jesper Dangaard Brouer @ 2017-09-29  7:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Daniel Borkmann, peter.waskiewicz.jr,
	jakub.kicinski, netdev, Andy Gospodarek, brouer
In-Reply-To: <20170927173233.tuqlutz6t2gwdk53@ast-mbp>

On Wed, 27 Sep 2017 10:32:36 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 27 Sep 2017 06:35:40 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:  
> > > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >     
> > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > > >> [...]    
> > > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > > >>> actually works extremely well).      
> > > >>
> > > >> Meaning you let all the XDP_PASS packets get processed on a
> > > >> different CPU, so you can reserve the whole CPU just for
> > > >> prefiltering, right?     
> > > > 
> > > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > > The trick is using the map-flush point, to transfer packets in bulk to
> > > > the remote CPU (single call IPC is too slow), but at the same time
> > > > flush single packets if NAPI didn't see a bulk.
> > > >     
> > > >> Do you have some numbers to share at this point, just curious when
> > > >> you mention it works extremely well.    
> > > > 
> > > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > > 
> > > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > > 
> > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > > the 11.9Mpps DDoS attack is hitting.
> > > > 
> > > > Running XDP/eBPF prog_num:4
> > > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > > XDP-RX          0       12,030,471     11,966,982  0          
> > > > XDP-RX          total   12,030,471     11,966,982 
> > > > cpumap-enqueue    0:2   63,488         0           0          
> > > > cpumap-enqueue  sum:2   63,488         0           0          
> > > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > > cpumap_kthread  total   63,488         0           0          
> > > > redirect_err    total   0              0          
> > > > 
> > > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > > Local /Remote
> > > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > > Send   Recv   Size     Size    Time     Rate         
> > > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > > 
> > > > 16384  87380  1024     1024    10.00    12735.97   
> > > > 16384  87380 
> > > > 
> > > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > > >     
> > > 
> > > Just curious could you also try this with RPS enabled (or does this have
> > > RPS enabled). RPS should effectively do the same thing but higher in the
> > > stack. I'm curious what the delta would be. Might be another interesting
> > > case and fairly easy to setup if you already have the above scripts.  
> > 
> > Yes, I'm essentially competing with RSP, thus such a comparison is very
> > relevant...
> > 
> > This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let
> > other 4 CPUS process packet.
> > 
> > Summary of RPS (Receive Packet Steering) performance:
> >  * End result is 6.3 Mpps max performance
> >  * netperf TCP_CRR is 1 trans/sec.
> >  * Each RX-RPS CPU stall at ~3.2Mpps.
> > 
> > The full test report below with setup:
> > 
> > The mask needed::
> > 
> >  perl -e 'printf "%b\n",0x3C'
> >  111100
> > 
> > RPS setup::
> > 
> >  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> > 
> >  for N in $(seq 0 5) ; do \
> >    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
> >    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
> >    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
> >  done
> > 
> > Reduce RX queues to two ::
> > 
> >  ethtool -L ixgbe1 combined 2
> > 
> > IRQ align to CPU numbers::
> > 
> >  $ ~/setup01.sh
> >  Not root, running with sudo
> >   --- Disable Ethernet flow-control ---
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >   --- Align IRQs ---
> >  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
> >  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
> >  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> > 
> > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> > 
> > Generator is sending: 12,715,782 tx_packets /sec
> > 
> >  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
> >     -d 172.16.0.2 -t8
> > 
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives                    6346544            0.0
> > IpInDelivers                    6346544            0.0
> > IpOutRequests                   1020               0.0
> > IcmpOutMsgs                     1020               0.0
> > IcmpOutDestUnreachs             1020               0.0
> > IcmpMsgOutType3                 1020               0.0
> > UdpNoPorts                      6346898            0.0
> > IpExtInOctets                   291964714          0.0
> > IpExtOutOctets                  73440              0.0
> > IpExtInNoECTPkts                6347063            0.0
> > 
> > $ mpstat -P ALL -u -I SCPU -I SUM
> > 
> > Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> > Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> > Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> > Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> > Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> > Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> > 
> > Average:     CPU    intr/s
> > Average:     all 152067.22
> > Average:       0  50064.73
> > Average:       1  50089.35
> > Average:       2  45095.17
> > Average:       3  44875.04
> > Average:       4  44906.32
> > Average:       5  45152.08
> > 
> > Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> > Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> > Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> > Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> > Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> > Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> > Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> > 
> > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> > 
> > Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> > Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> > Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> > Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> > Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> > Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> > Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> > Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> > Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> > Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> > Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> > Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> > 
> > Notice, each RX-CPU can only process 3.1Mpps.
> > 
> > RPS RX-CPU(0):
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  .......................................
> >  #
> >     11.72%  000  [k] ixgbe_poll
> >     11.29%  000  [k] _raw_spin_lock
> >     10.35%  000  [k] dev_gro_receive
> >      8.36%  000  [k] __build_skb
> >      7.35%  000  [k] __skb_get_hash
> >      6.22%  000  [k] enqueue_to_backlog
> >      5.89%  000  [k] __skb_flow_dissect
> >      4.43%  000  [k] inet_gro_receive
> >      4.19%  000  [k] ___slab_alloc
> >      3.90%  000  [k] queued_spin_lock_slowpath
> >      3.85%  000  [k] kmem_cache_alloc
> >      3.06%  000  [k] build_skb
> >      2.66%  000  [k] get_rps_cpu
> >      2.57%  000  [k] napi_gro_receive
> >      2.34%  000  [k] eth_type_trans
> >      1.81%  000  [k] __cmpxchg_double_slab.isra.61
> >      1.47%  000  [k] ixgbe_alloc_rx_buffers
> >      1.43%  000  [k] get_partial_node.isra.81
> >      0.84%  000  [k] swiotlb_sync_single
> >      0.74%  000  [k] udp4_gro_receive
> >      0.73%  000  [k] netif_receive_skb_internal
> >      0.72%  000  [k] udp_gro_receive
> >      0.63%  000  [k] skb_gro_reset_offset
> >      0.49%  000  [k] __skb_flow_get_ports
> >      0.48%  000  [k] llist_add_batch
> >      0.36%  000  [k] swiotlb_sync_single_for_cpu
> >      0.34%  000  [k] __slab_alloc
> > 
> > 
> > Remote RPS-CPU(3) getting packets::
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  ..............................................
> >  #
> >     33.02%  003  [k] poll_idle
> >     10.99%  003  [k] __netif_receive_skb_core
> >     10.45%  003  [k] page_frag_free
> >      8.49%  003  [k] ip_rcv
> >      4.19%  003  [k] fib_table_lookup
> >      2.84%  003  [k] __udp4_lib_rcv
> >      2.81%  003  [k] __slab_free

Notice slow-path of SLUB

> >      2.23%  003  [k] __udp4_lib_lookup
> >      2.09%  003  [k] ip_route_input_rcu
> >      2.07%  003  [k] kmem_cache_free
> >      2.06%  003  [k] udp_v4_early_demux
> >      1.73%  003  [k] ip_rcv_finish  
> 
> Very interesting data.

You removed some of the more interesting part of the perf-report, that
showed us hitting more of the SLUB slowpath for SKBs.  The slowpath
consist of many separate function calls, thus it doesn't bubble to the
top (the FlameGraph tool shows them easier).

> So above perf report compares to xdp-redirect-cpu this one:
> Perf top on a CPU(3) that have to alloc and free SKBs etc.
> 
> # Overhead  CPU  Symbol
> # ........  ...  .......................................
> #
>     15.51%  003  [k] fib_table_lookup
>      8.91%  003  [k] cpu_map_kthread_run
>      8.04%  003  [k] build_skb
>      7.88%  003  [k] page_frag_free
>      5.13%  003  [k] kmem_cache_alloc
>      4.76%  003  [k] ip_route_input_rcu
>      4.59%  003  [k] kmem_cache_free
>      4.02%  003  [k] __udp4_lib_rcv
>      3.20%  003  [k] fib_validate_source
>      3.02%  003  [k] __netif_receive_skb_core
>      3.02%  003  [k] udp_v4_early_demux
>      2.90%  003  [k] ip_rcv
>      2.80%  003  [k] ip_rcv_finish
> 
> right?
> and in RPS case the consumer cpu is 33% idle whereas in redirect-cpu
> you can load it up all the way.
> Am I interpreting all this correctly that with RPS cpu0 cannot
> distributed the packets to other cpus fast enough and that's
> a bottleneck?

Yes, exactly. The work needed on the RPS cpu0 is simply too much.

> whereas in redirect-cpu you're doing early packet distribution
> before skb alloc?

Yes, the main point to reducing the CPU cycles spend on the packet for
doing early packet distribution.

> So in other words with redirect-cpu all consumer cpus are doing
> skb alloc and in RPS cpu0 is allocating skbs for all ?

Yes.

> and that's where 6M->12M performance gain comes from?

Yes, basically.  There are many small thing that help this along.  Like
cpumap case always hitting the SLUB fastpath.  Another big thing is
bulking. It is sort of hidden, but the XDP_REDIRECT flush mechanism is
implementing the RX bulking (I've been "screaming" about for the last
couple of years! ;-))

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH net-next v11] openvswitch: enable NSH support
From: Yi Yang @ 2017-09-29  7:03 UTC (permalink / raw)
  To: netdev; +Cc: dev, jbenc, e, davem, pshelar, Yi Yang

v10->v11
 - Fix the left three disputable comments for v9
   but not fixed in v10.

v9->v10
 - Change struct ovs_key_nsh to
       struct ovs_nsh_key_base base;
       __be32 context[NSH_MD1_CONTEXT_SIZE];
 - Fix new comments for v9

v8->v9
 - Fix build error reported by daily intel build
   because nsh module isn't selected by openvswitch

v7->v8
 - Rework nested value and mask for OVS_KEY_ATTR_NSH
 - Change pop_nsh to adapt to nsh kernel module
 - Fix many issues per comments from Jiri Benc

v6->v7
 - Remove NSH GSO patches in v6 because Jiri Benc
   reworked it as another patch series and they have
   been merged.
 - Change it to adapt to nsh kernel module added by NSH
   GSO patch series

v5->v6
 - Fix the rest comments for v4.
 - Add NSH GSO support for VxLAN-gpe + NSH and
   Eth + NSH.

v4->v5
 - Fix many comments by Jiri Benc and Eric Garver
   for v4.

v3->v4
 - Add new NSH match field ttl
 - Update NSH header to the latest format
   which will be final format and won't change
   per its author's confirmation.
 - Fix comments for v3.

v2->v3
 - Change OVS_KEY_ATTR_NSH to nested key to handle
   length-fixed attributes and length-variable
   attriubte more flexibly.
 - Remove struct ovs_action_push_nsh completely
 - Add code to handle nested attribute for SET_MASKED
 - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
   to transfer NSH header data.
 - Fix comments and coding style issues by Jiri and Eric

v1->v2
 - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
 - Dynamically allocate struct ovs_action_push_nsh for
   length-variable metadata.

OVS master and 2.8 branch has merged NSH userspace
patch series, this patch is to enable NSH support
in kernel data path in order that OVS can support
NSH in compat mode by porting this.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
---
 include/net/nsh.h                |   3 +
 include/uapi/linux/openvswitch.h |  29 ++++
 net/nsh/nsh.c                    |  59 ++++++++
 net/openvswitch/Kconfig          |   1 +
 net/openvswitch/actions.c        | 109 ++++++++++++++
 net/openvswitch/flow.c           |  51 +++++++
 net/openvswitch/flow.h           |   7 +
 net/openvswitch/flow_netlink.c   | 317 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/flow_netlink.h   |   5 +
 9 files changed, 580 insertions(+), 1 deletion(-)

diff --git a/include/net/nsh.h b/include/net/nsh.h
index a1eaea2..c03b089 100644
--- a/include/net/nsh.h
+++ b/include/net/nsh.h
@@ -304,4 +304,7 @@ static inline void nsh_set_flags_ttl_len(struct nshhdr *nsh, u8 flags,
 			NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK);
 }
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *nh);
+int skb_pop_nsh(struct sk_buff *skb);
+
 #endif /* __NET_NSH_H */
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 156ee4c..c1a785c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -333,6 +333,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking label */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4,   /* struct ovs_key_ct_tuple_ipv4 */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6,   /* struct ovs_key_ct_tuple_ipv6 */
+	OVS_KEY_ATTR_NSH,       /* Nested set of ovs_nsh_key_* */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
@@ -491,6 +492,30 @@ struct ovs_key_ct_tuple_ipv6 {
 	__u8   ipv6_proto;
 };
 
+enum ovs_nsh_key_attr {
+	OVS_NSH_KEY_ATTR_UNSPEC,
+	OVS_NSH_KEY_ATTR_BASE,  /* struct ovs_nsh_key_base. */
+	OVS_NSH_KEY_ATTR_MD1,   /* struct ovs_nsh_key_md1. */
+	OVS_NSH_KEY_ATTR_MD2,   /* variable-length octets for MD type 2. */
+	__OVS_NSH_KEY_ATTR_MAX
+};
+
+#define OVS_NSH_KEY_ATTR_MAX (__OVS_NSH_KEY_ATTR_MAX - 1)
+
+struct ovs_nsh_key_base {
+	__u8 flags;
+	__u8 ttl;
+	__u8 mdtype;
+	__u8 np;
+	__be32 path_hdr;
+};
+
+#define NSH_MD1_CONTEXT_SIZE 4
+
+struct ovs_nsh_key_md1 {
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -806,6 +831,8 @@ struct ovs_action_push_eth {
  * packet.
  * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
  * packet.
+ * @OVS_ACTION_ATTR_PUSH_NSH: push NSH header to the packet.
+ * @OVS_ACTION_ATTR_POP_NSH: pop the outermost NSH header off the packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -835,6 +862,8 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TRUNC,        /* u32 struct ovs_action_trunc. */
 	OVS_ACTION_ATTR_PUSH_ETH,     /* struct ovs_action_push_eth. */
 	OVS_ACTION_ATTR_POP_ETH,      /* No argument. */
+	OVS_ACTION_ATTR_PUSH_NSH,     /* Nested OVS_NSH_KEY_ATTR_*. */
+	OVS_ACTION_ATTR_POP_NSH,      /* No argument. */
 
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
index 58fb827..5e4f937 100644
--- a/net/nsh/nsh.c
+++ b/net/nsh/nsh.c
@@ -14,6 +14,65 @@
 #include <net/nsh.h>
 #include <net/tun_proto.h>
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
+{
+	struct nshhdr *nsh_hdr;
+	size_t length = nsh_hdr_len(src_nsh_hdr);
+	u8 next_proto;
+
+	if (skb->mac_len) {
+		next_proto = TUN_P_ETHERNET;
+	} else {
+		next_proto = tun_p_from_eth_p(skb->protocol);
+		if (!next_proto)
+			return -EAFNOSUPPORT;
+	}
+
+	/* Add the NSH header */
+	if (skb_cow_head(skb, length) < 0)
+		return -ENOMEM;
+
+	skb_push(skb, length);
+	nsh_hdr = (struct nshhdr *)(skb->data);
+	memcpy(nsh_hdr, src_nsh_hdr, length);
+	nsh_hdr->np = next_proto;
+
+	skb->protocol = htons(ETH_P_NSH);
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_push_nsh);
+
+int skb_pop_nsh(struct sk_buff *skb)
+{
+	int err;
+	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
+	size_t length;
+	__be16 inner_proto;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				       sizeof(struct nshhdr));
+	if (unlikely(err))
+		return err;
+
+	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
+	if (!inner_proto)
+		return -EAFNOSUPPORT;
+
+	length = nsh_hdr_len(nsh_hdr);
+	skb_pull(skb, length);
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
+	skb->protocol = inner_proto;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_pop_nsh);
+
 static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
 				       netdev_features_t features)
 {
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index ce94729..2650205 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -14,6 +14,7 @@ config OPENVSWITCH
 	select MPLS
 	select NET_MPLS_GSO
 	select DST_CACHE
+	select NET_NSH
 	---help---
 	  Open vSwitch is a multilayer Ethernet switch targeted at virtualized
 	  environments.  In addition to supporting a variety of features
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a54a556..27b579b7 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -43,6 +43,7 @@
 #include "flow.h"
 #include "conntrack.h"
 #include "vport.h"
+#include "flow_netlink.h"
 
 struct deferred_action {
 	struct sk_buff *skb;
@@ -380,6 +381,43 @@ static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
+		    const struct nshhdr *nh)
+{
+	int err;
+
+	err = skb_push_nsh(skb, nh);
+	if (err)
+		return err;
+
+	/* safe right before invalidate_flow_key */
+	key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
+static int pop_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	int err;
+
+	if (ovs_key_mac_proto(key) != MAC_PROTO_NONE ||
+	    skb->protocol != htons(ETH_P_NSH)) {
+		return -EINVAL;
+	}
+
+	err = skb_pop_nsh(skb);
+	if (err)
+		return err;
+
+	/* safe right before invalidate_flow_key */
+	if (skb->protocol == htons(ETH_P_TEB))
+		key->mac_proto = MAC_PROTO_ETHERNET;
+	else
+		key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
 static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
 				  __be32 addr, __be32 new_addr)
 {
@@ -602,6 +640,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	return 0;
 }
 
+static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
+		   const struct nlattr *a)
+{
+	struct nshhdr *nh;
+	int err;
+	u8 flags;
+	u8 ttl;
+	int i;
+
+	struct ovs_key_nsh key;
+	struct ovs_key_nsh mask;
+
+	err = nsh_key_from_nlattr(a, &key, &mask);
+	if (err)
+		return err;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				       sizeof(struct nshhdr));
+	if (unlikely(err))
+		return err;
+
+	nh = nsh_hdr(skb);
+
+	flags = nsh_get_flags(nh);
+	flags = OVS_MASKED(flags, key.base.flags, mask.base.flags);
+	flow_key->nsh.base.flags = flags;
+	ttl = nsh_get_ttl(nh);
+	ttl = OVS_MASKED(ttl, key.base.ttl, mask.base.ttl);
+	flow_key->nsh.base.ttl = ttl;
+	nsh_set_flags_and_ttl(nh, flags, ttl);
+	nh->path_hdr = OVS_MASKED(nh->path_hdr, key.base.path_hdr,
+				  mask.base.path_hdr);
+	flow_key->nsh.base.path_hdr = nh->path_hdr;
+	switch (nh->mdtype) {
+	case NSH_M_TYPE1:
+		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
+			nh->md1.context[i] =
+			    OVS_MASKED(nh->md1.context[i], key.context[i],
+				       mask.context[i]);
+		}
+		memcpy(flow_key->nsh.context, nh->md1.context,
+		       sizeof(nh->md1.context));
+		break;
+	case NSH_M_TYPE2:
+		memset(flow_key->nsh.context, 0,
+		       sizeof(flow_key->nsh.context));
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /* Must follow skb_ensure_writable() since that can move the skb data. */
 static void set_tp_port(struct sk_buff *skb, __be16 *port,
 			__be16 new_port, __sum16 *check)
@@ -1024,6 +1115,10 @@ static int execute_masked_set_action(struct sk_buff *skb,
 				   get_mask(a, struct ovs_key_ethernet *));
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		err = set_nsh(skb, flow_key, a);
+		break;
+
 	case OVS_KEY_ATTR_IPV4:
 		err = set_ipv4(skb, flow_key, nla_data(a),
 			       get_mask(a, struct ovs_key_ipv4 *));
@@ -1210,6 +1305,20 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		case OVS_ACTION_ATTR_POP_ETH:
 			err = pop_eth(skb, key);
 			break;
+
+		case OVS_ACTION_ATTR_PUSH_NSH: {
+			u8 buffer[NSH_HDR_MAX_LEN];
+			struct nshhdr *nh = (struct nshhdr *)buffer;
+
+			nsh_hdr_from_nlattr(nla_data(a), nh,
+					    NSH_HDR_MAX_LEN);
+			err = push_nsh(skb, key, (const struct nshhdr *)nh);
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			err = pop_nsh(skb, key);
+			break;
 		}
 
 		if (unlikely(err)) {
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 8c94cef..5970805 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -46,6 +46,7 @@
 #include <net/ipv6.h>
 #include <net/mpls.h>
 #include <net/ndisc.h>
+#include <net/nsh.h>
 
 #include "conntrack.h"
 #include "datapath.h"
@@ -490,6 +491,52 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct nshhdr *nh;
+	unsigned int nh_ofs = skb_network_offset(skb);
+	u8 version, length;
+	int err;
+
+	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
+	if (unlikely(err))
+		return err;
+
+	nh = nsh_hdr(skb);
+	version = nsh_get_ver(nh);
+	length = nsh_hdr_len(nh);
+
+	if (version != 0)
+		return -EINVAL;
+
+	err = check_header(skb, nh_ofs + length);
+	if (unlikely(err))
+		return err;
+
+	nh = (struct nshhdr *)skb_network_header(skb);
+	key->nsh.base.flags = nsh_get_flags(nh);
+	key->nsh.base.ttl = nsh_get_ttl(nh);
+	key->nsh.base.mdtype = nh->mdtype;
+	key->nsh.base.np = nh->np;
+	key->nsh.base.path_hdr = nh->path_hdr;
+	switch (key->nsh.base.mdtype) {
+	case NSH_M_TYPE1:
+		if (length != NSH_M_TYPE1_LEN)
+			return -EINVAL;
+		memcpy(key->nsh.context, nh->md1.context,
+		       sizeof(nh->md1));
+		break;
+	case NSH_M_TYPE2:
+		memset(key->nsh.context, 0,
+		       sizeof(nh->md1));
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * key_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -735,6 +782,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
+	} else if (key->eth.type == htons(ETH_P_NSH)) {
+		error = parse_nsh(skb, key);
+		if (error)
+			return error;
 	}
 	return 0;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 1875bba..8eeae749 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -35,6 +35,7 @@
 #include <net/inet_ecn.h>
 #include <net/ip_tunnels.h>
 #include <net/dst_metadata.h>
+#include <net/nsh.h>
 
 struct sk_buff;
 
@@ -66,6 +67,11 @@ struct vlan_head {
 	(offsetof(struct sw_flow_key, recirc_id) +	\
 	FIELD_SIZEOF(struct sw_flow_key, recirc_id))
 
+struct ovs_key_nsh {
+	struct ovs_nsh_key_base base;
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 struct sw_flow_key {
 	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
 	u8 tun_opts_len;
@@ -144,6 +150,7 @@ struct sw_flow_key {
 			};
 		} ipv6;
 	};
+	struct ovs_key_nsh nsh;         /* network service header */
 	struct {
 		/* Connection tracking fields not packed above. */
 		struct {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index e8eb427..77613f4 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -48,6 +48,7 @@
 #include <net/ndisc.h>
 #include <net/mpls.h>
 #include <net/vxlan.h>
+#include <net/tun_proto.h>
 
 #include "flow_netlink.h"
 
@@ -78,9 +79,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
 		case OVS_ACTION_ATTR_HASH:
 		case OVS_ACTION_ATTR_POP_ETH:
 		case OVS_ACTION_ATTR_POP_MPLS:
+		case OVS_ACTION_ATTR_POP_NSH:
 		case OVS_ACTION_ATTR_POP_VLAN:
 		case OVS_ACTION_ATTR_PUSH_ETH:
 		case OVS_ACTION_ATTR_PUSH_MPLS:
+		case OVS_ACTION_ATTR_PUSH_NSH:
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 		case OVS_ACTION_ATTR_SAMPLE:
 		case OVS_ACTION_ATTR_SET:
@@ -322,12 +325,27 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
+size_t ovs_nsh_key_attr_size(void)
+{
+	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
+	 * updating this function.
+	 */
+	return  nla_total_size(NSH_BASE_HDR_LEN) /* OVS_NSH_KEY_ATTR_BASE */
+		/* OVS_NSH_KEY_ATTR_MD1 and OVS_NSH_KEY_ATTR_MD2 are
+		 * mutually exclusive, so the bigger one can cover
+		 * the small one.
+		 *
+		 * OVS_NSH_KEY_ATTR_MD2
+		 */
+		+ nla_total_size(NSH_CTX_HDRS_MAX_LEN);
+}
+
 size_t ovs_key_attr_size(void)
 {
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 29);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -341,6 +359,8 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
 		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
 		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
+		+ nla_total_size(0)   /* OVS_KEY_ATTR_NSH */
+		  + ovs_nsh_key_attr_size()
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -373,6 +393,13 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_IPV6_DST]      = { .len = sizeof(struct in6_addr) },
 };
 
+static const struct ovs_len_tbl
+ovs_nsh_key_attr_lens[OVS_NSH_KEY_ATTR_MAX + 1] = {
+	[OVS_NSH_KEY_ATTR_BASE] = { .len = sizeof(struct ovs_nsh_key_base) },
+	[OVS_NSH_KEY_ATTR_MD1]  = { .len = sizeof(struct ovs_nsh_key_md1) },
+	[OVS_NSH_KEY_ATTR_MD2]  = { .len = OVS_ATTR_VARIABLE },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
 static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
@@ -405,6 +432,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
 	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
+	[OVS_KEY_ATTR_NSH]       = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_nsh_key_attr_lens, },
 };
 
 static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
@@ -1179,6 +1208,221 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 	return 0;
 }
 
+int nsh_hdr_from_nlattr(const struct nlattr *attr,
+			struct nshhdr *nh, size_t size)
+{
+	struct nlattr *a;
+	int rem;
+	u8 flags = 0;
+	u8 ttl = 0;
+	int mdlen = 0;
+
+	/* validate_nsh has check this, so we needn't do duplicate check here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base = nla_data(a);
+
+			flags = base->flags;
+			ttl = base->ttl;
+			nh->np = base->np;
+			nh->mdtype = base->mdtype;
+			nh->path_hdr = base->path_hdr;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 = nla_data(a);
+
+			mdlen = nla_len(a);
+			memcpy(&nh->md1, md1, mdlen);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2: {
+			const struct u8 *md2 = nla_data(a);
+
+			mdlen = nla_len(a);
+			memcpy(&nh->md2, md2, mdlen);
+			break;
+		}
+		default:
+			return -EINVAL;
+		}
+	}
+
+	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
+	nh->ver_flags_ttl_len = 0;
+	nsh_set_flags_ttl_len(nh, flags, ttl, NSH_BASE_HDR_LEN + mdlen);
+
+	return 0;
+}
+
+int nsh_key_from_nlattr(const struct nlattr *attr,
+			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
+{
+	struct nlattr *a;
+	int rem;
+
+	/* validate_nsh has check this, so we needn't do duplicate check here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base = nla_data(a);
+			const struct ovs_nsh_key_base *base_mask = base + 1;
+
+			nsh->base = *base;
+			nsh_mask->base = *base_mask;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+			const struct ovs_nsh_key_md1 *md1_mask = md1 + 1;
+
+			memcpy(nsh->context, md1->context, sizeof(*md1));
+			memcpy(nsh_mask->context, md1_mask->context,
+			       sizeof(*md1_mask));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			/* Not supported yet */
+			return -ENOTSUPP;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nsh_key_put_from_nlattr(const struct nlattr *attr,
+				   struct sw_flow_match *match, bool is_mask,
+				   bool is_push_nsh, bool log)
+{
+	struct nlattr *a;
+	int rem;
+	bool has_base = false;
+	bool has_md1 = false;
+	bool has_md2 = false;
+	u8 mdtype = 0;
+	int mdlen = 0;
+
+	if (WARN_ON(is_push_nsh && is_mask))
+		return -EINVAL;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+		int i;
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(log, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    log,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+
+			has_base = true;
+			mdtype = base->mdtype;
+			SW_FLOW_KEY_PUT(match, nsh.base.flags,
+					base->flags, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.ttl,
+					base->ttl, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.mdtype,
+					base->mdtype, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.np,
+					base->np, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.path_hdr,
+					base->path_hdr, is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+
+			has_md1 = true;
+			for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++)
+				SW_FLOW_KEY_PUT(match, nsh.context[i],
+						md1->context[i], is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			if (!is_push_nsh) /* Not supported MD type 2 yet */
+				return -ENOTSUPP;
+
+			has_md2 = true;
+			mdlen = nla_len(a);
+			if (mdlen > NSH_CTX_HDRS_MAX_LEN || mdlen <= 0) {
+				OVS_NLERR(
+				    log,
+				    "Invalid MD length %d for MD type %d",
+				    mdlen,
+				    mdtype
+				);
+				return -EINVAL;
+			}
+			break;
+		default:
+			OVS_NLERR(log, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(log, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if (has_md1 && has_md2) {
+		OVS_NLERR(
+		    1,
+		    "invalid nsh attribute: md1 and md2 are exclusive."
+		);
+		return -EINVAL;
+	}
+
+	if (!is_mask) {
+		if ((has_md1 && mdtype != NSH_M_TYPE1) ||
+		    (has_md2 && mdtype != NSH_M_TYPE2)) {
+			OVS_NLERR(1, "nsh attribute has unmatched MD type %d.",
+				  mdtype);
+			return -EINVAL;
+		}
+
+		if (is_push_nsh &&
+		    (!has_base || (!has_md1 && !has_md2))) {
+			OVS_NLERR(
+			    1,
+			    "push_nsh: missing base or metadata attributes"
+			);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				u64 attrs, const struct nlattr **a,
 				bool is_mask, bool log)
@@ -1306,6 +1550,13 @@ static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 		attrs &= ~(1 << OVS_KEY_ATTR_ARP);
 	}
 
+	if (attrs & (1 << OVS_KEY_ATTR_NSH)) {
+		if (nsh_key_put_from_nlattr(a[OVS_KEY_ATTR_NSH], match,
+					    is_mask, false, log) < 0)
+			return -EINVAL;
+		attrs &= ~(1 << OVS_KEY_ATTR_NSH);
+	}
+
 	if (attrs & (1 << OVS_KEY_ATTR_MPLS)) {
 		const struct ovs_key_mpls *mpls_key;
 
@@ -1622,6 +1873,34 @@ static int ovs_nla_put_vlan(struct sk_buff *skb, const struct vlan_head *vh,
 	return 0;
 }
 
+static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
+			     struct sk_buff *skb)
+{
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(nsh->base), &nsh->base))
+		goto nla_put_failure;
+
+	if (is_mask || nsh->base.mdtype == NSH_M_TYPE1) {
+		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1,
+			    sizeof(nsh->context), nsh->context))
+			goto nla_put_failure;
+	}
+
+	/* Don't support MD type 2 yet */
+
+	nla_nest_end(skb, start);
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 			     const struct sw_flow_key *output, bool is_mask,
 			     struct sk_buff *skb)
@@ -1750,6 +2029,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 		ipv6_key->ipv6_tclass = output->ip.tos;
 		ipv6_key->ipv6_hlimit = output->ip.ttl;
 		ipv6_key->ipv6_frag = output->ip.frag;
+	} else if (swkey->eth.type == htons(ETH_P_NSH)) {
+		if (nsh_key_to_nlattr(&output->nsh, is_mask, skb))
+			goto nla_put_failure;
 	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
 		   swkey->eth.type == htons(ETH_P_RARP)) {
 		struct ovs_key_arp *arp_key;
@@ -2242,6 +2524,19 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	return err;
 }
 
+static bool validate_nsh(const struct nlattr *attr, bool is_mask,
+			 bool is_push_nsh, bool log)
+{
+	struct sw_flow_match match;
+	struct sw_flow_key key;
+	int ret = 0;
+
+	ovs_match_init(&match, &key, true, NULL);
+	ret = nsh_key_put_from_nlattr(attr, &match, is_mask,
+				      is_push_nsh, log);
+	return !ret;
+}
+
 /* Return false if there are any non-masked bits set.
  * Mask follows data immediately, before any netlink padding.
  */
@@ -2384,6 +2679,11 @@ static int validate_set(const struct nlattr *a,
 
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		if (!validate_nsh(nla_data(a), masked, false, log))
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -2482,6 +2782,8 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
 			[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
 			[OVS_ACTION_ATTR_POP_ETH] = 0,
+			[OVS_ACTION_ATTR_PUSH_NSH] = (u32)-1,
+			[OVS_ACTION_ATTR_POP_NSH] = 0,
 		};
 		const struct ovs_action_push_vlan *vlan;
 		int type = nla_type(a);
@@ -2636,6 +2938,19 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			mac_proto = MAC_PROTO_ETHERNET;
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_NSH:
+			mac_proto = MAC_PROTO_NONE;
+			if (!validate_nsh(nla_data(a), false, true, true))
+				return -EINVAL;
+			break;
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			if (key->nsh.base.np == TUN_P_ETHERNET)
+				mac_proto = MAC_PROTO_ETHERNET;
+			else
+				mac_proto = MAC_PROTO_NONE;
+			break;
+
 		default:
 			OVS_NLERR(log, "Unknown Action type %d", type);
 			return -EINVAL;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 929c665..6657606 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -79,4 +79,9 @@ int ovs_nla_put_actions(const struct nlattr *attr,
 void ovs_nla_free_flow_actions(struct sw_flow_actions *);
 void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *);
 
+int nsh_key_from_nlattr(const struct nlattr *attr, struct ovs_key_nsh *nsh,
+			struct ovs_key_nsh *nsh_mask);
+int nsh_hdr_from_nlattr(const struct nlattr *attr, struct nshhdr *nh,
+			size_t size);
+
 #endif /* flow_netlink.h */
-- 
2.5.5

^ permalink raw reply related

* Re: net: macb: fail when there's no PHY
From: Harini Katakam @ 2017-09-29  7:05 UTC (permalink / raw)
  To: Brandon Streiff; +Cc: Grant Edwards, Florian Fainelli, netdev
In-Reply-To: <1506029751-25249-1-git-send-email-brandon.streiff@ni.com>

Hi Brandon,

On Fri, Sep 22, 2017 at 3:05 AM, Brandon Streiff <brandon.streiff@ni.com> wrote:
>> On Thu, Sep 21, 2017 at 01:05:57PM -0700, Florian Fainelli wrote:
<snip>
>
> I have a board that's in a similar boat. My workaround was to undo
> portions of dacdbb4dfc1a with the following patch; this lets me still
> use fixed-link and have MDIO (to configure a switch), but not require
> a PHY.
>
> There was a patch set last year by Harini Katakam ("net: macb: Add MDIO
> driver for accessing multiple PHY devices") that might ultimately be a
> better approach to tackling this problem, although I haven't seen any
> further chatter on it.

That patch was not backward compatible and I was trying to find a better
solution for a common MDIO bus.
I plan to send a new series next month and work on the review comments.

Regards,
Harini

^ permalink raw reply

* [net-next:master 332/339] kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of different size
From: kbuild test robot @ 2017-09-29  6:54 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kbuild-all, netdev

[-- Attachment #1: Type: text/plain, Size: 3572 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   fa8fefaa678ea390b873195d19c09930da84a4bb
commit: cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9 [332/339] bpf: Add name, load_time, uid and map_ids to bpf_prog_info
config: blackfin-allmodconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All warnings (new ones prefixed by >>):

   kernel//bpf/syscall.c: In function 'bpf_prog_get_info_by_fd':
>> kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      u32 *user_map_ids = (u32 *)info.map_ids;
                          ^

vim +1404 kernel//bpf/syscall.c

  1371	
  1372	static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
  1373					   const union bpf_attr *attr,
  1374					   union bpf_attr __user *uattr)
  1375	{
  1376		struct bpf_prog_info __user *uinfo = u64_to_user_ptr(attr->info.info);
  1377		struct bpf_prog_info info = {};
  1378		u32 info_len = attr->info.info_len;
  1379		char __user *uinsns;
  1380		u32 ulen;
  1381		int err;
  1382	
  1383		err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
  1384		if (err)
  1385			return err;
  1386		info_len = min_t(u32, sizeof(info), info_len);
  1387	
  1388		if (copy_from_user(&info, uinfo, info_len))
  1389			return -EFAULT;
  1390	
  1391		info.type = prog->type;
  1392		info.id = prog->aux->id;
  1393		info.load_time = prog->aux->load_time;
  1394		info.created_by_uid = from_kuid_munged(current_user_ns(),
  1395						       prog->aux->user->uid);
  1396	
  1397		memcpy(info.tag, prog->tag, sizeof(prog->tag));
  1398		memcpy(info.name, prog->aux->name, sizeof(prog->aux->name));
  1399	
  1400		ulen = info.nr_map_ids;
  1401		info.nr_map_ids = prog->aux->used_map_cnt;
  1402		ulen = min_t(u32, info.nr_map_ids, ulen);
  1403		if (ulen) {
> 1404			u32 *user_map_ids = (u32 *)info.map_ids;
  1405			u32 i;
  1406	
  1407			for (i = 0; i < ulen; i++)
  1408				if (put_user(prog->aux->used_maps[i]->id,
  1409					     &user_map_ids[i]))
  1410					return -EFAULT;
  1411		}
  1412	
  1413		if (!capable(CAP_SYS_ADMIN)) {
  1414			info.jited_prog_len = 0;
  1415			info.xlated_prog_len = 0;
  1416			goto done;
  1417		}
  1418	
  1419		ulen = info.jited_prog_len;
  1420		info.jited_prog_len = prog->jited_len;
  1421		if (info.jited_prog_len && ulen) {
  1422			uinsns = u64_to_user_ptr(info.jited_prog_insns);
  1423			ulen = min_t(u32, info.jited_prog_len, ulen);
  1424			if (copy_to_user(uinsns, prog->bpf_func, ulen))
  1425				return -EFAULT;
  1426		}
  1427	
  1428		ulen = info.xlated_prog_len;
  1429		info.xlated_prog_len = bpf_prog_insn_size(prog);
  1430		if (info.xlated_prog_len && ulen) {
  1431			uinsns = u64_to_user_ptr(info.xlated_prog_insns);
  1432			ulen = min_t(u32, info.xlated_prog_len, ulen);
  1433			if (copy_to_user(uinsns, prog->insnsi, ulen))
  1434				return -EFAULT;
  1435		}
  1436	
  1437	done:
  1438		if (copy_to_user(uinfo, &info, info_len) ||
  1439		    put_user(info_len, &uattr->info.info_len))
  1440			return -EFAULT;
  1441	
  1442		return 0;
  1443	}
  1444	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45901 bytes --]

^ permalink raw reply

* Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT
From: Jesper Dangaard Brouer @ 2017-09-29  6:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, edumazet, brouer
In-Reply-To: <59CD7B94.8010103@iogearbox.net>

On Fri, 29 Sep 2017 00:45:40 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> > Introducing a new way to redirect XDP frames.  Notice how no driver
> > changes are necessary given the design of XDP_REDIRECT.
> >
> > This redirect map type is called 'cpumap', as it allows redirection
> > XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
> > and start the network stack invocation on that CPU.
> >
> > This is a scalability and isolation mechanism, that allow separating
> > the early driver network XDP layer, from the rest of the netstack, and
> > assigning dedicated CPUs for this stage.  The sysadm control/configure
> > the RX-CPU to NIC-RX queue (as usual) via procfs smp_affinity and how
> > many queues are configured via ethtool --set-channels.  Benchmarks
> > show that a single CPU can handle approx 11Mpps.  Thus, only assigning
> > two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s
> > wirespeed smallest packet 14.88Mpps.  Reducing the number of queues
> > have the advantage that more packets being "bulk" available per hard
> > interrupt[1].
> >
> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> >
> > Use-cases:
> >
> > 1. End-host based pre-filtering for DDoS mitigation.  This is fast
> >     enough to allow software to see and filter all packets wirespeed.
> >     Thus, no packets getting silently dropped by hardware.
> >
> > 2. Given NIC HW unevenly distributes packets across RX queue, this
> >     mechanism can be used for redistribution load across CPUs.  This
> >     usually happens when HW is unaware of a new protocol.  This
> >     resembles RPS (Receive Packet Steering), just faster, but with more
> >     responsibility placed on the BPF program for correct steering.
> >
> > 3. Auto-scaling or power saving via only activating the appropriate
> >     number of remote CPUs for handling the current load.  The cpumap
> >     tracepoints can function as a feedback loop for this purpose.  
> 
> Interesting work, thanks! Still digesting the code a bit. I think
> it pretty much goes into the direction that Eric describes in his
> netdev paper quoted above; not on a generic level though but specific
> to XDP at least; theoretically XDP could just run transparently on
> the CPU doing the filtering, and raw buffers are handed to remote
> CPU with similar batching, but it would need some different config
> interface at minimum.

Good that you noticed this is (implicit) implementing RX bulking, which
is where much of the performance gain originates from.

It is true, I am inspired by Eric's paper (I love it). Do notice that
this is not blocking or interfering with Erics/others continued work in
this area.  This implementation just show that the section "break the
pipe!" idea works very well for XDP. 

More on config knobs below.
 
> Shouldn't we take the CPU(s) running XDP on the RX queues out from
> the normal process scheduler, so that we have a guarantee that user
> space or unrelated kernel tasks cannot interfere with them anymore,
> and we could then turn them into busy polling eventually (e.g. as
> long as XDP is running there and once off could put them back into
> normal scheduling domain transparently)?

We should be careful not to invent networking config knobs that belongs
to other parts of the kernel, like the scheduler.  We already have
ability to control where IRQ's land via procfs smp_affinity.  And if
you want to avoid CPU isolation, we can use the boot cmdline
"isolcpus" (hint like DPDK recommend/use for zero-loss configs).  It is
the userspace tool (or sysadm) loading the XDP program, who is
responsible for having configures the CPU smp_affinity alignment.

Making NAPI busy-poll is out of scope for this patchset. Someone
should work on this separately.  It would just help/improve this kind
of scheme.

I actually think it would be more relevant to add/put the "remote" CPUs
in the 'cpumap' into a separate scheduler group.  To implement stuff
like auto-scaling and power-saving.


> What about RPS/RFS in the sense that once you punt them to remote
> CPU, could we reuse application locality information so they'd end
> up on the right CPU in the first place (w/o backlog detour), or is
> the intent to rather disable it and have some own orchestration
> with relation to the CPU map?

An advanced bpf orchestration could basically implement what you
describe, combined with a userspace side tool that taskset/pin
applications.  To know when a task can move between CPUs, you use the
tracepoints to see when the CPU queue is empty (hint, time_limit=true
and processed=0).

For now, I'm not targeting such advanced use-cases.  My main target is
a customer that have double tagged VLANS, and ixgbe cannot RSS
distribute these, thus they all end-up on queue 0.  And as I
demonstrated (in another email) RPS is too slow to fix this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox