Netdev List


* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: David Miller @ 2017-09-29  5:04 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, roopa, bridge, stephen
In-Reply-To: <1506517964-17479-1-git-send-email-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Wed, 27 Sep 2017 16:12:44 +0300

> We need to be able to transparently forward most link-local frames via
> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
> mask which restricts the forwarding of STP and LACP, but we need to be able
> to forward these over tunnels and control that forwarding on a per-port
> basis thus add a new per-port group_fwd_mask option which only disallows
> mac pause frames to be forwarded (they're always dropped anyway).
> The patch does not change the current default situation - all of the others
> are still restricted unless configured for forwarding.
> We have successfully tested this patch with LACP and STP forwarding over
> VxLAN and qinq tunnels.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Applied, thanks.
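
For readers following along, here is a minimal sketch of the kind of mask check the description implies. It assumes that bit N of a mask stands for the link-local group address 01:80:C2:00:00:0N, that the bridge-wide mask can never enable STP, MAC pause or LACP, and that the per-port mask can never enable MAC pause. The GROUPFWD_* names and should_forward_link_local() are made up for illustration and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; bit N stands for group address 01:80:C2:00:00:0N. */
#define GROUPFWD_STP       (1u << 0)
#define GROUPFWD_MACPAUSE  (1u << 1)
#define GROUPFWD_LACP      (1u << 2)

/* Bits that can never be enabled through the bridge-wide mask (assumed). */
#define BRIDGE_RESTRICTED  (GROUPFWD_STP | GROUPFWD_MACPAUSE | GROUPFWD_LACP)
/* Bits that can never be enabled through the per-port mask (assumed). */
#define PORT_RESTRICTED    (GROUPFWD_MACPAUSE)

static const uint8_t link_local_prefix[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

/* Return true if a frame addressed to 'dest' should be forwarded. */
static bool should_forward_link_local(const uint8_t dest[6],
                                      uint16_t bridge_mask,
                                      uint16_t port_mask)
{
    /* Not in 01:80:C2:00:00:0X, so ordinary forwarding applies. */
    if (memcmp(dest, link_local_prefix, sizeof(link_local_prefix)) != 0)
        return true;
    if (dest[5] > 0x0f)
        return true;

    /* Strip the bits each mask is not allowed to carry, then combine. */
    uint16_t mask = (bridge_mask & ~BRIDGE_RESTRICTED) |
                    (port_mask & ~PORT_RESTRICTED);

    return (mask & (1u << dest[5])) != 0;
}
```

With these assumptions, LACP (bit 2) and STP (bit 0) can be forwarded only when the ingress port opts in, while MAC pause frames are never forwarded.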


* Re: [PATCH v2 net-next 0/2] net/sched: support tunnel options in cls_flower and act_tunnel_key
From: David Miller @ 2017-09-29  4:54 UTC (permalink / raw)
  To: simon.horman; +Cc: jiri, jhs, xiyou.wangcong, netdev, oss-drivers
In-Reply-To: <1506500194-17637-1-git-send-email-simon.horman@netronome.com>

From: Simon Horman <simon.horman@netronome.com>
Date: Wed, 27 Sep 2017 10:16:32 +0200

> Users of options:
> 
> * There are eBPF hooks to allow getting and setting tunnel metadata:
>   bpf_skb_set_tunnel_opt, bpf_skb_get_tunnel_opt.
> 
> * Open vSwitch is able to match and set Geneve and VXLAN-GBP options.
> 
> Neither of the above appear to assume any structure for the data.

I really worry about this.

These metadata option blobs are internal kernel data structures which we
could change at any point in time.  They are not exported to
userspace as a UAPI.

It's kinda OK for eBPF programs to access this stuff since they are
expected to cope with changes to internal data-structures.

But for anything user facing, this really doesn't work.


* Re: [PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers
From: Matteo Croce @ 2017-09-29 10:47 UTC (permalink / raw)
  To: Erik Kline; +Cc: David Miller, netdev, linux-doc
In-Reply-To: <CAAedzxrLurc=oGy64ePO30j3QQ_sL5kJCEz+RnTk3LMSn_yvyQ@mail.gmail.com>

On Thu, Sep 28, 2017 at 1:22 PM, Erik Kline <ek@google.com> wrote:
> Upon further reflection, doesn't the whole premise of this change
> mean that it's no longer possible to selectively disable these
> features if they are set on "all"?  Or are we saying that this mode is
> only supported with "default" enable + "ifname" disable?

Hi Erik, thanks for the review.
Yes, the behaviour when writing all.accept_dad seems wrong with respect to
what the documentation says.

BTW, the previous behaviour was not defined; I put them in OR because
that's what other handlers do, e.g. send_redirects.
If you think that it's better to put them in AND we can change the
documentation accordingly.
What do you think?

-- 
Matteo Croce
per aspera ad upstream


* netlink backwards compatibility in userspace tools
From: Jason A. Donenfeld @ 2017-09-29 10:22 UTC (permalink / raw)
  To: Netdev, LKML; +Cc: Daniel Kahn Gillmor

Hi guys,

One handy aspect of Netlink is that it's backwards compatible. This
means that you can run old userspace utilities on new kernels, even if
the new kernel supports new features and netlink attributes. The wire
format is stable enough that the data marshaled can be extended
without breaking compat. Neat.
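
To illustrate why an extensible type-length-value format tolerates new attributes, here is a minimal sketch of a parser that skips what it does not know. The tlv_hdr struct is a simplified stand-in with 4-byte alignment, not the kernel's netlink attribute definition, and the ATTR_* values are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an attribute header (not the kernel's struct). */
struct tlv_hdr {
    uint16_t len;   /* header plus payload, excluding padding */
    uint16_t type;
};

/* Hypothetical attribute types known to this "old" tool. */
enum { ATTR_PORT = 1, ATTR_MTU = 2, ATTR_MAX_KNOWN = 2 };

/* Round a length up to the next multiple of 4. */
static size_t tlv_align(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Walk a buffer of attributes and count known and unknown types.
 * Unknown types are skipped by their length, so parsing stays in sync.
 * Returns 0 on success, -1 if an attribute length is malformed.
 */
static int walk_attrs(const uint8_t *buf, size_t buflen,
                      unsigned *known, unsigned *unknown)
{
    size_t off = 0;

    *known = 0;
    *unknown = 0;
    while (buflen - off >= sizeof(struct tlv_hdr)) {
        struct tlv_hdr h;
        size_t step;

        memcpy(&h, buf + off, sizeof(h));
        if (h.len < sizeof(h) || h.len > buflen - off)
            return -1;

        if (h.type >= 1 && h.type <= ATTR_MAX_KNOWN)
            (*known)++;
        else
            (*unknown)++;

        /* The last attribute may lack trailing padding. */
        step = tlv_align(h.len);
        if (step > buflen - off)
            step = buflen - off;
        off += step;
    }
    return 0;
}
```

A non-zero unknown count is the condition under which a tool could decide whether to warn.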

I was wondering, though, what you think the best stance is toward
these old userspace utilities. What should they do if the kernel sends
them netlink attributes that they do not recognize? At the moment, I'm
doing something like this:

static void warn_unrecognized(void)
{
    static bool once = false;
    if (once)
        return;
    once = true;
    fprintf(stderr,
        "Warning: this program received from your kernel one or more\n"
        "attributes that it did not recognize. It is possible that\n"
        "this version of wg(8) is older than your kernel. You may\n"
        "want to update this program.\n");
}

This seems like a somewhat sensible warning, but then I wonder about
distributions like Debian, which have a long stable life cycle and so
frequently ship very old tools (ancient iproute2, for example). Then,
VPS providers run these Debian images on top of newer kernels.
People in this situation would undoubtedly see the above warning a lot
and not be able to do anything about it. Not horrible, but a bit
annoying. Is this an okay annoyance? Or is it advised to just have no
warning at all? One idea would be to put it behind an environment
variable flag, but I don't like too many knobs.
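
For concreteness, the environment-variable idea could be sketched roughly as follows. The variable name WG_HIDE_UNRECOGNIZED_WARNING is invented for this example and is not an option of any existing tool. The function takes the output stream as a parameter and reports whether it printed, purely to make the sketch easy to exercise.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, chosen only for this sketch. */
#define QUIET_ENV "WG_HIDE_UNRECOGNIZED_WARNING"

/* Print the warning at most once; return true if it was printed. */
static bool warn_unrecognized(FILE *out)
{
    static bool once = false;
    const char *quiet = getenv(QUIET_ENV);

    if (once)
        return false;
    once = true;

    /* Stay silent if the user opted out with QUIET_ENV=1. */
    if (quiet && strcmp(quiet, "1") == 0)
        return false;

    fputs("Warning: this program received from your kernel one or more\n"
          "attributes that it did not recognize. It is possible that\n"
          "this version of wg(8) is older than your kernel. You may\n"
          "want to update this program.\n", out);
    return true;
}
```

The default here stays noisy; flipping the default to silent would be the opposite policy and is equally easy to express.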

I'm generally wondering about attitudes toward this kind of userspace
program behavior in response to newer kernels.

Thanks,
Jason


* RE: [PATCH net-next 03/10] sctp: factor out stream->in allocation
From: David Laight @ 2017-09-29 10:04 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner', netdev@vger.kernel.org
  Cc: linux-sctp@vger.kernel.org, Neil Horman, Vlad Yasevich, Xin Long
In-Reply-To: <a2d837095740147003164fe573794c140001ec71.1506536044.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner
> Sent: 28 September 2017 21:25
> Same idea as previous patch.

That needs a proper description.

	David


* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:50 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-3-jiri@resnulli.us>

On 28/09/17 20:34, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> In order to allow the ipmr module to do partial multicast forwarding
> according to the device parent ID, add the device parent ID field to the
> VIF struct. This way, the forwarding path can use the parent ID field
> without invoking switchdev calls, which requires the RTNL lock.
> 
> When a new VIF is added, set the device parent ID field in it by invoking
> the switchdev_port_attr_get call.
> 
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/mroute.h | 2 ++
>  net/ipv4/ipmr.c        | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
> index b072a84..a46577f 100644
> --- a/include/linux/mroute.h
> +++ b/include/linux/mroute.h
> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>  
>  struct vif_device {
>  	struct net_device 	*dev;			/* Device we are using */
> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
> +	bool		dev_parent_id_valid;
>  	unsigned long	bytes_in,bytes_out;
>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 292a8e8..4566c54 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -67,6 +67,7 @@
>  #include <net/fib_rules.h>
>  #include <linux/netconf.h>
>  #include <net/nexthop.h>
> +#include <net/switchdev.h>
>  
>  struct ipmr_rule {
>  	struct fib_rule		common;
> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  		   struct vifctl *vifc, int mrtsock)
>  {
>  	int vifi = vifc->vifc_vifi;
> +	struct switchdev_attr attr = {
> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
> +	};
>  	struct vif_device *v = &mrt->vif_table[vifi];
>  	struct net_device *dev;
>  	struct in_device *in_dev;
> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  
>  	/* Fill in the VIF structures */
>  
> +	attr.orig_dev = dev;
> +	if (!switchdev_port_attr_get(dev, &attr)) {
> +		v->dev_parent_id_valid = true;
> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> +	}
>  	v->rate_limit = vifc->vifc_rate_limit;
>  	v->local = vifc->vifc_lcl_addr.s_addr;
>  	v->remote = vifc->vifc_rmt_addr.s_addr;
> 

One more thing: what happens when a vif is deleted and then another device, one that
doesn't have a parent ID, is added with the same vif index? I think the vif will be
stuck with the parent_id it got when it was first set.
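
The concern above can be modelled in userspace as follows. This is only a sketch: struct phys_id and struct vif_model are simplified stand-ins for the kernel structures, the 32-byte size is an assumption, and vif_model_add() is a hypothetical helper showing one way to avoid inheriting a stale ID, namely resetting the slot before the lookup result is applied.

```c
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX 32   /* assumed size for this model */

/* Simplified stand-in for a physical item ID. */
struct phys_id {
    unsigned char id[PHYS_ID_MAX];
    unsigned char id_len;
};

/* Simplified stand-in for the fields the patch adds to the VIF. */
struct vif_model {
    struct phys_id dev_parent_id;
    bool dev_parent_id_valid;
};

/* 'ppid' is NULL when the modelled parent ID lookup fails for the device. */
static void vif_model_add(struct vif_model *v, const struct phys_id *ppid)
{
    /* Reset first so nothing survives from a previous device in this slot. */
    memset(&v->dev_parent_id, 0, sizeof(v->dev_parent_id));
    v->dev_parent_id_valid = false;

    if (ppid && ppid->id_len <= PHYS_ID_MAX) {
        memcpy(v->dev_parent_id.id, ppid->id, ppid->id_len);
        v->dev_parent_id.id_len = ppid->id_len;
        v->dev_parent_id_valid = true;
    }
}
```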


* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jason Wang @ 2017-09-29  9:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: jakub.kicinski, Michael S. Tsirkin, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Alexei Starovoitov,
	Andy Gospodarek
In-Reply-To: <150660343811.2808.7680200486950101509.stgit@firesoul>



On 2017年09月28日 20:57, Jesper Dangaard Brouer wrote:
> +};
> +
> +/* Convert xdp_buff to xdp_pkt */
> +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> +{
> +	struct xdp_pkt *xdp_pkt;
> +	int headroom;
> +
> +	/* Assure headroom is available for storing info */
> +	headroom = xdp->data - xdp->data_hard_start;
> +	if (headroom < sizeof(*xdp_pkt))
> +		return NULL;

Hi Jesper:

Do you consider this a trick or a long-term solution? Is it better to
store XDP in a circular buffer? (I'm asking since I met a similar issue
when doing xdp_xmit for tun.)

> +
> +	/* Store info in top of packet */
> +	xdp_pkt = xdp->data_hard_start;
> +
> +	xdp_pkt->data = xdp->data;
> +	xdp_pkt->len  = xdp->data_end - xdp->data;
> +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> +

Is wmb() needed here?

> +	return xdp_pkt;
> +}

Thanks


* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:45 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <7618b2e8-e7f9-d2c6-b13c-53aef0f50de0@cumulusnetworks.com>

On 29/09/17 12:29, Nikolay Aleksandrov wrote:
> On 28/09/17 20:34, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>>
>> In order to allow the ipmr module to do partial multicast forwarding
>> according to the device parent ID, add the device parent ID field to the
>> VIF struct. This way, the forwarding path can use the parent ID field
>> without invoking switchdev calls, which requires the RTNL lock.
>>
>> When a new VIF is added, set the device parent ID field in it by invoking
>> the switchdev_port_attr_get call.
>>
>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/mroute.h | 2 ++
>>  net/ipv4/ipmr.c        | 9 +++++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index b072a84..a46577f 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>>  
>>  struct vif_device {
>>  	struct net_device 	*dev;			/* Device we are using */
>> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
>> +	bool		dev_parent_id_valid;
>>  	unsigned long	bytes_in,bytes_out;
>>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 292a8e8..4566c54 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -67,6 +67,7 @@
>>  #include <net/fib_rules.h>
>>  #include <linux/netconf.h>
>>  #include <net/nexthop.h>
>> +#include <net/switchdev.h>
>>  
>>  struct ipmr_rule {
>>  	struct fib_rule		common;
>> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  		   struct vifctl *vifc, int mrtsock)
>>  {
>>  	int vifi = vifc->vifc_vifi;
>> +	struct switchdev_attr attr = {
>> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>> +	};
>>  	struct vif_device *v = &mrt->vif_table[vifi];
>>  	struct net_device *dev;
>>  	struct in_device *in_dev;
>> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  
>>  	/* Fill in the VIF structures */
>>  
>> +	attr.orig_dev = dev;
>> +	if (!switchdev_port_attr_get(dev, &attr)) {
>> +		v->dev_parent_id_valid = true;
>> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> 
> Hmm, shouldn't you set dev_parent_id.id_len too ? It would seem netdev_phys_item_id_same()
> uses it in the comparison and without the len it would always look like they're the same
> because memcmp will simply return 0 with count = 0.

Also, maybe we can use a non-zero id_len as the signal that it was set and drop the
dev_parent_id_valid field altogether; there seems to be no valid reason to have
id_len == 0 and yet expect a valid parent_id.

> 
>> +	}
>>  	v->rate_limit = vifc->vifc_rate_limit;
>>  	v->local = vifc->vifc_lcl_addr.s_addr;
>>  	v->remote = vifc->vifc_rmt_addr.s_addr;
>>
> 
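
A minimal userspace sketch of the suggestion above, treating a non-zero id_len as the only validity signal. struct phys_id is a simplified stand-in with an assumed 32-byte size, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX 32   /* assumed size for this model */

/* Simplified stand-in for a physical item ID. */
struct phys_id {
    unsigned char id[PHYS_ID_MAX];
    unsigned char id_len;
};

/* A parent ID counts as "set" exactly when its length is non-zero. */
static bool phys_id_is_set(const struct phys_id *p)
{
    return p->id_len != 0;
}

/* Clearing the ID also clears its validity; no separate flag is needed. */
static void phys_id_clear(struct phys_id *p)
{
    memset(p, 0, sizeof(*p));
}

/* Two ports share a parent only if both IDs are set and identical. */
static bool same_parent(const struct phys_id *a, const struct phys_id *b)
{
    return phys_id_is_set(a) && phys_id_is_set(b) &&
           a->id_len == b->id_len &&
           memcmp(a->id, b->id, a->id_len) == 0;
}
```

The explicit is-set guard in same_parent() matters: without it, two unset IDs would compare as equal.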


* Re: [PATCH net-next RFC 0/9] net: dsa: PTP timestamping for mv88e6xxx
From: Richard Cochran @ 2017-09-29  9:43 UTC (permalink / raw)
  To: Brandon Streiff
  Cc: netdev, linux-kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot, Erik Hons
In-Reply-To: <1506612341-18061-1-git-send-email-brandon.streiff@ni.com>

Brandon,

On Thu, Sep 28, 2017 at 10:25:32AM -0500, Brandon Streiff wrote:
> - Patch #2: We expose the switch time as a PTP clock but don't support
>   adjustment (max_adj=0).

The driver should implement a cyclecounter/timecounter.

> Our platform adjusted a systemwide oscillator
>   from userspace, so we didn't need adjustment at this layer, but other
>   PTP clock drivers support this and we probably should too.

We don't currently have any way to support this kind of HW or anything
like an external VCO.  I would like to find a way to do this, but that
is a different kettle of fish as it might require changes in the PHC
subsystem.  For this driver, I think we should get it merged using the
cyclecounter/timecounter (as that will benefit lots of users) and
worry about the external oscillator later.
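
As background on the cyclecounter/timecounter idea, here is a simplified userspace model: a free-running hardware counter of limited width is folded into a 64-bit nanosecond value by accumulating masked deltas. The struct and function names are hypothetical and the fractional remainder of the conversion is dropped, so this is a sketch of the concept rather than the kernel implementation.

```c
#include <stdint.h>

/* Describes the raw counter: ns = (cycles * mult) >> shift. */
struct cc_model {
    uint64_t mask;    /* valid counter bits, e.g. 0xffffffff for 32 bits */
    uint32_t mult;
    uint32_t shift;
};

/* Software state layered on top of the raw counter. */
struct tc_model {
    struct cc_model cc;
    uint64_t cycle_last;   /* raw counter value at the last update */
    uint64_t nsec;         /* accumulated nanoseconds */
};

static void tc_init(struct tc_model *tc, struct cc_model cc,
                    uint64_t raw_now, uint64_t start_ns)
{
    tc->cc = cc;
    tc->cycle_last = raw_now & cc.mask;
    tc->nsec = start_ns;
}

/* Fold the cycles elapsed since the last call into the ns value. */
static uint64_t tc_read(struct tc_model *tc, uint64_t raw_now)
{
    /* Masking the difference makes counter wrap-around harmless. */
    uint64_t delta = (raw_now - tc->cycle_last) & tc->cc.mask;

    tc->nsec += (delta * tc->cc.mult) >> tc->cc.shift;
    tc->cycle_last = raw_now & tc->cc.mask;
    return tc->nsec;
}
```

In this model, frequency adjustment amounts to scaling mult slightly up or down, and the state must be read at least once per counter wrap for the delta to stay unambiguous.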

> Feedback is appreciated.

I am happy to see this series.  I just finished porting an out-of-tree
PHC driver for the Marvell mv88e635x, and I want to mainline it, but I
also have a few uglies.

Unfortunately I am in the middle of a move right now, and so my review
of this series might have to wait a bit.  However, I am looking
forward to comparing notes, and then getting this support in.

Thanks,
Richard


* Re: [PATCH net-next 0/3] support changing steering policies in tuntap
From: Jason Wang @ 2017-09-29  9:41 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, Network Development, LKML
In-Reply-To: <CAF=yD-KuDZK0-YUfYde=dk_6M6fgj_opuiK2VTfCKDM_=MqDcw@mail.gmail.com>



On 2017年09月29日 00:09, Willem de Bruijn wrote:
> On Thu, Sep 28, 2017 at 3:23 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017年09月28日 07:25, Willem de Bruijn wrote:
>>>>> In the future, both simple and sophisticated policy like RSS or other
>>>>> guest
>>>>> driven steering policies could be done on top.
>>>> IMHO there should be a more practical example before adding all this
>>>> indirection. And it would be nice to understand why this queue selection
>>>> needs to be tun specific.
>>> I was thinking the same and this reminds me of the various strategies
>>> implemented in packet fanout. tun_cpu_select_queue is analogous to
>>> fanout_demux_cpu though it is tun-specific in that it requires
>>> tun->numqueues.
>>
>> Right, the main idea is to introduce a way to change flow steering policy
>> for tun. I think fanout policy could be implemented through the API
>> introduced in this series. (Current flow caches based automatic steering
>> method is tun specific).
>>
>>> Fanout accrued various strategies until it gained an eBPF variant. Just
>>> supporting BPF is probably sufficient here, too.
>>
>> Technically yes, but for tun, it also serve for virt. We probably still need
>> some hard coded policy which could be changed by guest until we can accept
>> an BPF program from guest I think?
> When would a guest choose the policy? As long as this is under control
> of a host user, possibly unprivileged, allowing BPF here is moot, as any
> user can run socket filter BPF already. Programming from the guest is
> indeed different. I don't fully understand that use case.

The problem is that userspace (qemu) knows little about what kind of workloads
will be run by the guest, so we need a guest-controllable method here, since
the guest knows the best steering policy. Rethinking this: instead of
passing eBPF from the guest, qemu can have some pre-defined sets of policies.
I will change the cpu-id-based method to an eBPF-based one in V2.
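
As a purely illustrative sketch of what "pre-defined sets of policies" could look like, the following userspace model selects a queue through a small policy table. None of these names come from the tun driver or from the series under discussion; the enum, struct pkt_info and select_queue() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical policy identifiers a guest could pick from. */
enum steer_policy { STEER_FLOW_HASH, STEER_CPU_ID, STEER_POLICY_MAX };

/* Per-packet inputs a policy may look at (assumed for this model). */
struct pkt_info {
    uint32_t flow_hash;
    unsigned cpu;
};

typedef unsigned (*select_queue_fn)(const struct pkt_info *p, unsigned n);

static unsigned select_by_hash(const struct pkt_info *p, unsigned n)
{
    return p->flow_hash % n;
}

static unsigned select_by_cpu(const struct pkt_info *p, unsigned n)
{
    return p->cpu % n;
}

static const select_queue_fn policies[STEER_POLICY_MAX] = {
    [STEER_FLOW_HASH] = select_by_hash,
    [STEER_CPU_ID]    = select_by_cpu,
};

/* Pick a queue; fall back to queue 0 on bad input. */
static unsigned select_queue(enum steer_policy pol,
                             const struct pkt_info *p, unsigned numqueues)
{
    if (numqueues == 0 || (unsigned)pol >= STEER_POLICY_MAX)
        return 0;
    return policies[pol](p, numqueues);
}
```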

Thanks


* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Hannes Frederic Sowa @ 2017-09-29  9:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, netdev,
	jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, edumazet
In-Reply-To: <1506677857.2478.5.camel@redhat.com>

Paolo Abeni <pabeni@redhat.com> writes:

> On Fri, 2017-09-29 at 09:56 +0200, Hannes Frederic Sowa wrote:
>> [adding Paolo, Eric]
>> 
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> > On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
>> 
>> [...]
>> 
>> > > +	wake_up_process(rcpu->kthread);
>> > 
>> > In general the whole thing looks like 'threaded NAPI' that Hannes was
>> > proposing some time back. I liked it back then and I like it now.
>> > I don't remember what were the objections back then.
>> > Something scheduler related?
>> > Adding Hannes.
>
> Beyond the added scheduling complexity, the threaded NAPI
> implementation proposed some time ago also possibly introduced OoO
> packet delivery, because the NAPI threads were left unbound to any CPU.

Right, yes, but that can be resolved. The problem was just in that
particular patch.


* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Paolo Abeni @ 2017-09-29  9:37 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, netdev, jakub.kicinski,
	Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Andy Gospodarek, edumazet
In-Reply-To: <8737760wg5.fsf@stressinduktion.org>

On Fri, 2017-09-29 at 09:56 +0200, Hannes Frederic Sowa wrote:
> [adding Paolo, Eric]
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> 
> [...]
> 
> > > +	wake_up_process(rcpu->kthread);
> > 
> > In general the whole thing looks like 'threaded NAPI' that Hannes was
> > proposing some time back. I liked it back then and I like it now.
> > I don't remember what were the objections back then.
> > Something scheduler related?
> > Adding Hannes.

Beyond the added scheduling complexity, the threaded NAPI
implementation proposed some time ago also possibly introduced OoO
packet delivery, because the NAPI threads were left unbound to any CPU.

Cheers,

Paolo


* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:29 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-3-jiri@resnulli.us>

On 28/09/17 20:34, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> In order to allow the ipmr module to do partial multicast forwarding
> according to the device parent ID, add the device parent ID field to the
> VIF struct. This way, the forwarding path can use the parent ID field
> without invoking switchdev calls, which requires the RTNL lock.
> 
> When a new VIF is added, set the device parent ID field in it by invoking
> the switchdev_port_attr_get call.
> 
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/mroute.h | 2 ++
>  net/ipv4/ipmr.c        | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
> index b072a84..a46577f 100644
> --- a/include/linux/mroute.h
> +++ b/include/linux/mroute.h
> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>  
>  struct vif_device {
>  	struct net_device 	*dev;			/* Device we are using */
> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
> +	bool		dev_parent_id_valid;
>  	unsigned long	bytes_in,bytes_out;
>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 292a8e8..4566c54 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -67,6 +67,7 @@
>  #include <net/fib_rules.h>
>  #include <linux/netconf.h>
>  #include <net/nexthop.h>
> +#include <net/switchdev.h>
>  
>  struct ipmr_rule {
>  	struct fib_rule		common;
> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  		   struct vifctl *vifc, int mrtsock)
>  {
>  	int vifi = vifc->vifc_vifi;
> +	struct switchdev_attr attr = {
> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
> +	};
>  	struct vif_device *v = &mrt->vif_table[vifi];
>  	struct net_device *dev;
>  	struct in_device *in_dev;
> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  
>  	/* Fill in the VIF structures */
>  
> +	attr.orig_dev = dev;
> +	if (!switchdev_port_attr_get(dev, &attr)) {
> +		v->dev_parent_id_valid = true;
> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);

Hmm, shouldn't you set dev_parent_id.id_len too? It would seem netdev_phys_item_id_same()
uses it in the comparison, and without the len they would always look the same
because memcmp will simply return 0 with count = 0.

> +	}
>  	v->rate_limit = vifc->vifc_rate_limit;
>  	v->local = vifc->vifc_lcl_addr.s_addr;
>  	v->remote = vifc->vifc_rmt_addr.s_addr;
> 
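
The review comment above can be reproduced with a small userspace model. It assumes the comparison checks the lengths first and then compares id_len bytes, as the comment describes; struct phys_id and the helper names are stand-ins invented for this sketch.

```c
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX 32   /* assumed size for this model */

/* Simplified stand-in for a physical item ID. */
struct phys_id {
    unsigned char id[PHYS_ID_MAX];
    unsigned char id_len;
};

/* Length-then-bytes comparison, modelled on the behaviour described above. */
static bool phys_id_same(const struct phys_id *a, const struct phys_id *b)
{
    return a->id_len == b->id_len &&
           memcmp(a->id, b->id, a->id_len) == 0;
}

/* Copy only the bytes, as the reviewed hunk does; id_len stays untouched. */
static void copy_bytes_only(struct phys_id *dst, const struct phys_id *src)
{
    memcpy(dst->id, src->id, src->id_len);
}

/* Copy the bytes and the length together. */
static void copy_with_len(struct phys_id *dst, const struct phys_id *src)
{
    memcpy(dst->id, src->id, src->id_len);
    dst->id_len = src->id_len;
}
```

With copy_bytes_only() on zero-initialised destinations, two different parent IDs compare as equal, because both lengths are 0 and memcmp over zero bytes returns 0. With copy_with_len() they compare as different.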


* RE: [Intel-wired-lan] [PATCH net v2] i40e: Fix limit imprecise of the number of MAC/VLAN that can be added for VFs
From: wangyunjian @ 2017-09-29  9:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jeff Kirsher, Sergei Shtylyov, Netdev, caihe,
	intel-wired-lan
In-Reply-To: <CAKgT0UcsnLUWU8qedNrv+bhbgJ6SC9zWyc7oR9gObcYo2atN5g@mail.gmail.com>



> -----Original Message-----
> From: Alexander Duyck [mailto:alexander.duyck@gmail.com]
> Sent: Thursday, September 28, 2017 11:44 PM
> To: wangyunjian <wangyunjian@huawei.com>
> Cc: David Miller <davem@davemloft.net>; Jeff Kirsher
> <jeffrey.t.kirsher@intel.com>; Sergei Shtylyov
> <sergei.shtylyov@cogentembedded.com>; Netdev
> <netdev@vger.kernel.org>; caihe <caihe@huawei.com>; intel-wired-lan
> <intel-wired-lan@lists.osuosl.org>
> Subject: Re: [Intel-wired-lan] [PATCH net v2] i40e: Fix limit imprecise of the
> number of MAC/VLAN that can be added for VFs
> 
> On Wed, Sep 27, 2017 at 7:01 PM, w00273186 <wangyunjian@huawei.com>
> wrote:
> > From: Yunjian Wang <wangyunjian@huawei.com>
> >
> > Now it doesn't limit the number of MAC/VLAN strictly. When there is more
> > elements in the virtchnl MAC/VLAN list, it can still add successfully.
> 
> You could still add, but should you? I'm not clear from this patch
> description what this is supposed to be addressing. If you enable the
> "trust" flag for a VF via the "ip link set dev <iface> vf <vfnum>
> trust on" it can make use of any resources on the device, but without
> that there is an upper limit that is supposed to be enforced to
> prevent the VF from making use of an excessive amount of resources.
> That is what is being enforced by the code you are moving out of the
> way below.

I don't enable the "trust" flag for a VF, but this script can still add
more than I40E_VC_MAX_MAC_ADDR_PER_VF (12) MACs in the VM. The same
problem exists with VLANs.

Test script:

for((i=10;i<50;i++))
do
    ipmaddr add 01:00:5e:01:02:$i  dev eth0
done

for ((i=1;i<40;i++))
do
    ip link add link eth0 name eth0.$i type vlan id $i
done

> 
> > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 27 +++++++++++++--
> -------
> >  1 file changed, 16 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > index 4d1e670..285b96a 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> > @@ -2065,11 +2065,6 @@ static inline int i40e_check_vf_permission(struct
> i40e_vf *vf, u8 *macaddr)
> >                 dev_err(&pf->pdev->dev,
> >                         "VF attempting to override administratively set MAC address,
> reload the VF driver to resume normal operation\n");
> >                 ret = -EPERM;
> > -       } else if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
> > -                  !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > -               dev_err(&pf->pdev->dev,
> > -                       "VF is not trusted, switch the VF to trusted to add more
> functionality\n");
> > -               ret = -EPERM;
> >         }
> >         return ret;
> >  }
> > @@ -2128,6 +2123,15 @@ static int i40e_vc_add_mac_addr_msg(struct
> i40e_vf *vf, u8 *msg, u16 msglen)
> >                 } else {
> >                         vf->num_mac++;
> >                 }
> > +
> > +               if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
> > +                   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > +                       dev_err(&pf->pdev->dev,
> > +                               "VF is not trusted, switch the VF to trusted to add more
> functionality\n");
> > +                       ret = -EPERM;
> > +                       spin_unlock_bh(&vsi->mac_filter_hash_lock);
> > +                       goto error_param;
> > +               }
> >         }
> >         spin_unlock_bh(&vsi->mac_filter_hash_lock);
> >
> 
> This doesn't make any sense. You are doing the checks after you have
> already added the MAC. The only part you aren't doing is sending the
> message to the VF indicating that the request was successful.
> 
> > @@ -2221,12 +2225,6 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf
> *vf, u8 *msg, u16 msglen)
> >         i40e_status aq_ret = 0;
> >         int i;
> >
> > -       if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
> > -           !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > -               dev_err(&pf->pdev->dev,
> > -                       "VF is not trusted, switch the VF to trusted to add more VLAN
> addresses\n");
> > -               goto error_param;
> > -       }
> >         if (!test_bit(I40E_VF_STATE_ACTIVE, &vf->vf_states) ||
> >             !i40e_vc_isvalid_vsi_id(vf, vsi_id)) {
> >                 aq_ret = I40E_ERR_PARAM;
> > @@ -2269,6 +2267,13 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf
> *vf, u8 *msg, u16 msglen)
> >                         dev_err(&pf->pdev->dev,
> >                                 "Unable to add VLAN filter %d for VF %d, error %d\n",
> >                                 vfl->vlan_id[i], vf->vf_id, ret);
> > +               if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
> > +                   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
> > +                       dev_err(&pf->pdev->dev,
> > +                               "VF is not trusted, switch the VF to trusted to add more
> VLAN addresses\n");
> > +                       aq_ret = -EPERM;
> > +                       goto error_param;
> > +               }
> >         }
> >
> >  error_param:
> 
> Same here. You are doing this after the call to i40e_vsi_add_vlan. The
> code makes no sense here. This bit of code is supposed to be
> preventing a VF from abusing resources if the VF is not privileged.
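
To make the ordering point concrete, here is a userspace sketch in which the limit is validated against the whole request before any state changes. It is not the driver's code: struct vf_model and vf_add_macs() are hypothetical, and the limit of 12 is the value quoted earlier in this thread. The sketch assumes an untrusted VF may hold at most that many addresses in total.

```c
#include <stdbool.h>

#define MAX_MAC_PER_VF 12   /* value quoted in the thread */

/* Minimal stand-in for the per-VF accounting state. */
struct vf_model {
    unsigned num_mac;
    bool trusted;
};

/*
 * Validate the whole request up front and touch no state unless it fits.
 * Returns 0 on success, -1 if an untrusted VF would exceed the limit.
 */
static int vf_add_macs(struct vf_model *vf, unsigned requested)
{
    if (!vf->trusted &&
        (requested > MAX_MAC_PER_VF ||
         vf->num_mac > MAX_MAC_PER_VF - requested))
        return -1;   /* rejected before any filter is added */

    vf->num_mac += requested;
    return 0;
}
```

Because the check counts the entries in the request rather than only the current total, a single message carrying many addresses cannot slip past the limit.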


* RE: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: David Laight @ 2017-09-29  9:14 UTC (permalink / raw)
  To: 'Andrew Lunn', Tristram.Ha@microchip.com
  Cc: muvarov@gmail.com, pavel@ucw.cz, nathan.leigh.conrad@gmail.com,
	vivien.didelot@savoirfairelinux.com, f.fainelli@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Woojung.Huh@microchip.com
In-Reply-To: <20170928193416.GH14940@lunn.ch>

From: Andrew Lunn
> Sent: 28 September 2017 20:34
...
> > There are 34 counters.  In normal case using generic bus I/O or PCI to read them
> > is very quick, but the switch is mostly accessed using SPI, or even I2C.  As the SPI
> > access is very slow.
> 
> How slow is it? The Marvell switches all use MDIO. It is probably a
> bit faster than I2C, but it is a lot slower than MMIO or PCI.
> 
> ethtool -S lan0 takes about 25ms.

Is the SPI access software bit-banged?
Doing that with software delays isn't friendly to the rest of the system.
(Hardware guys please note...)

One possibility is to rate-limit the stats reading.
Then an application cannot completely 'hog' the SPI bandwidth.
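
A rough userspace sketch of that rate-limiting idea: counters are re-read over the slow bus at most once per interval, and callers in between get the cached copy. The one-second interval, struct stats_cache, get_stats() and the fake_spi_read() stand-in are all assumptions made for this example; the counter count of 34 is the figure quoted above.

```c
#include <stdint.h>
#include <string.h>

#define NUM_COUNTERS    34     /* counter count mentioned in the thread */
#define MIN_INTERVAL_MS 1000   /* assumed minimum refresh interval */

struct stats_cache {
    uint64_t counters[NUM_COUNTERS];
    uint64_t last_read_ms;
    int valid;
    unsigned bus_reads;   /* how often the slow bus was actually used */
};

typedef void (*slow_read_fn)(uint64_t out[NUM_COUNTERS]);

/* Stand-in for the slow SPI/I2C access; fills every counter with a call count. */
static void fake_spi_read(uint64_t out[NUM_COUNTERS])
{
    static uint64_t calls;
    int i;

    calls++;
    for (i = 0; i < NUM_COUNTERS; i++)
        out[i] = calls;
}

/* Return counters, refreshing from hardware at most once per interval. */
static void get_stats(struct stats_cache *c, uint64_t now_ms,
                      slow_read_fn slow_read, uint64_t out[NUM_COUNTERS])
{
    if (!c->valid || now_ms - c->last_read_ms >= MIN_INTERVAL_MS) {
        slow_read(c->counters);
        c->last_read_ms = now_ms;
        c->valid = 1;
        c->bus_reads++;
    }
    memcpy(out, c->counters, sizeof(c->counters));
}
```

The trade-off is staleness: a caller may see values up to one interval old in exchange for bounded bus usage.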

	David


* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Jesper Dangaard Brouer @ 2017-09-29  9:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, hannes, brouer
In-Reply-To: <20170929032146.vs5v454wjs4niu4k@ast-mbp>

On Thu, 28 Sep 2017 20:21:47 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> > The 'cpumap' is primary used as a backend map for XDP BPF helper
> > call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > 
> > This patch implement the main part of the map.  It is not connected to
> > the XDP redirect system yet, and no SKB allocation are done yet.
> > 
> > The main concern in this patch is to ensure the datapath can run
> > without any locking.  This adds complexity to the setup and tear-down
> > procedure, which assumptions are extra carefully documented in the
> > code comments.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/linux/bpf_types.h      |    1 
> >  include/uapi/linux/bpf.h       |    1 
> >  kernel/bpf/Makefile            |    1 
> >  kernel/bpf/cpumap.c            |  547 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/bpf/syscall.c           |    8 +
> >  tools/include/uapi/linux/bpf.h |    1 
> >  6 files changed, 558 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/bpf/cpumap.c
> > 
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 6f1a567667b8..814c1081a4a9 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
> >  #ifdef CONFIG_STREAM_PARSER
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
> >  #endif
> > +BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
> >  #endif
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e43491ac4823..f14e15702533 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -111,6 +111,7 @@ enum bpf_map_type {
> >  	BPF_MAP_TYPE_HASH_OF_MAPS,
> >  	BPF_MAP_TYPE_DEVMAP,
> >  	BPF_MAP_TYPE_SOCKMAP,
> > +	BPF_MAP_TYPE_CPUMAP,
> >  };
> >  
> >  enum bpf_prog_type {
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 897daa005b23..dba0bd33a43c 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> >  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> >  ifeq ($(CONFIG_NET),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> > +obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
> >  ifeq ($(CONFIG_STREAM_PARSER),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
> >  endif
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > new file mode 100644
> > index 000000000000..f0948af82e65
> > --- /dev/null
> > +++ b/kernel/bpf/cpumap.c
> > @@ -0,0 +1,547 @@
> > +/* bpf/cpumap.c
> > + *
> > + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> > + * Released under terms in GPL version 2.  See COPYING.
> > + */
> > +
> > +/* The 'cpumap' is primary used as a backend map for XDP BPF helper
> > + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > + *
> > + * Unlike devmap which redirect XDP frames out another NIC device,
> > + * this map type redirect raw XDP frames to another CPU.  The remote
> > + * CPU will do SKB-allocation and call the normal network stack.
> > + *
> > + * This is a scalability and isolation mechanism, that allow
> > + * separating the early driver network XDP layer, from the rest of the
> > + * netstack, and assigning dedicated CPUs for this stage.  This
> > + * basically allows for 10G wirespeed pre-filtering via bpf.
> > + */
> > +#include <linux/bpf.h>
> > +#include <linux/filter.h>
> > +#include <linux/ptr_ring.h>
> > +
> > +#include <linux/sched.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/kthread.h>
> > +
> > +/*
> > + * General idea: XDP packets getting XDP redirected to another CPU,
> > + * will maximum be stored/queued for one driver ->poll() call.  It is
> > + * guaranteed that setting flush bit and flush operation happen on
> > + * same CPU.  Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
> > + * which queue in bpf_cpu_map_entry contains packets.
> > + */
> > +
> > +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> > +struct xdp_bulk_queue {
> > +	void *q[CPU_MAP_BULK_SIZE];
> > +	unsigned int count;
> > +};
> > +
> > +/* Struct for every remote "destination" CPU in map */
> > +struct bpf_cpu_map_entry {
> > +	u32 cpu;    /* kthread CPU and map index */
> > +	int map_id; /* Back reference to map */
> > +	u32 qsize;  /* Redundant queue size for map lookup */
> > +
> > +	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> > +	struct xdp_bulk_queue __percpu *bulkq;
> > +
> > +	/* Queue with potential multi-producers, and single-consumer kthread */
> > +	struct ptr_ring *queue;
> > +	struct task_struct *kthread;
> > +	struct work_struct kthread_stop_wq;
> > +
> > +	atomic_t refcnt; /* Control when this struct can be free'ed */
> > +	struct rcu_head rcu;
> > +};
> > +
> > +struct bpf_cpu_map {
> > +	struct bpf_map map;
> > +	/* Below members specific for map type */
> > +	struct bpf_cpu_map_entry **cpu_map;
> > +	unsigned long __percpu *flush_needed;
> > +};
> > +
> > +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> > +			     struct xdp_bulk_queue *bq);
> > +
> > +static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
> > +{
> > +	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
> > +}
> > +
> > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_cpu_map *cmap;
> > +	u64 cost;
> > +	int err;
> > +
> > +	/* check sanity of attributes */
> > +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> > +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> > +	if (!cmap)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* mandatory map attributes */
> > +	cmap->map.map_type = attr->map_type;
> > +	cmap->map.key_size = attr->key_size;
> > +	cmap->map.value_size = attr->value_size;
> > +	cmap->map.max_entries = attr->max_entries;
> > +	cmap->map.map_flags = attr->map_flags;
> > +	cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> > +
> > +	/* make sure page count doesn't overflow */
> > +	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> > +	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> > +	if (cost >= U32_MAX - PAGE_SIZE)
> > +		goto free_cmap;
> > +	cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	/* if map size is larger than memlock limit, reject it early */
> > +	err = bpf_map_precharge_memlock(cmap->map.pages);
> > +	if (err)
> > +		goto free_cmap;
> > +
> > +	/* A per cpu bitfield with a bit per possible CPU in map  */
> > +	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> > +					    __alignof__(unsigned long));
> > +	if (!cmap->flush_needed)
> > +		goto free_cmap;
> > +
> > +	/* Alloc array for possible remote "destination" CPUs */
> > +	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> > +					   sizeof(struct bpf_cpu_map_entry *),
> > +					   cmap->map.numa_node);
> > +	if (!cmap->cpu_map)
> > +		goto free_cmap;
> > +
> > +	return &cmap->map;
> > +free_cmap:
> > +	free_percpu(cmap->flush_needed);
> > +	kfree(cmap);
> > +	return ERR_PTR(-ENOMEM);
> > +}
> > +
> > +void __cpu_map_queue_destructor(void *ptr)
> > +{
> > +	/* For now, just catch this as an error */
> > +	if (!ptr)
> > +		return;
> > +	pr_err("ERROR: %s() cpu_map queue was not empty\n", __func__);
> > +	page_frag_free(ptr);
> > +}
> > +
> > +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> > +		/* The queue should be empty at this point */
> > +		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> > +		kfree(rcpu->queue);
> > +		kfree(rcpu);
> > +	}
> > +}
> > +
> > +static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	atomic_inc(&rcpu->refcnt);
> > +}
> > +
> > +/* called from workqueue, to workaround syscall using preempt_disable */
> > +static void cpu_map_kthread_stop(struct work_struct *work)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu;
> > +
> > +	rcpu = container_of(work, struct bpf_cpu_map_entry, kthread_stop_wq);
> > +	synchronize_rcu(); /* wait for flush in __cpu_map_entry_free() */
> > +	kthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */
> > +}
> > +
> > +static int cpu_map_kthread_run(void *data)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu = data;
> > +
> > +	set_current_state(TASK_INTERRUPTIBLE);
> > +	while (!kthread_should_stop()) {
> > +		struct xdp_pkt *xdp_pkt;
> > +
> > +		schedule();
> > +		/* Do work */
> > +		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > +			/* For now just "refcnt-free" */
> > +			page_frag_free(xdp_pkt);
> > +		}
> > +		__set_current_state(TASK_INTERRUPTIBLE);
> > +	}
> > +	put_cpu_map_entry(rcpu);
> > +
> > +	__set_current_state(TASK_RUNNING);
> > +	return 0;
> > +}
> > +
> > +struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id)
> > +{
> > +	gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
> > +	struct bpf_cpu_map_entry *rcpu;
> > +	int numa, err;
> > +
> > +	/* Have map->numa_node, but choose node of redirect target CPU */
> > +	numa = cpu_to_node(cpu);
> > +
> > +	rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
> > +	if (!rcpu)
> > +		return NULL;
> > +
> > +	/* Alloc percpu bulkq */
> > +	rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
> > +					 sizeof(void *), gfp);
> > +	if (!rcpu->bulkq)
> > +		goto fail;
> > +
> > +	/* Alloc queue */
> > +	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
> > +	if (!rcpu->queue)
> > +		goto fail;
> > +
> > +	err = ptr_ring_init(rcpu->queue, qsize, gfp);
> > +	if (err)
> > +		goto fail;
> > +	rcpu->qsize = qsize;
> > +
> > +	/* Setup kthread */
> > +	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
> > +					       "cpumap/%d/map:%d", cpu, map_id);
> > +	if (IS_ERR(rcpu->kthread))
> > +		goto fail;
> > +
> > +	/* Make sure kthread runs on a single CPU */
> > +	kthread_bind(rcpu->kthread, cpu);  
> 
> is there a check that max_entries <= num_possible_cpu ? I couldn't
> find it. otherwise it will be binding to impossible cpu?

Good point! -- I'll find an appropriate place to add such a limit.
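
A minimal sketch of what such a limit could look like, modelled in plain
userspace C so it can be compiled on its own. The helper name
cpu_map_check_attr(), the nr_cpus parameter and the choice of -E2BIG are
assumptions made for illustration; they are not taken from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model of the attribute sanity check.
 * nr_cpus stands in for whatever CPU limit the kernel code ends up
 * using (assumption, not part of the posted patch).
 */
static int cpu_map_check_attr(uint32_t max_entries, uint32_t key_size,
			      uint32_t value_size, uint32_t nr_cpus)
{
	/* Same basic checks as in the posted cpu_map_alloc() */
	if (max_entries == 0 || key_size != 4 || value_size != 4)
		return -EINVAL;

	/* The map index is the CPU number, so an index at or beyond the
	 * CPU limit could never be bound to a real CPU.
	 */
	if (max_entries > nr_cpus)
		return -E2BIG;

	return 0;
}
```

Note that the possible-CPU mask can be sparse, so a real check may also
need to validate the individual CPU number at map-update time.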


> > +	wake_up_process(rcpu->kthread);  
> 
> In general the whole thing looks like 'threaded NAPI' that Hannes was
> proposing some time back. I liked it back then and I like it now.
> I don't remember what were the objections back then.
> Something scheduler related?
> Adding Hannes.

It is related to the 'threaded NAPI' idea[1], and I chose kthreads
because that is what this patch[1] used.
(Link to Hannes & Paolo's patch:[1] http://patchwork.ozlabs.org/patch/620657/)

It's less intrusive, as it's only activated specifically when activating
bpf+XDP+cpumap.  Plus, it's not taking over the calling of napi->poll,
it is "just" making the "cost" of calling napi->poll significantly
smaller, as it moves invoking the network stack to another kthread. And
the choice is made on a per-packet level (you don't get more
flexibility than that).

> Still curious about the questions I asked in the other thread
> on what's causing it to be so much better than RPS

Answered in that thread.  It is simply that the RPS-RX CPU has to do
too much work (like memory allocations).  Plus it uses more expensive
IPI calls, where I use wake_up_process() which doesn't do an IPI if it
can see that the remote thread is already running.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next 0/5] bpf: Extend bpf_{prog,map}_info
From: David Miller @ 2017-09-29  5:17 UTC (permalink / raw)
  To: kafai; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20170927213756.1254938-1-kafai@fb.com>

From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 27 Sep 2017 14:37:51 -0700

> This patch series adds more fields to bpf_prog_info and bpf_map_info.
> Please see individual patch for details.

Great to see progress in the area of eBPF introspection.

Series applied, thanks.

^ permalink raw reply

* [PATCH net v1 1/1] tipc: use only positive error codes in messages
From: Parthasarathy Bhuvaragan @ 2017-09-29  8:02 UTC (permalink / raw)
  To: davem
  Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue,
	parthasarathy.bhuvaragan

In commit e3a77561e7d32 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, it is inserted
into the 4-bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

In this commit, we set only positive error codes.
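
The arithmetic behind the 0xf value can be sketched in standalone C.
pack_errcode() and ERRCODE_MASK are hypothetical names used only for this
illustration; they model a setter that masks a value into a 4-bit field
and are not the actual TIPC header helpers.

```c
#include <stdint.h>

#define TIPC_ERR_NO_NAME 1	/* value stated in the commit message */
#define ERRCODE_MASK	 0xfu	/* 4-bit error field (assumed mask) */

/* Hypothetical helper: keep only the low 4 bits of err, as a header
 * field setter would.  Converting a negative int to uint32_t wraps
 * modulo 2^32, so -1 becomes 0xffffffff before masking.
 */
static inline uint32_t pack_errcode(int err)
{
	return (uint32_t)err & ERRCODE_MASK;
}
```

With this model, pack_errcode(-TIPC_ERR_NO_NAME) yields 0xf, while
pack_errcode(TIPC_ERR_NO_NAME) yields 1.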

Fixes: e3a77561e7d32 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
---
 net/tipc/msg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 6ef379f004ac..121e59a1d0e7 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -551,7 +551,7 @@ bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err)
 		return false;
 	if (msg_errcode(msg))
 		return false;
-	*err = -TIPC_ERR_NO_NAME;
+	*err = TIPC_ERR_NO_NAME;
 	if (skb_linearize(skb))
 		return false;
 	msg = buf_msg(skb);
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v4 net-next 0/8] flow_dissector: Protocol specific flow dissector offload
From: Hannes Frederic Sowa @ 2017-09-29  7:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, rohit
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> This patch set adds a new offload type to perform flow dissection for
> specific protocols (either by EtherType or by IP protocol). This is
> primary useful to crack open UDP encapsulations (like VXLAN, GUE) for
> the purposes of parsing the encapsulated packet.
>
> Items in this patch set:
> - Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
>   on the code in the GRE dissect function and the special handling in
>   GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
>   goto proto_again is done)
> - Add infrastructure for protocol specific flow dissection offload
> - Add infrastructure to perform UDP flow dissection. Uses same model of
>   GRO where a flow_dissect callback can be associated with a UDP
>   socket
> - Use the infrastructure to support flow dissection of VXLAN and GUE
>
> Tested:
>
> Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
> that inner packet was being properly dissected.

I have the feeling that this patch series changes the behavior of flower
and thus causes uAPI problems.

flower seems to use the flow dissector results for parsing the inner
packets. In case of vxlan in vxlan encapsulation, which seems to become
more common (sigh!) you let part of the flow specification match on the
most inner header, while the flower ingress filter might want to match
inside the first encapsulation only.

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Hannes Frederic Sowa @ 2017-09-29  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, netdev, jakub.kicinski,
	Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Andy Gospodarek, pabeni,
	edumazet
In-Reply-To: <20170929032146.vs5v454wjs4niu4k@ast-mbp>

[adding Paolo, Eric]

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:

[...]

>> +	wake_up_process(rcpu->kthread);
>
> In general the whole thing looks like 'threaded NAPI' that Hannes was
> proposing some time back. I liked it back then and I like it now.
> I don't remember what were the objections back then.
> Something scheduler related?
> Adding Hannes.

Yes.

The main objection from Eric at that time was that user space now starts
to compete with the threaded NAPI threads depending on process
priorities, which are under control of user space. Softirq always runs
first to end. Networking could starve because a process with higher
priority is runnable. At that time Eric found a way to fix the
particular problem, which resulted in commit 4cd13c21b207e80d. Pinning
and other control is also possible from user space, causing more complex
tuning setups, and problems will be harder to debug.

In particular, after Eric's patch threaded NAPI proved to be no longer
useful, because his patch successfully deferred work to ksoftirqd more
reliably, thus allowing the UDP rx queue to get drained by user space.

> Still curious about the questions I asked in the other thread
> on what's causing it to be so much better than RPS

My guess is that RPS uses expensive IPI to notify the remote
softirq. The batching size on RPS depends on how many packets could get
worked on during one softirq invocation on the source CPU until we wake
up remote CPU(s!), if they are not constantly running.

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jesper Dangaard Brouer @ 2017-09-29  7:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <59CD83DD.4060603@iogearbox.net>

On Fri, 29 Sep 2017 01:21:01 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> [...]
> > +/* Convert xdp_buff to xdp_pkt */
> > +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> > +{
> > +	struct xdp_pkt *xdp_pkt;
> > +	int headroom;
> > +
> > +	/* Assure headroom is available for storing info */
> > +	headroom = xdp->data - xdp->data_hard_start;
> > +	if (headroom < sizeof(*xdp_pkt))
> > +		return NULL;
> > +
> > +	/* Store info in top of packet */
> > +	xdp_pkt = xdp->data_hard_start;  
> 
> (You'd also need to handle data_meta here if set, and for below
> cpu_map_build_skb(), e.g. headroom is data_meta-data_hard_start.)

I'll look into this.  The data_meta patchset was in-flight while I
rebased this.
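
A rough sketch of the headroom calculation Daniel describes, as a
userspace model. struct xdp_buff_model and xdp_usable_headroom() are
hypothetical names for illustration, and treating a data_meta pointer
that is NULL or lies past data as "not set" is an assumption of this
sketch, not something taken from the patch.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the relevant xdp_buff pointers */
struct xdp_buff_model {
	unsigned char *data_hard_start;
	unsigned char *data_meta;
	unsigned char *data;
	unsigned char *data_end;
};

/* Headroom that may be used for storing info in front of the packet.
 * When metadata is present it sits between data_hard_start and data,
 * so the usable headroom ends at data_meta instead of at data.
 */
static ptrdiff_t xdp_usable_headroom(const struct xdp_buff_model *xdp)
{
	const unsigned char *start = xdp->data;

	/* Assumption: NULL or a pointer past data means "no metadata" */
	if (xdp->data_meta && xdp->data_meta <= xdp->data)
		start = xdp->data_meta;

	return start - xdp->data_hard_start;
}
```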

> > +	xdp_pkt->data = xdp->data;
> > +	xdp_pkt->len  = xdp->data_end - xdp->data;
> > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> > +
> > +	return xdp_pkt;
> > +}
> > +
> > +static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> > +					 struct xdp_pkt *xdp_pkt)
> > +{
> > +	unsigned int frame_size;
> > +	void *pkt_data_start;
> > +	struct sk_buff *skb;
> > +
> > +	/* build_skb need to place skb_shared_info after SKB end, and
> > +	 * also want to know the memory "truesize".  Thus, need to  
> [...]
> >   static int cpu_map_kthread_run(void *data)
> >   {
> > +	const unsigned long busy_poll_jiffies = usecs_to_jiffies(2000);
> > +	unsigned long time_limit = jiffies + busy_poll_jiffies;
> >   	struct bpf_cpu_map_entry *rcpu = data;
> > +	unsigned int empty_cnt = 0;
> >
> >   	set_current_state(TASK_INTERRUPTIBLE);
> >   	while (!kthread_should_stop()) {
> > +		unsigned int processed = 0, drops = 0;
> >   		struct xdp_pkt *xdp_pkt;
> >
> > -		schedule();
> > -		/* Do work */
> > -		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > -			/* For now just "refcnt-free" */
> > -			page_frag_free(xdp_pkt);
> > +		/* Release CPU reschedule checks */
> > +		if ((time_after_eq(jiffies, time_limit) || empty_cnt > 25) &&
> > +		    __ptr_ring_empty(rcpu->queue)) {
> > +			empty_cnt++;
> > +			schedule();
> > +			time_limit = jiffies + busy_poll_jiffies;
> > +			WARN_ON(smp_processor_id() != rcpu->cpu);
> > +		} else {
> > +			cond_resched();
> >   		}
> > +
> > +		/* Process packets in rcpu->queue */
> > +		local_bh_disable();
> > +		/*
> > +		 * The bpf_cpu_map_entry is single consumer, with this
> > +		 * kthread CPU pinned. Lockless access to ptr_ring
> > +		 * consume side valid as no-resize allowed of queue.
> > +		 */
> > +		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
> > +			struct sk_buff *skb;
> > +			int ret;
> > +
> > +			/* Allow busy polling again */
> > +			empty_cnt = 0;
> > +
> > +			skb = cpu_map_build_skb(rcpu, xdp_pkt);
> > +			if (!skb) {
> > +				page_frag_free(xdp_pkt);
> > +				continue;
> > +			}
> > +
> > +			/* Inject into network stack */
> > +			ret = netif_receive_skb(skb);  
> 
> Have you looked into whether it's feasible to reuse GRO
> engine here as well?

This is the first step. I'll work on adding the GRO-engine later. And
it should be feasible.  There are plenty of optimizations in this area
that can be done later ;-)

> 
> > +			if (ret == NET_RX_DROP)
> > +				drops++;
> > +
> > +			/* Limit BH-disable period */
> > +			if (++processed == 8)
> > +				break;
> > +		}
> > +		local_bh_enable();
> > +
> >   		__set_current_state(TASK_INTERRUPTIBLE);
> >   	}
> >   	put_cpu_map_entry(rcpu);  
> [...]



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:27 UTC (permalink / raw)
  To: Yang, Yi
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <20170929071553.GA19053-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

> > The optimization Yi refers to only affects the slow path translation.
> >
> > OVS 2.8 does not immediately trigger an immediate recirculation after translating
> > encap(nsh,...). There is no need to do so as the flow key of the resulting packet
> > can be determined from the encap() action and its properties. Translation
> > continues with the rewritten flow key and subsequent OpenFlow actions will
> > typically set the new fields in the new NSH header. The push_nsh datapath action
> > (including all NSH header fields) is only generated at the next commit, e.g. for
> > output, cloning, recirculation, encap/decap or another destructive change of
> > the flow key.
> >
> > The implementation of push_nsh in the user-space datapath does not update
> > the miniflow (key) of the packet, only the packet data and some metadata.
> > If the packet needs to be looked up again the slow path triggers recirculation
> > to re-parse the packet. There should be no need for the datapath push_nsh
> > action to try to update the flow key.
> 
> Thanks Jan for clarification, it can still work after removing that
> line, our flows didn't match it after push_nsh, it is output to
> VxLAN-gpe port after push_nsh, I'm not sure if we can match dl_type and NSH
> fields if we don't output and don't recirculate.

No worries, a packet cannot be matched again in the datapath unless it is
recirculated. And recirculation today always implies re-parsing.

In the future we want to look into possibilities to optimize the performance
of recirculation, for example by skipping the parsing stage if it is
unnecessary. For that we may need to invalidate the flow key in packet
metadata when the packet is modified without a corresponding update of the
key itself. But that is music of the future.

/Jan

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-29  7:15 UTC (permalink / raw)
  To: Jan Scheurich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <CFF8EF42F1132E4CBE2BF0AB6C21C58D7881A337-hqolJogE5njKJFWPz4pdheaU1rCVNFv4@public.gmane.org>

On Fri, Sep 29, 2017 at 07:10:52AM +0000, Jan Scheurich wrote:
> > From: Yang, Yi [mailto:yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
> > Sent: Friday, 29 September, 2017 08:41
> > To: Pravin Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org>
> > Cc: Jiri Benc <jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org; e@erig.me; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org; Jan Scheurich
> > <jan.scheurich-IzeFyvvaP7pWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> > 
> > On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > > >> > in pop_nsh.
> > > >>
> > > >> This seems to be a very different approach than what we currently have.
> > > >> Looking at the code, the requirement after "destructive" actions such
> > > >> as pushing or popping headers is to recirculate.
> > > >
> > > > This is optimization proposed by Jan Scheurich, recurculating after push_nsh
> > > > will impact on performance, recurculating after pop_nsh is unavoidable, So
> > > > also cc jan.scheurich-IzeFyvvaP7oU04JRNCRQjg@public.gmane.org
> > > >
> > > > Actucally all the keys before push_nsh are still there after push_nsh,
> > > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > > >
> > >
> > >
> > > We should keep existing model for this patch. Later you can submit
> > > optimization patch with specific use cases and performance
> > > improvement. So that we can evaluate code complexity and benefits.
> > 
> > Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> > 
> > 	key->eth.type = htons(ETH_P_NSH);
> 
> The optimization Yi refers to only affects the slow path translation. 
> 
> OVS 2.8 does not immediately trigger an immediate recirculation after translating 
> encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
> can be determined from the encap() action and its properties. Translation 
> continues with the rewritten flow key and subsequent OpenFlow actions will 
> typically set the new fields in the new NSH header. The push_nsh datapath action 
> (including all NSH header fields) is only generated at the next commit, e.g. for 
> output, cloning, recirculation, encap/decap or another destructive change of 
> the flow key.
> 
> The implementation of push_nsh in the user-space datapath does not update
> the miniflow (key) of the packet, only the packet data and some metadata. 
> If the packet needs to be looked up again the slow path triggers recirculation
> to re-parse the packet. There should be no need for the datapath push_nsh 
> action to try to update the flow key.

Thanks, Jan, for the clarification. It can still work after removing that
line: our flows didn't match on it after push_nsh, because the packet is
output to a VxLAN-gpe port after push_nsh. I'm not sure if we can match
dl_type and NSH fields if we don't output and don't recirculate.

> 
> BR, Jan

^ permalink raw reply

* RE: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:10 UTC (permalink / raw)
  To: Yang, Yi, Pravin Shelar
  Cc: Jiri Benc, netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net
In-Reply-To: <20170929064058.GA16145@localhost.localdomain>

> From: Yang, Yi [mailto:yi.y.yang@intel.com]
> Sent: Friday, 29 September, 2017 08:41
> To: Pravin Shelar <pshelar@ovn.org>
> Cc: Jiri Benc <jbenc@redhat.com>; netdev@vger.kernel.org; dev@openvswitch.org; e@erig.me; davem@davemloft.net; Jan Scheurich
> <jan.scheurich@ericsson.com>
> Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> 
> On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang@intel.com> wrote:
> > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > >> > in pop_nsh.
> > >>
> > >> This seems to be a very different approach than what we currently have.
> > >> Looking at the code, the requirement after "destructive" actions such
> > >> as pushing or popping headers is to recirculate.
> > >
> > > This is optimization proposed by Jan Scheurich, recurculating after push_nsh
> > > will impact on performance, recurculating after pop_nsh is unavoidable, So
> > > also cc jan.scheurich@ericsson.com.
> > >
> > > Actucally all the keys before push_nsh are still there after push_nsh,
> > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > >
> >
> >
> > We should keep existing model for this patch. Later you can submit
> > optimization patch with specific use cases and performance
> > improvement. So that we can evaluate code complexity and benefits.
> 
> Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> 
> 	key->eth.type = htons(ETH_P_NSH);

The optimization Yi refers to only affects the slow path translation. 

OVS 2.8 does not trigger an immediate recirculation after translating
encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
can be determined from the encap() action and its properties. Translation 
continues with the rewritten flow key and subsequent OpenFlow actions will 
typically set the new fields in the new NSH header. The push_nsh datapath action 
(including all NSH header fields) is only generated at the next commit, e.g. for 
output, cloning, recirculation, encap/decap or another destructive change of 
the flow key.

The implementation of push_nsh in the user-space datapath does not update
the miniflow (key) of the packet, only the packet data and some metadata. 
If the packet needs to be looked up again the slow path triggers recirculation
to re-parse the packet. There should be no need for the datapath push_nsh 
action to try to update the flow key.

BR, Jan

^ permalink raw reply
