Netdev List
 help / color / mirror / Atom feed
* Re: [net] ixgbevf: Fix secpath usage for IPsec Tx offload
From: David Miller @ 2019-09-13 13:53 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, snelson, jonathan
In-Reply-To: <20190912190734.10560-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 12 Sep 2019 12:07:34 -0700

> Port the same fix for ixgbe to ixgbevf.
> 
> The ixgbevf driver currently does IPsec Tx offloading
> based on an existing secpath. However, the secpath
> can also come from the Rx side, in this case it is
> misinterpreted for Tx offload and the packets are
> dropped with a "bad sa_idx" error. Fix this by using
> the xfrm_offload() function to test for Tx offload.
> 
> CC: Shannon Nelson <snelson@pensando.io>
> Fixes: 7f68d4306701 ("ixgbevf: enable VF IPsec offload operations")
> Reported-by: Jonathan Tooker <jonathan@reliablehosting.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied, and like the ixgbe version of this fix I queued it up for -stable.

Thanks.

^ permalink raw reply

* Re: [PATCH v4 1/2] PTP: introduce new versions of IOCTLs
From: David Miller @ 2019-09-13 13:57 UTC (permalink / raw)
  To: felipe.balbi; +Cc: richardcochran, christopher.s.hall, netdev, linux-kernel
In-Reply-To: <20190911061622.774006-1-felipe.balbi@linux.intel.com>

From: Felipe Balbi <felipe.balbi@linux.intel.com>
Date: Wed, 11 Sep 2019 09:16:21 +0300

> The current version of the IOCTL have a small problem which prevents us
> from extending the API by making use of reserved fields. In these new
> IOCTLs, we are now making sure that flags and rsv fields are zero which
> will allow us to extend the API in the future.
> 
> Reviewed-by: Richard Cochran <richardcochran@gmail.com>
> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH v4 2/2] PTP: add support for one-shot output
From: David Miller @ 2019-09-13 13:57 UTC (permalink / raw)
  To: felipe.balbi; +Cc: richardcochran, christopher.s.hall, netdev, linux-kernel
In-Reply-To: <20190911061622.774006-2-felipe.balbi@linux.intel.com>

From: Felipe Balbi <felipe.balbi@linux.intel.com>
Date: Wed, 11 Sep 2019 09:16:22 +0300

> Some controllers allow for a one-shot output pulse, in contrast to
> periodic output. Now that we have extensible versions of our IOCTLs, we
> can finally make use of the 'flags' field to pass a bit telling driver
> that if we want one-shot pulse output.
> 
> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH] brcmsmac: Use DIV_ROUND_CLOSEST directly to make it readable
From: Kalle Valo @ 2019-09-13 14:20 UTC (permalink / raw)
  To: zhong jiang; +Cc: davem, zhongjiang, linux-wireless, netdev, linux-kernel
In-Reply-To: <1567700648-28162-1-git-send-email-zhongjiang@huawei.com>

zhong jiang <zhongjiang@huawei.com> wrote:

> The kernel.h macro DIV_ROUND_CLOSEST performs the computation (x + d/2)/d
> but is perhaps more readable.
> 
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>

Patch applied to wireless-drivers-next.git, thanks.

3dfb22003f98 brcmsmac: Use DIV_ROUND_CLOSEST directly to make it readable

-- 
https://patchwork.kernel.org/patch/11133663/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply

* Re: [PATCH] libertas: use mesh_wdev->ssid instead of priv->mesh_ssid
From: Kalle Valo @ 2019-09-13 14:24 UTC (permalink / raw)
  To: Lubomir Rintel
  Cc: libertas-dev, linux-wireless, netdev, linux-kernel,
	Lubomir Rintel
In-Reply-To: <20190907151855.2637984-1-lkundrak@v3.sk>

Lubomir Rintel <lkundrak@v3.sk> wrote:

> With the commit e86dc1ca4676 ("Libertas: cfg80211 support") we've lost
> the ability to actually set the Mesh SSID from userspace.
> NL80211_CMD_SET_INTERFACE with NL80211_ATTR_MESH_ID sets the mesh point
> interface's ssid field. Let's use that one for the Libertas Mesh
> operation
> 
> Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>

Patch applied to wireless-drivers-next.git, thanks.

2199c9817670 libertas: use mesh_wdev->ssid instead of priv->mesh_ssid

-- 
https://patchwork.kernel.org/patch/11136509/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply

* Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
From: Thomas Higdon @ 2019-09-13 14:29 UTC (permalink / raw)
  To: Dave Taht
  Cc: Neal Cardwell, netdev@vger.kernel.org, Jonathan Lemon, Dave Jones,
	Eric Dumazet, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <CAA93jw7q71mpenRMD0dWiVNap1WKD6O4+GCBagcPa5OhHTMErw@mail.gmail.com>

On Thu, Sep 12, 2019 at 10:14:33AM +0100, Dave Taht wrote:
> On Thu, Sep 12, 2019 at 1:59 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon <tph@fb.com> wrote:
> > >
> > > Neal Cardwell mentioned that rcv_wnd would be useful for helping
> > > diagnose whether a flow is receive-window-limited at a given instant.
> > >
> > > This serves the purpose of adding an additional __u32 to avoid the
> > > would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
> > >
> > > Signed-off-by: Thomas Higdon <tph@fb.com>
> > > ---
> >
> > Thanks, Thomas.
> >
> > I know that when I mentioned this before I mentioned the idea of both
> > tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side
> > receive window) in tcp_info, and did not express a preference between
> > the two. Now that we are faced with a decision between the two,
> > personally I think it would be a little more useful to start with
> > tp->snd_wnd. :-)
> >
> > Two main reasons:
> >
> > (1) Usually when we're diagnosing TCP performance problems, we do so
> > from the sender, since the sender makes most of the
> > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> > From the sender-side the thing that would be most useful is to see
> > tp->snd_wnd, the receive window that the receiver has advertised to
> > the sender.
> 
> I am under the impression, that particularly in the mobile space, that
> network behavior
> is often governed by rcv_wnd. At least, there's been so many papers on
> this that I'd
> tended to assume so.
> 
> Given a desire to do both vars, is there a *third* u32 we could add to
> fill in the next hole? :)
> ecn marks?

Neal makes some good points -- there is a fair amount of existing
information for deriving receive window. It seems like snd_wnd would be
more valuable at this moment. For the purpose of pairing up these __u32s
to get something we can commit, I propose that we go with
the rcv_ooopack/snd_wnd pair for now, and when something comes up later,
one might consider pairing up rcv_wnd.

^ permalink raw reply

* Re: [PATCH 2/2] dt-bindings: net: dwmac: document 'mac-mode' property
From: Rob Herring @ 2019-09-13 14:36 UTC (permalink / raw)
  To: Alexandru Ardelean
  Cc: netdev, devicetree, linux-kernel, davem, peppe.cavallaro,
	alexandre.torgue, --cc=andrew
In-Reply-To: <20190906130256.10321-2-alexandru.ardelean@analog.com>

On Fri, Sep 06, 2019 at 04:02:56PM +0300, Alexandru Ardelean wrote:
> This change documents the 'mac-mode' property that was introduced in the
> 'stmmac' driver to support passive mode converters that can sit in-between
> the MAC & PHY.
> 
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
> ---
>  Documentation/devicetree/bindings/net/snps,dwmac.yaml | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
> index c78be15704b9..ebe4537a7cce 100644
> --- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml
> +++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
> @@ -112,6 +112,14 @@ properties:
>    reset-names:
>      const: stmmaceth
>  
> +  mac-mode:
> +    maxItems: 1

Is this an array because {min,max}Items is for arrays? It should be 
defined as a string with possible values.

As this property is the same as another, you can do this:

$ref: ethernet-controller.yaml#/properties/phy-connection-type

Unless only a small subset of those values are valid here, then you may 
want to list them here.

> +    description:
> +      The property is identical to 'phy-mode', and assumes that there is mode
> +      converter in-between the MAC & PHY (e.g. GMII-to-RGMII). This converter
> +      can be passive (no SW requirement), and requires that the MAC operate
> +      in a different mode than the PHY in order to function.
> +
>    snps,axi-config:
>      $ref: /schemas/types.yaml#definitions/phandle
>      description:
> -- 
> 2.20.1
> 


^ permalink raw reply

* Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
From: Neal Cardwell @ 2019-09-13 14:37 UTC (permalink / raw)
  To: Thomas Higdon
  Cc: Dave Taht, netdev@vger.kernel.org, Jonathan Lemon, Dave Jones,
	Eric Dumazet, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190913142936.GA84687@tph-mbp>

On Fri, Sep 13, 2019 at 10:29 AM Thomas Higdon <tph@fb.com> wrote:
>
> On Thu, Sep 12, 2019 at 10:14:33AM +0100, Dave Taht wrote:
> > On Thu, Sep 12, 2019 at 1:59 AM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon <tph@fb.com> wrote:
> > > >
> > > > Neal Cardwell mentioned that rcv_wnd would be useful for helping
> > > > diagnose whether a flow is receive-window-limited at a given instant.
> > > >
> > > > This serves the purpose of adding an additional __u32 to avoid the
> > > > would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
> > > >
> > > > Signed-off-by: Thomas Higdon <tph@fb.com>
> > > > ---
> > >
> > > Thanks, Thomas.
> > >
> > > I know that when I mentioned this before I mentioned the idea of both
> > > tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side
> > > receive window) in tcp_info, and did not express a preference between
> > > the two. Now that we are faced with a decision between the two,
> > > personally I think it would be a little more useful to start with
> > > tp->snd_wnd. :-)
> > >
> > > Two main reasons:
> > >
> > > (1) Usually when we're diagnosing TCP performance problems, we do so
> > > from the sender, since the sender makes most of the
> > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> > > From the sender-side the thing that would be most useful is to see
> > > tp->snd_wnd, the receive window that the receiver has advertised to
> > > the sender.
> >
> > I am under the impression, that particularly in the mobile space, that
> > network behavior
> > is often governed by rcv_wnd. At least, there's been so many papers on
> > this that I'd
> > tended to assume so.
> >
> > Given a desire to do both vars, is there a *third* u32 we could add to
> > fill in the next hole? :)
> > ecn marks?
>
> Neal makes some good points -- there is a fair amount of existing
> information for deriving receive window. It seems like snd_wnd would be
> more valuable at this moment. For the purpose of pairing up these __u32s
> to get something we can commit, I propose that we go with
> the rcv_ooopack/snd_wnd pair for now, and when something comes up later,
> one might consider pairing up rcv_wnd.

FWIW that sounds like a great plan to me. Thanks, Thomas!

neal

^ permalink raw reply

* net: phy: micrel KSZ9031 ifdown ifup issue
From: Paul Thomas @ 2019-09-13 14:42 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S. Miller,
	netdev

Hello,

I think I'm seeing an issue with the PHY hardware or PHY driver. What
happens is sometimes (but not always) when I do 'ip link set eth0
down' followed by 'ip link set eth0 up' I don't ever see an
auto-negotiation again. LEDs don't come on, ethtool reports 'Link
detected: no'. Even physically unplugging and plugging the network
cable doesn't bring it back. I have to do a reboot to get the
networking back.

When the networking is started I don't see any issue forcing
negotiations by unplugging and plugging the cable. I get standard
messages like this all day long:
[   21.031793] 003: macb ff0b0000.ethernet eth0: link down
[   26.142835] 003: macb ff0b0000.ethernet eth0: link up (1000/Full)

One thing that makes me think this is the PHY is that we have another
Ethernet port using the DP83867 PHY and I can always do ifdown/ifup
with it.

This is using a 5.2.10 kernel on arm64 zynqmp platform with the macb driver.

Is this something anyone else has seen? I know there is some Errata
with this part, but I'm hoping there is something to fix or work
around this. Any thoughts on where to look or add debugging would
appreciated.

thanks,
Paul

^ permalink raw reply

* Re: [PATCH net] udp: correct reuseport selection with connected sockets
From: Paolo Abeni @ 2019-09-13 14:47 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, edumazet, kraig, zabele, mark.keaton, Willem de Bruijn
In-Reply-To: <20190913011639.55895-1-willemdebruijn.kernel@gmail.com>

On Thu, 2019-09-12 at 21:16 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> UDP reuseport groups can hold a mix unconnected and connected sockets.
> Ensure that connections only receive all traffic to their 4-tuple.
> 
> Fast reuseport returns on the first reuseport match on the assumption
> that all matches are equal. Only if connections are present, return to
> the previous behavior of scoring all sockets.
> 
> Record if connections are present and if so (1) treat such connected
> sockets as an independent match from the group, (2) only return
> 2-tuple matches from reuseport and (3) do not return on the first
> 2-tuple reuseport match to allow for a higher scoring match later.
> 
> New field has_conns is set without locks. No other fields in the
> bitmap are modified at runtime and the field is only ever set
> unconditionally, so an RMW cannot miss a change.
> 
> Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
> Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> I was unable to compile some older kernels, so the Fixes tag is based
> on basic analysis, not bisected to by the regression test.
> ---
>  include/net/sock_reuseport.h | 20 +++++++++++++++++++-
>  net/core/sock_reuseport.c    | 15 +++++++++++++--
>  net/ipv4/datagram.c          |  2 ++
>  net/ipv4/udp.c               |  5 +++--
>  net/ipv6/datagram.c          |  2 ++
>  net/ipv6/udp.c               |  5 +++--
>  6 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> index d9112de85261..43f4a818d88f 100644
> --- a/include/net/sock_reuseport.h
> +++ b/include/net/sock_reuseport.h
> @@ -21,7 +21,8 @@ struct sock_reuseport {
>  	unsigned int		synq_overflow_ts;
>  	/* ID stays the same even after the size of socks[] grows. */
>  	unsigned int		reuseport_id;
> -	bool			bind_inany;
> +	unsigned int		bind_inany:1;
> +	unsigned int		has_conns:1;
>  	struct bpf_prog __rcu	*prog;		/* optional BPF sock selector */
>  	struct sock		*socks[0];	/* array of sock pointers */
>  };
> @@ -37,6 +38,23 @@ extern struct sock *reuseport_select_sock(struct sock *sk,
>  extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
>  extern int reuseport_detach_prog(struct sock *sk);
>  
> +static inline bool reuseport_has_conns(struct sock *sk, bool set)
> +{
> +	struct sock_reuseport *reuse;
> +	bool ret = false;
> +
> +	rcu_read_lock();
> +	reuse = rcu_dereference(sk->sk_reuseport_cb);
> +	if (reuse) {
> +		if (set)
> +			reuse->has_conns = 1;
> +		ret = reuse->has_conns;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  int reuseport_get_id(struct sock_reuseport *reuse);
>  
>  #endif  /* _SOCK_REUSEPORT_H */
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index 9408f9264d05..f3ceec93f392 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -295,8 +295,19 @@ struct sock *reuseport_select_sock(struct sock *sk,
>  
>  select_by_hash:
>  		/* no bpf or invalid bpf result: fall back to hash usage */
> -		if (!sk2)
> -			sk2 = reuse->socks[reciprocal_scale(hash, socks)];
> +		if (!sk2) {
> +			int i, j;
> +
> +			i = j = reciprocal_scale(hash, socks);
> +			while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
> +				i++;
> +				if (i >= reuse->num_socks)
> +					i = 0;
> +				if (i == j)
> +					goto out;
> +			}
> +			sk2 = reuse->socks[i];
> +		}
>  	}
>  
>  out:
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index 7bd29e694603..9a0fe0c2fa02 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -15,6 +15,7 @@
>  #include <net/sock.h>
>  #include <net/route.h>
>  #include <net/tcp_states.h>
> +#include <net/sock_reuseport.h>
>  
>  int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  {
> @@ -69,6 +70,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
>  	}
>  	inet->inet_daddr = fl4->daddr;
>  	inet->inet_dport = usin->sin_port;
> +	reuseport_has_conns(sk, true);
>  	sk->sk_state = TCP_ESTABLISHED;
>  	sk_set_txhash(sk);
>  	inet->inet_id = jiffies;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index d88821c794fb..16486c8b708b 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -423,12 +423,13 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  		score = compute_score(sk, net, saddr, sport,
>  				      daddr, hnum, dif, sdif);
>  		if (score > badness) {
> -			if (sk->sk_reuseport) {
> +			if (sk->sk_reuseport &&
> +			    sk->sk_state != TCP_ESTABLISHED) {
>  				hash = udp_ehashfn(net, daddr, hnum,
>  						   saddr, sport);
>  				result = reuseport_select_sock(sk, hash, skb,
>  							sizeof(struct udphdr));
> -				if (result)
> +				if (result && !reuseport_has_conns(sk, false))
>  					return result;
>  			}
>  			badness = score;
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index 9ab897ded4df..96f939248d2f 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -27,6 +27,7 @@
>  #include <net/ip6_route.h>
>  #include <net/tcp_states.h>
>  #include <net/dsfield.h>
> +#include <net/sock_reuseport.h>
>  
>  #include <linux/errqueue.h>
>  #include <linux/uaccess.h>
> @@ -254,6 +255,7 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
>  		goto out;
>  	}
>  
> +	reuseport_has_conns(sk, true);
>  	sk->sk_state = TCP_ESTABLISHED;
>  	sk_set_txhash(sk);
>  out:
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 827fe7385078..5995fdc99d3f 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -158,13 +158,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
>  		score = compute_score(sk, net, saddr, sport,
>  				      daddr, hnum, dif, sdif);
>  		if (score > badness) {
> -			if (sk->sk_reuseport) {
> +			if (sk->sk_reuseport &&
> +			    sk->sk_state != TCP_ESTABLISHED) {
>  				hash = udp6_ehashfn(net, daddr, hnum,
>  						    saddr, sport);
>  
>  				result = reuseport_select_sock(sk, hash, skb,
>  							sizeof(struct udphdr));
> -				if (result)
> +				if (result && !reuseport_has_conns(sk, false))
>  					return result;
>  			}
>  			result = sk;

The patch LGTM,

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-09-13 14:50 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Michal Kubecek, netdev, David Miller, Jakub Kicinski, David Ahern,
	Stephen Hemminger, dcbw, Andrew Lunn, parav, Saeed Mahameed,
	mlxsw
In-Reply-To: <20190912115942.GC7621@nanopsycho.orion>

Thu, Sep 12, 2019 at 01:59:42PM CEST, jiri@resnulli.us wrote:
>Fri, Aug 30, 2019 at 07:03:42PM CEST, jiri@resnulli.us wrote:
>>Fri, Aug 30, 2019 at 04:35:23PM CEST, roopa@cumulusnetworks.com wrote:
>
>[...]
>
>>>
>>>so to summarize, i think we have discussed the following options to
>>>update a netlink list attribute so far:
>>>(a) encode an optional attribute/flag in the list attribute in
>>>RTM_SETLINK to indicate if it is a add or del
>>>(b) Use a flag in RTM_SETLINK and RTM_DELINK to indicate add/del
>>>(close to bridge vlan add/del)
>>
>>Nope, bridge vlan add/del is done according to the cmd, not any flag.
>>
>>
>>>(c) introduce a separate generic msg type to add/del to a list
>>>attribute (IIUC this does need a separate msg type per subsystem or
>>>netlink API)
>
>Getting back to this, sorry.
>
>Thinking about it for some time, a,b,c have all their issues. Why can't
>we have another separate cmd as I originally proposed in this RFC? Does
>anyone have any argument against it? Could you please describe?
>
>Because otherwise, I don't feel comfortable going to any of a,b,c :(

How about this:

We have new commands, but we have them for lists of many types. In my
examples, I'm using "altname" and "color". This is very similar to
bridge vlan example Roopa pointed out. It scales and does not pollute
existing setlink/getlink messages. Also, the cmdline is aligned:


$ ip link property add eth0 altname someverylongname altname someotherlongname

$ ip link property add eth0 altname someotherveryverylongname color blue

$ ip link property del eth0 altname someverylongname color blue

$ ip link property add eth0 color red

$ ip link property show eth0
2: eth0: altname someotherlongname altname someotherveryverylongname color red

$ ip -j -p link property show eth0
[ {
        "ifindex": 2,
        "ifname": "eth0",
        "property": [ "altname": "someotherlongname",
	              "altname": "someotherveryverylongname",
	              "color": "red"]
    } ]

I call this "property" but I'm open to any other naming.
 





^ permalink raw reply

* Re: [PATCH 1/3] rtlwifi: rtl8192ce: replace _rtl92c_evm_db_to_percentage with generic version
From: Kalle Valo @ 2019-09-13 15:07 UTC (permalink / raw)
  To: Michael Straube
  Cc: pkshih, davem, linux-wireless, netdev, linux-kernel,
	Michael Straube
In-Reply-To: <20190910190422.63378-2-straube.linux@gmail.com>

Michael Straube <straube.linux@gmail.com> wrote:

> Function _rtl92c_evm_db_to_percentage is functionally identical
> to the generic version rtl_evm_db_to_percentage, so remove
> _rtl92c_evm_db_to_percentage and use the generic version instead.
> 
> Signed-off-by: Michael Straube <straube.linux@gmail.com>

3 patches applied to wireless-drivers-next.git, thanks.

1335ad27bd07 rtlwifi: rtl8192ce: replace _rtl92c_evm_db_to_percentage with generic version
622c19ed3607 rtlwifi: rtl8192cu: replace _rtl92c_evm_db_to_percentage with generic version
3a1f85798e9f rtlwifi: rtl8192de: replace _rtl92d_evm_db_to_percentage with generic version

-- 
https://patchwork.kernel.org/patch/11140005/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply

* Re: [PATCH net-next] MAINTAINERS: xen-netback: update my email address
From: Wei Liu @ 2019-09-13 15:28 UTC (permalink / raw)
  To: Paul Durrant; +Cc: netdev, xen-devel, Wei Liu
In-Reply-To: <20190913124727.3277-1-paul.durrant@citrix.com>

On Fri, 13 Sep 2019 at 13:47, Paul Durrant <paul.durrant@citrix.com> wrote:
>
> My Citrix email address will expire shortly.
>
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>

Acked-by: Wei Liu <wl@xen.org>

^ permalink raw reply

* [PATCH net-next 0/3] More fixes for unlocked cls hardware offload API refactoring
From: Vlad Buslov @ 2019-09-13 15:28 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, Vlad Buslov

Two fixes for my "Refactor cls hardware offload API to support
rtnl-independent drivers" series and refactoring patch that implements
infrastructure necessary for the fixes.

Vlad Buslov (3):
  net: sched: extend flow_action_entry with destructor
  net: sched: take reference to psample group in flow_action infra
  net: sched: use get_dev() action API in flow_action infra

 include/net/act_api.h          |  9 +++-
 include/net/flow_offload.h     |  6 ++-
 include/net/psample.h          |  1 +
 include/net/tc_act/tc_sample.h |  6 ---
 net/psample/psample.c          | 20 +++++---
 net/sched/act_mirred.c         | 21 +++++----
 net/sched/act_sample.c         | 27 +++++++++++
 net/sched/cls_api.c            | 83 ++++++++++++++++++++--------------
 8 files changed, 116 insertions(+), 57 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [PATCH net-next 1/3] net: sched: extend flow_action_entry with destructor
From: Vlad Buslov @ 2019-09-13 15:28 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, Vlad Buslov, Jiri Pirko
In-Reply-To: <20190913152841.15755-1-vladbu@mellanox.com>

Generalize flow_action_entry cleanup by extending the structure with
pointer to destructor function. Set the destructor in
tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call
entry->destructor() instead of using switch that dispatches by entry->id
and manually executes cleanup.

This refactoring is necessary for following patches in this series that
require destructor to use tc_action->ops callbacks that can't be easily
obtained in tc_cleanup_flow_action().

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/flow_offload.h |  6 ++-
 net/sched/cls_api.c        | 77 ++++++++++++++++++++++----------------
 2 files changed, 50 insertions(+), 33 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index fc881875f856..86c567f531f3 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -154,8 +154,12 @@ enum flow_action_mangle_base {
 	FLOW_ACT_MANGLE_HDR_TYPE_UDP,
 };
 
+typedef void (*action_destr)(void *priv);
+
 struct flow_action_entry {
 	enum flow_action_id		id;
+	action_destr			destructor;
+	void				*destructor_priv;
 	union {
 		u32			chain_index;	/* FLOW_ACTION_GOTO */
 		struct net_device	*dev;		/* FLOW_ACTION_REDIRECT */
@@ -170,7 +174,7 @@ struct flow_action_entry {
 			u32		mask;
 			u32		val;
 		} mangle;
-		const struct ip_tunnel_info *tunnel;	/* FLOW_ACTION_TUNNEL_ENCAP */
+		struct ip_tunnel_info	*tunnel;	/* FLOW_ACTION_TUNNEL_ENCAP */
 		u32			csum_flags;	/* FLOW_ACTION_CSUM */
 		u32			mark;		/* FLOW_ACTION_MARK */
 		u16                     ptype;          /* FLOW_ACTION_PTYPE */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 05c4fe1c3ca2..c668195379bd 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3282,25 +3282,48 @@ void tc_cleanup_flow_action(struct flow_action *flow_action)
 	struct flow_action_entry *entry;
 	int i;
 
-	flow_action_for_each(i, entry, flow_action) {
-		switch (entry->id) {
-		case FLOW_ACTION_REDIRECT:
-		case FLOW_ACTION_MIRRED:
-		case FLOW_ACTION_REDIRECT_INGRESS:
-		case FLOW_ACTION_MIRRED_INGRESS:
-			if (entry->dev)
-				dev_put(entry->dev);
-			break;
-		case FLOW_ACTION_TUNNEL_ENCAP:
-			kfree(entry->tunnel);
-			break;
-		default:
-			break;
-		}
-	}
+	flow_action_for_each(i, entry, flow_action)
+		if (entry->destructor)
+			entry->destructor(entry->destructor_priv);
 }
 EXPORT_SYMBOL(tc_cleanup_flow_action);
 
+static void tcf_mirred_put_dev(void *priv)
+{
+	struct net_device *dev = priv;
+
+	dev_put(dev);
+}
+
+static void tcf_mirred_get_dev(struct flow_action_entry *entry,
+			       const struct tc_action *act)
+{
+	entry->dev = tcf_mirred_dev(act);
+	if (!entry->dev)
+		return;
+	dev_hold(entry->dev);
+	entry->destructor = tcf_mirred_put_dev;
+	entry->destructor_priv = entry->dev;
+}
+
+static void tcf_tunnel_encap_put_tunnel(void *priv)
+{
+	struct ip_tunnel_info *tunnel = priv;
+
+	kfree(tunnel);
+}
+
+static int tcf_tunnel_encap_get_tunnel(struct flow_action_entry *entry,
+				       const struct tc_action *act)
+{
+	entry->tunnel = tcf_tunnel_info_copy(act);
+	if (!entry->tunnel)
+		return -ENOMEM;
+	entry->destructor = tcf_tunnel_encap_put_tunnel;
+	entry->destructor_priv = entry->tunnel;
+	return 0;
+}
+
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts, bool rtnl_held)
 {
@@ -3329,24 +3352,16 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			entry->chain_index = tcf_gact_goto_chain_index(act);
 		} else if (is_tcf_mirred_egress_redirect(act)) {
 			entry->id = FLOW_ACTION_REDIRECT;
-			entry->dev = tcf_mirred_dev(act);
-			if (entry->dev)
-				dev_hold(entry->dev);
+			tcf_mirred_get_dev(entry, act);
 		} else if (is_tcf_mirred_egress_mirror(act)) {
 			entry->id = FLOW_ACTION_MIRRED;
-			entry->dev = tcf_mirred_dev(act);
-			if (entry->dev)
-				dev_hold(entry->dev);
+			tcf_mirred_get_dev(entry, act);
 		} else if (is_tcf_mirred_ingress_redirect(act)) {
 			entry->id = FLOW_ACTION_REDIRECT_INGRESS;
-			entry->dev = tcf_mirred_dev(act);
-			if (entry->dev)
-				dev_hold(entry->dev);
+			tcf_mirred_get_dev(entry, act);
 		} else if (is_tcf_mirred_ingress_mirror(act)) {
 			entry->id = FLOW_ACTION_MIRRED_INGRESS;
-			entry->dev = tcf_mirred_dev(act);
-			if (entry->dev)
-				dev_hold(entry->dev);
+			tcf_mirred_get_dev(entry, act);
 		} else if (is_tcf_vlan(act)) {
 			switch (tcf_vlan_action(act)) {
 			case TCA_VLAN_ACT_PUSH:
@@ -3370,11 +3385,9 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			}
 		} else if (is_tcf_tunnel_set(act)) {
 			entry->id = FLOW_ACTION_TUNNEL_ENCAP;
-			entry->tunnel = tcf_tunnel_info_copy(act);
-			if (!entry->tunnel) {
-				err = -ENOMEM;
+			err = tcf_tunnel_encap_get_tunnel(entry, act);
+			if (err)
 				goto err_out;
-			}
 		} else if (is_tcf_tunnel_release(act)) {
 			entry->id = FLOW_ACTION_TUNNEL_DECAP;
 		} else if (is_tcf_pedit(act)) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 3/3] net: sched: use get_dev() action API in flow_action infra
From: Vlad Buslov @ 2019-09-13 15:28 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, Vlad Buslov, Jiri Pirko
In-Reply-To: <20190913152841.15755-1-vladbu@mellanox.com>

When filling in hardware intermediate representation tc_setup_flow_action()
directly obtains, checks and takes reference to dev used by mirred action,
instead of using act->ops->get_dev() API created specifically for this
purpose. In order to remove code duplication, refactor flow_action infra to
use action API when obtaining mirred action target dev. Extend get_dev()
with additional argument that is used to provide dev destructor to the
user.

Fixes: 5a6ff4b13d59 ("net: sched: take reference to action dev before calling offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/act_api.h  |  4 ++--
 net/sched/act_mirred.c | 21 +++++++++++++--------
 net/sched/cls_api.c    | 13 +++----------
 3 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 4be8b0daedf0..b18c699681ca 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -101,8 +101,8 @@ struct tc_action_ops {
 			struct netlink_ext_ack *);
 	void	(*stats_update)(struct tc_action *, u64, u32, u64, bool);
 	size_t  (*get_fill_size)(const struct tc_action *act);
-	struct net_device *(*get_dev)(const struct tc_action *a);
-	void	(*put_dev)(struct net_device *dev);
+	struct net_device *(*get_dev)(const struct tc_action *a,
+				      tc_action_priv_destructor *destructor);
 	struct psample_group *
 	(*get_psample_group)(const struct tc_action *a,
 			     tc_action_priv_destructor *destructor);
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 9d1bf508075a..9ce073a05414 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -408,25 +408,31 @@ static struct notifier_block mirred_device_notifier = {
 	.notifier_call = mirred_device_event,
 };
 
-static struct net_device *tcf_mirred_get_dev(const struct tc_action *a)
+static void tcf_mirred_dev_put(void *priv)
+{
+	struct net_device *dev = priv;
+
+	dev_put(dev);
+}
+
+static struct net_device *
+tcf_mirred_get_dev(const struct tc_action *a,
+		   tc_action_priv_destructor *destructor)
 {
 	struct tcf_mirred *m = to_mirred(a);
 	struct net_device *dev;
 
 	rcu_read_lock();
 	dev = rcu_dereference(m->tcfm_dev);
-	if (dev)
+	if (dev) {
 		dev_hold(dev);
+		*destructor = tcf_mirred_dev_put;
+	}
 	rcu_read_unlock();
 
 	return dev;
 }
 
-static void tcf_mirred_put_dev(struct net_device *dev)
-{
-	dev_put(dev);
-}
-
 static size_t tcf_mirred_get_fill_size(const struct tc_action *act)
 {
 	return nla_total_size(sizeof(struct tc_mirred));
@@ -446,7 +452,6 @@ static struct tc_action_ops act_mirred_ops = {
 	.get_fill_size	=	tcf_mirred_get_fill_size,
 	.size		=	sizeof(struct tcf_mirred),
 	.get_dev	=	tcf_mirred_get_dev,
-	.put_dev	=	tcf_mirred_put_dev,
 };
 
 static __net_init int mirred_init_net(struct net *net)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 60d44b14750a..32577c248968 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3288,22 +3288,15 @@ void tc_cleanup_flow_action(struct flow_action *flow_action)
 }
 EXPORT_SYMBOL(tc_cleanup_flow_action);
 
-static void tcf_mirred_put_dev(void *priv)
-{
-	struct net_device *dev = priv;
-
-	dev_put(dev);
-}
-
 static void tcf_mirred_get_dev(struct flow_action_entry *entry,
 			       const struct tc_action *act)
 {
-	entry->dev = tcf_mirred_dev(act);
+#ifdef CONFIG_NET_CLS_ACT
+	entry->dev = act->ops->get_dev(act, &entry->destructor);
 	if (!entry->dev)
 		return;
-	dev_hold(entry->dev);
-	entry->destructor = tcf_mirred_put_dev;
 	entry->destructor_priv = entry->dev;
+#endif
 }
 
 static void tcf_tunnel_encap_put_tunnel(void *priv)
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 2/3] net: sched: take reference to psample group in flow_action infra
From: Vlad Buslov @ 2019-09-13 15:28 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, Vlad Buslov, Jiri Pirko
In-Reply-To: <20190913152841.15755-1-vladbu@mellanox.com>

With recent patch set that removed rtnl lock dependency from cls hardware
offload API rtnl lock is only taken when reading action data and can be
released after action-specific data is parsed into intermediate
representation. However, sample action psample group is passed by pointer
without obtaining reference to it first, which makes it possible to
concurrently overwrite the action and deallocate object pointed by
psample_group pointer after rtnl lock is released but before driver
finished using the pointer.

To prevent such race condition, obtain reference to psample group while it
is used by flow_action infra. Extend psample API with function
psample_group_take() that increments psample group reference counter.
Extend struct tc_action_ops with new get_psample_group() API. Implement the
API for action sample using psample_group_take() and already existing
psample_group_put() as a destructor. Use it in tc_setup_flow_action() to
take reference to psample group pointed to by entry->sample.psample_group
and release it in tc_cleanup_flow_action().

Disable bh when taking psample_groups_lock. The lock is now taken while
holding action tcf_lock that is used by data path and requires bh to be
disabled, so doing the same for psample_groups_lock is necessary to
preserve SOFTIRQ-irq-safety.

Fixes: 918190f50eb6 ("net: sched: flower: don't take rtnl lock for cls hw offloads API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/act_api.h          |  5 +++++
 include/net/psample.h          |  1 +
 include/net/tc_act/tc_sample.h |  6 ------
 net/psample/psample.c          | 20 ++++++++++++++------
 net/sched/act_sample.c         | 27 +++++++++++++++++++++++++++
 net/sched/cls_api.c            | 13 +++++++++++--
 6 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 3a1a72990fce..4be8b0daedf0 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -78,6 +78,8 @@ static inline void tcf_tm_dump(struct tcf_t *dtm, const struct tcf_t *stm)
 #define ACT_P_CREATED 1
 #define ACT_P_DELETED 1
 
+typedef void (*tc_action_priv_destructor)(void *priv);
+
 struct tc_action_ops {
 	struct list_head head;
 	char    kind[IFNAMSIZ];
@@ -101,6 +103,9 @@ struct tc_action_ops {
 	size_t  (*get_fill_size)(const struct tc_action *act);
 	struct net_device *(*get_dev)(const struct tc_action *a);
 	void	(*put_dev)(struct net_device *dev);
+	struct psample_group *
+	(*get_psample_group)(const struct tc_action *a,
+			     tc_action_priv_destructor *destructor);
 };
 
 struct tc_action_net {
diff --git a/include/net/psample.h b/include/net/psample.h
index 6b578ce69cd8..68ae16bb0a4a 100644
--- a/include/net/psample.h
+++ b/include/net/psample.h
@@ -15,6 +15,7 @@ struct psample_group {
 };
 
 struct psample_group *psample_group_get(struct net *net, u32 group_num);
+void psample_group_take(struct psample_group *group);
 void psample_group_put(struct psample_group *group);
 
 #if IS_ENABLED(CONFIG_PSAMPLE)
diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index b4fce0fae645..b5d76305e854 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -41,10 +41,4 @@ static inline int tcf_sample_trunc_size(const struct tc_action *a)
 	return to_sample(a)->trunc_size;
 }
 
-static inline struct psample_group *
-tcf_sample_psample_group(const struct tc_action *a)
-{
-	return rcu_dereference_rtnl(to_sample(a)->psample_group);
-}
-
 #endif /* __NET_TC_SAMPLE_H */
diff --git a/net/psample/psample.c b/net/psample/psample.c
index 66e4b61a350d..a6ceb0533b5b 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -73,7 +73,7 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
 	int idx = 0;
 	int err;
 
-	spin_lock(&psample_groups_lock);
+	spin_lock_bh(&psample_groups_lock);
 	list_for_each_entry(group, &psample_groups_list, list) {
 		if (!net_eq(group->net, sock_net(msg->sk)))
 			continue;
@@ -89,7 +89,7 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
 		idx++;
 	}
 
-	spin_unlock(&psample_groups_lock);
+	spin_unlock_bh(&psample_groups_lock);
 	cb->args[0] = idx;
 	return msg->len;
 }
@@ -172,7 +172,7 @@ struct psample_group *psample_group_get(struct net *net, u32 group_num)
 {
 	struct psample_group *group;
 
-	spin_lock(&psample_groups_lock);
+	spin_lock_bh(&psample_groups_lock);
 
 	group = psample_group_lookup(net, group_num);
 	if (!group) {
@@ -183,19 +183,27 @@ struct psample_group *psample_group_get(struct net *net, u32 group_num)
 	group->refcount++;
 
 out:
-	spin_unlock(&psample_groups_lock);
+	spin_unlock_bh(&psample_groups_lock);
 	return group;
 }
 EXPORT_SYMBOL_GPL(psample_group_get);
 
+void psample_group_take(struct psample_group *group)
+{
+	spin_lock_bh(&psample_groups_lock);
+	group->refcount++;
+	spin_unlock_bh(&psample_groups_lock);
+}
+EXPORT_SYMBOL_GPL(psample_group_take);
+
 void psample_group_put(struct psample_group *group)
 {
-	spin_lock(&psample_groups_lock);
+	spin_lock_bh(&psample_groups_lock);
 
 	if (--group->refcount == 0)
 		psample_group_destroy(group);
 
-	spin_unlock(&psample_groups_lock);
+	spin_unlock_bh(&psample_groups_lock);
 }
 EXPORT_SYMBOL_GPL(psample_group_put);
 
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 10229124a992..692c4c9040fd 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -252,6 +252,32 @@ static int tcf_sample_search(struct net *net, struct tc_action **a, u32 index)
 	return tcf_idr_search(tn, a, index);
 }
 
+static void tcf_psample_group_put(void *priv)
+{
+	struct psample_group *group = priv;
+
+	psample_group_put(group);
+}
+
+static struct psample_group *
+tcf_sample_get_group(const struct tc_action *a,
+		     tc_action_priv_destructor *destructor)
+{
+	struct tcf_sample *s = to_sample(a);
+	struct psample_group *group;
+
+	spin_lock_bh(&s->tcf_lock);
+	group = rcu_dereference_protected(s->psample_group,
+					  lockdep_is_held(&s->tcf_lock));
+	if (group) {
+		psample_group_take(group);
+		*destructor = tcf_psample_group_put;
+	}
+	spin_unlock_bh(&s->tcf_lock);
+
+	return group;
+}
+
 static struct tc_action_ops act_sample_ops = {
 	.kind	  = "sample",
 	.id	  = TCA_ID_SAMPLE,
@@ -262,6 +288,7 @@ static struct tc_action_ops act_sample_ops = {
 	.cleanup  = tcf_sample_cleanup,
 	.walk	  = tcf_sample_walker,
 	.lookup	  = tcf_sample_search,
+	.get_psample_group = tcf_sample_get_group,
 	.size	  = sizeof(struct tcf_sample),
 };
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c668195379bd..60d44b14750a 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3324,6 +3324,16 @@ static int tcf_tunnel_encap_get_tunnel(struct flow_action_entry *entry,
 	return 0;
 }
 
+static void tcf_sample_get_group(struct flow_action_entry *entry,
+				 const struct tc_action *act)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	entry->sample.psample_group =
+		act->ops->get_psample_group(act, &entry->destructor);
+	entry->destructor_priv = entry->sample.psample_group;
+#endif
+}
+
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts, bool rtnl_held)
 {
@@ -3417,11 +3427,10 @@ int tc_setup_flow_action(struct flow_action *flow_action,
 			entry->mark = tcf_skbedit_mark(act);
 		} else if (is_tcf_sample(act)) {
 			entry->id = FLOW_ACTION_SAMPLE;
-			entry->sample.psample_group =
-				tcf_sample_psample_group(act);
 			entry->sample.trunc_size = tcf_sample_trunc_size(act);
 			entry->sample.truncate = tcf_sample_truncate(act);
 			entry->sample.rate = tcf_sample_rate(act);
+			tcf_sample_get_group(entry, act);
 		} else if (is_tcf_police(act)) {
 			entry->id = FLOW_ACTION_POLICE;
 			entry->police.burst = tcf_police_tcfp_burst(act);
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next] MAINTAINERS: xen-netback: update my email address
From: Wei Liu @ 2019-09-13 15:29 UTC (permalink / raw)
  To: Wei Liu; +Cc: Paul Durrant, netdev, xen-devel
In-Reply-To: <CAHd7Wqw6bQbzR2gvzGM+bBgVQ8HHQPCBJppSWWqHT7S7Dp27qg@mail.gmail.com>

On Fri, 13 Sep 2019 at 16:28, Wei Liu <wei.liu@kernel.org> wrote:
>
> On Fri, 13 Sep 2019 at 13:47, Paul Durrant <paul.durrant@citrix.com> wrote:
> >
> > My Citrix email address will expire shortly.
> >
> > Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
>
> Acked-by: Wei Liu <wl@xen.org>

Or rather:

Acked-by: Wei Liu <wei.liu@kernel.org>

^ permalink raw reply

* SFP support with RGMII MAC via RGMII to SERDES/SGMII PHY?
From: George McCollister @ 2019-09-13 16:44 UTC (permalink / raw)
  To: netdev

Every example of phylink SFP support I've seen is using an Ethernet
MAC with native SGMII.
Can phylink facilitate support of Fiber and Copper SFP modules
connected to an RGMII MAC if all of the following are true?

1) The MAC is connected via RGMII to a transceiver/PHY (such as
Marvell 88E1512) which then connects to the SFP via SERDER/SGMII. If
you want to see a block diagram it's the first one here:
https://www.marvell.com/transceivers/assets/Alaska_88E1512-001_product_brief.pdf

2) The 1G Ethernet driver has been converted to use phylink.

3) An I2C controller on the SoC is connected to the SFP cage.

4) TxFault, LOS and MOD-DEF0 are connected to GPIO on the SoC.

5) MDIO is connected to the intermediate PHY.

Any thoughts on what might be missing to support this (if anything)
would be appreciated.

Cheers,
George McCollister

^ permalink raw reply

* [PATCH] dt-bindings: net: Correct the documentation of KSZ9021 skew values
From: James Byrne @ 2019-09-13 16:46 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland
  Cc: David S . Miller, netdev, devicetree, James Byrne

The documentation of skew values for the KSZ9021 PHY was misleading
because the driver implementation followed the erroneous information
given in the original KSZ9021 datasheet before it was corrected in
revision 1.2 (Feb 2014). It is probably too late to correct the driver
now because of the many existing device trees, so instead this just
corrects the documentation to explain that what you actually get is not
what you might think when looking at the device tree.

Signed-off-by: James Byrne <james.byrne@origamienergy.com>
---
 .../bindings/net/micrel-ksz90x1.txt           | 32 +++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
index 5100358177c9..b921731cd970 100644
--- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
@@ -12,8 +12,36 @@ and therefore may overwrite them.
 KSZ9021:
 
   All skew control options are specified in picoseconds. The minimum
-  value is 0, the maximum value is 3000, and it is incremented by 200ps
-  steps.
+  value is 0, the maximum value is 3000, and it can be specified in 200ps
+  steps, *but* these values are in not fact what you get because this chip's
+  skew values actually increase in 120ps steps, starting from -840ps. The
+  incorrect values came from an error in the original KSZ9021 datasheet
+  before it was corrected in revision 1.2 (Feb 2014), but it is too late to
+  change the driver now because of the many existing device trees that have
+  been created using values that go up in increments of 200.
+
+  The following table shows the actual skew delay you will get for each of the
+  possible devicetree values, and the number that will be programmed into the
+  corresponding pad skew register:
+
+  Device Tree Value	Delay	Pad Skew Register Value
+  -----------------------------------------------------
+	0   		-840ps		0000
+	200 		-720ps		0001
+	400 		-600ps		0010
+	600 		-480ps		0011
+	800 		-360ps		0100
+	1000		-240ps		0101
+	1200		-120ps		0110
+	1400		   0ps		0111
+	1600		 120ps		1000
+	1800		 240ps		1001
+	2000		 360ps		1010
+	2200		 480ps		1011
+	2400		 600ps		1100
+	2600		 720ps		1101
+	2800		 840ps		1110
+	3000		 960ps		1111
 
   Optional properties:
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Eric W. Biederman @ 2019-09-13 16:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Al Viro, Carlos Antonio Neira Bustos, netdev@vger.kernel.org,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <7b0a325e-9187-702f-eba7-bfcc7e3f7eb4@fb.com>

Yonghong Song <yhs@fb.com> writes:

> On 9/11/19 9:16 AM, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>>
>>>> Carlos,
>>>>
>>>> Discussed with Eric today for what is the best way to get
>>>> the device number for a namespace. The following patch seems
>>>> a reasonable start although Eric would like to see
>>>> how the helper is used in order to decide whether the
>>>> interface looks right.
>>>>
>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>>> Author: Yonghong Song <yhs@fb.com>
>>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>>
>>>>       nsfs: add an interface function ns_get_inum_dev()
>>>>
>>>>       This patch added an interface function
>>>>       ns_get_inum_dev(). Given a ns_common structure,
>>>>       the function returns the inode and device
>>>>       numbers. The function will be used later
>>>>       by a newly added bpf helper.
>>>>
>>>>       Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>
>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>>> index a0431642c6b5..a603c6fc3f54 100644
>>>> --- a/fs/nsfs.c
>>>> +++ b/fs/nsfs.c
>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>>           return ERR_PTR(-EINVAL);
>>>>    }
>>>>
>>>> +/* Get the device number for the current task pidns.
>>>> + */
>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>>> +{
>>>> +       *inum = ns->inum;
>>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>>> +}
>>>
>>> Umm...  Where would it get the device number once we get (hell knows
>>> what for) multiple nsfs instances?  I still don't understand what
>>> would that be about, TBH...  Is it really per-userns?  Or something
>>> else entirely?  Eric, could you give some context?
>> 
>> My goal is not to paint things into a corner, with future changes.
>> Right now it is possible to stat a namespace file descriptor and
>> get a device and inode number.  Then compare that.
>> 
>> I don't want people using the inode number in nsfd as some magic
>> namespace id.
>> 
>> We have had times in the past where there was more than one superblock
>> and thus more than one device number.  Further if userspace ever uses
>> this heavily there may be times in the future where for
>> checkpoint/restart purposes we will want multiple nsfd's so we can
>> preserve the inode number accross a migration.
>> 
>> Realistically there will probably just some kind of hotplug notification
>> to userspace to say we have hotplugged your operatining system as
>> a migration notification.
>> 
>> Now the halway discussion did not quite capture everything I was trying
>> to say but it at least got to the right ballpark.
>> 
>> The helper in fs/nsfs.c should be:
>> 
>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>> {
>>          return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>> }
>> 
>> That way if/when there are multiple inodes identifying the same
>> namespace the bpf programs don't need to change.
>
> Thanks, Eric. This is indeed better. The bpf helper should focus
> on comparing dev/ino, instead of return the dev/ino to bpf program.
>
> So overall, nsfs related change will look like:
>
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index a0431642c6b5..7e78d89c2172 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
>          return ERR_PTR(-EINVAL);
>   }
>
> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
> +{
> +       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
> +}
> +
>   static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>   {
>          struct inode *inode = d_inode(dentry);
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index d31cb6215905..79639807e960 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
> task_struct *task,
>   typedef struct ns_common *ns_get_path_helper_t(void *);
>   extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
> ns_get_cb,
>                              void *private_data);
> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);
>
>   extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>                          const struct proc_ns_operations *ns_ops);
>
>> 
>> Up farther in the stack it should be something like:
>> 
>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>>> {
>>>          return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>>> }
>>>
>>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>>> 	.func		= bpf_current_pins_match,
>>> 	.gpl_only	= true,
>>> 	.ret_type	= RET_INTEGER
>>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>>> };
>> 
>> That allows comparing what the bpf came up with with whatever value
>> userspace generated by stating the file descriptor.
>> 
>> 
>> That is the least bad suggestion I currently have for that
>> functionality.  It really would be better to not have that filter in the
>> bpf program itself but in the infrastructure that binds a program to a
>> set of tasks.
>> 
>> The problem with this approach is whatever device/inode you have when
>> the namespace they refer to exits there is the possibility that the
>> inode will be reused.  So your filter will eventually start matching on
>> the wrong thing.
>
> I come up with a differeent helper definition, which is much more
> similar to existing bpf_get_current_pid_tgid() and helper definition
> much more conforms to bpf convention.

There is a problem with your bpf_get_ns_current_pid_tgid below.
The inode number is a 64bit number.  To be nice to old userspace
we try and not use 64bit inode numbers where they are not required
but in this case we should not use an interface that assumes inode
numbers are 32bit.  They just aren't.

I didn't know how to express that in the bpf proto so I did what
I could.

The alternative to this would be to simply restrict this
helper to bpf programs registered in the initial pid namespace.
At which point you could just ensure all the numbers are in
the global pid namespace.

Hmm.  Looing at the comment below I am confused.

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..bc26903c80c7 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,8 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/proc_ns.h>
>
>   #include "../../lib/kstrtox.h"
>
> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
>          .arg4_type      = ARG_PTR_TO_LONG,
>   };
>   #endif
> +
> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
> +{
> +       struct task_struct *task = current;
> +       struct pid_namespace *pidns;
> +       pid_t pid, tgid;
> +
> +       if (unlikely(!task))
> +               return -EINVAL;
> +
> +
> +       pidns = task_active_pid_ns(task);
> +       if (unlikely(!pidns))
> +               return -ENOENT;
> +
> +       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
> +               return -EINVAL;
> +
> +       pid = task_pid_nr_ns(task, pidns);
> +       tgid = task_tgid_nr_ns(task, pidns);
> +
> +       return (u64) tgid << 32 | pid;
> +}
> +
> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
> +       .func           = bpf_get_ns_current_pid_tgid,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_ANYTHING,
> +       .arg2_type      = ARG_ANYTHING,
> +};
>
> Existing usage of bpf_get_current_pid_tgid() can be converted
> to bpf_get_ns_current_pid_tgid() if ns dev/inode number
> is supplied. For bpf_get_ns_current_pid_tgid(), checking
> return value ( < 0 or not) is needed.

Ok.  I missed something.

What is the problem bpf_get_ns_current_pid_tgid trying to solve
that bpf_get_current_pid_tgid does not solve.

I would think since much of tracing ebpf is fundamentally restricted
to the global root user.  Limiting the ebpf programs to the initial
pid namespace should not be a problem.

So I don't understand why you need to specify the namespace in
the ebpf call.

Can someone give me a clue what problem is being sovled by this
new call?

Eric

^ permalink raw reply

* Re: "[RFC PATCH net-next 2/2] Reduce localhost to 127.0.0.0/16"
From: David Ahern @ 2019-09-13 17:01 UTC (permalink / raw)
  To: Dave Taht, Mark Smith; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAA93jw4SC2choBKXvaTD_5j93Op=RZ9ZEeKmyAu31ys_uNhSyA@mail.gmail.com>

On 9/13/19 10:14 AM, Dave Taht wrote:
> it came out that cumulus and a few others were possibly using high
> values of 127.x for switch chassis addressing, but we haven't got any
> documentation on how that works yet.

Not Cumulus.

I noted I am aware of 2 products from my history that use 127.x
addresses for communications within a box - e.g., to a bmc - that your
patch could break. Really it was meant as a data point that there are
released products that would be affected.

^ permalink raw reply

* Re: big ICMP requests get disrupted on IPSec tunnel activation
From: David Ahern @ 2019-09-13 17:13 UTC (permalink / raw)
  To: Bartschies, Thomas, 'netdev@vger.kernel.org'
In-Reply-To: <EB8510AA7A943D43916A72C9B8F4181F629D9741@cvk038.intra.cvk.de>

On 9/13/19 9:59 AM, Bartschies, Thomas wrote:
> Hello together,
> 
> since kenel 4.20 we're observing a strange behaviour when sending big ICMP packets. An example is a packet size of 3000 bytes.
> The packets should be forwarded by a linux gateway (firewall) having multiple interfaces also acting as a vpn gateway.
> 
> Test steps:
> 1. Disabled all iptables rules
> 2. Enabled the VPN IPSec Policies.
> 3. Start a ping with packet size (e.g. 3000 bytes) from a client in the DMZ passing the machine targeting another LAN machine
> 4. Ping works
> 5. Enable a VPN policy by sending pings from the gateway to a tunnel target. System tries to create the tunnel
> 6. Ping from 3. immediately stalls. No error messages. Just stops.
> 7. Stop Ping from 3. Start another without packet size parameter. Stalls also.
> 
> Result:
> Connections from the client to other services on the LAN machine still work. Tracepath works. Only ICMP requests do not pass
> the gateway anymore. tcpdump sees them on incoming interface, but not on the outgoing LAN interface. IMCP requests to any
> other target IP address in LAN still work. Until one uses a bigger packet size. Then these alternative connections stall also.
> 
> Flushing the policy table has no effect. Flushing the conntrack table has no effect. Setting rp_filter to loose (2) has no effect.
> Flush the route cache has no effect.
> 
> Only a reboot of the gateway restores normal behavior.
> 
> What can be the cause? Is this a networking bug?
> 

some of these most likely will fail due to other reasons, but can you
run 'tools/testing/selftests/net/pmtu.sh'[1] on 4.19 and then 4.20 and
compare results. Hopefully it will shed some light on the problem and
can be used to bisect to a commit that caused the regression.


[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/pmtu.sh


^ permalink raw reply

* Re: [patch iproute2-next v4 0/2] devlink: couple forgotten flash patches
From: David Ahern @ 2019-09-13 17:25 UTC (permalink / raw)
  To: Jiri Pirko, netdev; +Cc: stephen, jakub.kicinski, saeedm, mlxsw, f.fainelli
In-Reply-To: <20190912112938.2292-1-jiri@resnulli.us>

On 9/12/19 12:29 PM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> I was under impression they are already merged, but apparently they are
> not. I just rebased them on top of current iproute2 net-next tree.
> 

they were not forgotten; they were dropped asking for changes.

thread is here:
https://lore.kernel.org/netdev/20190604134450.2839-3-jiri@resnulli.us/

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Yonghong Song @ 2019-09-13 17:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Carlos Antonio Neira Bustos, netdev@vger.kernel.org,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <87a7b8gmj8.fsf@x220.int.ebiederm.org>



On 9/13/19 5:59 PM, Eric W. Biederman wrote:
> Yonghong Song <yhs@fb.com> writes:
> 
>> On 9/11/19 9:16 AM, Eric W. Biederman wrote:
>>> Al Viro <viro@zeniv.linux.org.uk> writes:
>>>
>>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>>>
>>>>> Carlos,
>>>>>
>>>>> Discussed with Eric today for what is the best way to get
>>>>> the device number for a namespace. The following patch seems
>>>>> a reasonable start although Eric would like to see
>>>>> how the helper is used in order to decide whether the
>>>>> interface looks right.
>>>>>
>>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>>>> Author: Yonghong Song <yhs@fb.com>
>>>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>>>
>>>>>        nsfs: add an interface function ns_get_inum_dev()
>>>>>
>>>>>        This patch added an interface function
>>>>>        ns_get_inum_dev(). Given a ns_common structure,
>>>>>        the function returns the inode and device
>>>>>        numbers. The function will be used later
>>>>>        by a newly added bpf helper.
>>>>>
>>>>>        Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>
>>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>>>> index a0431642c6b5..a603c6fc3f54 100644
>>>>> --- a/fs/nsfs.c
>>>>> +++ b/fs/nsfs.c
>>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>>>            return ERR_PTR(-EINVAL);
>>>>>     }
>>>>>
>>>>> +/* Get the device number for the current task pidns.
>>>>> + */
>>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>>>> +{
>>>>> +       *inum = ns->inum;
>>>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>>>> +}
>>>>
>>>> Umm...  Where would it get the device number once we get (hell knows
>>>> what for) multiple nsfs instances?  I still don't understand what
>>>> would that be about, TBH...  Is it really per-userns?  Or something
>>>> else entirely?  Eric, could you give some context?
>>>
>>> My goal is not to paint things into a corner, with future changes.
>>> Right now it is possible to stat a namespace file descriptor and
>>> get a device and inode number.  Then compare that.
>>>
>>> I don't want people using the inode number in nsfd as some magic
>>> namespace id.
>>>
>>> We have had times in the past where there was more than one superblock
>>> and thus more than one device number.  Further if userspace ever uses
>>> this heavily there may be times in the future where for
>>> checkpoint/restart purposes we will want multiple nsfd's so we can
>>> preserve the inode number accross a migration.
>>>
>>> Realistically there will probably just some kind of hotplug notification
>>> to userspace to say we have hotplugged your operatining system as
>>> a migration notification.
>>>
>>> Now the halway discussion did not quite capture everything I was trying
>>> to say but it at least got to the right ballpark.
>>>
>>> The helper in fs/nsfs.c should be:
>>>
>>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>>> {
>>>           return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>>> }
>>>
>>> That way if/when there are multiple inodes identifying the same
>>> namespace the bpf programs don't need to change.
>>
>> Thanks, Eric. This is indeed better. The bpf helper should focus
>> on comparing dev/ino, instead of return the dev/ino to bpf program.
>>
>> So overall, nsfs related change will look like:
>>
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index a0431642c6b5..7e78d89c2172 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
>>           return ERR_PTR(-EINVAL);
>>    }
>>
>> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>> +{
>> +       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>> +}
>> +
>>    static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>>    {
>>           struct inode *inode = d_inode(dentry);
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index d31cb6215905..79639807e960 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct
>> task_struct *task,
>>    typedef struct ns_common *ns_get_path_helper_t(void *);
>>    extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t
>> ns_get_cb,
>>                               void *private_data);
>> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);
>>
>>    extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>>                           const struct proc_ns_operations *ns_ops);
>>
>>>
>>> Up farther in the stack it should be something like:
>>>
>>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>>>> {
>>>>           return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>>>> }
>>>>
>>>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>>>> 	.func		= bpf_current_pins_match,
>>>> 	.gpl_only	= true,
>>>> 	.ret_type	= RET_INTEGER
>>>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>>>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>>>> };
>>>
>>> That allows comparing what the bpf came up with with whatever value
>>> userspace generated by stating the file descriptor.
>>>
>>>
>>> That is the least bad suggestion I currently have for that
>>> functionality.  It really would be better to not have that filter in the
>>> bpf program itself but in the infrastructure that binds a program to a
>>> set of tasks.
>>>
>>> The problem with this approach is whatever device/inode you have when
>>> the namespace they refer to exits there is the possibility that the
>>> inode will be reused.  So your filter will eventually start matching on
>>> the wrong thing.
>>
>> I come up with a differeent helper definition, which is much more
>> similar to existing bpf_get_current_pid_tgid() and helper definition
>> much more conforms to bpf convention.
> 
> There is a problem with your bpf_get_ns_current_pid_tgid below.
> The inode number is a 64bit number.  To be nice to old userspace
> we try and not use 64bit inode numbers where they are not required
> but in this case we should not use an interface that assumes inode
> numbers are 32bit.  They just aren't.
> 
> I didn't know how to express that in the bpf proto so I did what
> I could.

We can change inum to u64. Just change the prototype like
`u64, inum` should be good.

> 
> The alternative to this would be to simply restrict this
> helper to bpf programs registered in the initial pid namespace.
> At which point you could just ensure all the numbers are in
> the global pid namespace.
> 
> Hmm.  Looing at the comment below I am confused.
> 
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5e28718928ca..bc26903c80c7 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -11,6 +11,8 @@
>>    #include <linux/uidgid.h>
>>    #include <linux/filter.h>
>>    #include <linux/ctype.h>
>> +#include <linux/pid_namespace.h>
>> +#include <linux/proc_ns.h>
>>
>>    #include "../../lib/kstrtox.h"
>>
>> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
>>           .arg4_type      = ARG_PTR_TO_LONG,
>>    };
>>    #endif
>> +
>> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
>> +{
>> +       struct task_struct *task = current;
>> +       struct pid_namespace *pidns;
>> +       pid_t pid, tgid;
>> +
>> +       if (unlikely(!task))
>> +               return -EINVAL;
>> +
>> +
>> +       pidns = task_active_pid_ns(task);
>> +       if (unlikely(!pidns))
>> +               return -ENOENT;
>> +
>> +       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
>> +               return -EINVAL;
>> +
>> +       pid = task_pid_nr_ns(task, pidns);
>> +       tgid = task_tgid_nr_ns(task, pidns);
>> +
>> +       return (u64) tgid << 32 | pid;
>> +}
>> +
>> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
>> +       .func           = bpf_get_ns_current_pid_tgid,
>> +       .gpl_only       = false,
>> +       .ret_type       = RET_INTEGER,
>> +       .arg1_type      = ARG_ANYTHING,
>> +       .arg2_type      = ARG_ANYTHING,
>> +};
>>
>> Existing usage of bpf_get_current_pid_tgid() can be converted
>> to bpf_get_ns_current_pid_tgid() if ns dev/inode number
>> is supplied. For bpf_get_ns_current_pid_tgid(), checking
>> return value ( < 0 or not) is needed.
> 
> Ok.  I missed something.
> 
> What is the problem bpf_get_ns_current_pid_tgid trying to solve
> that bpf_get_current_pid_tgid does not solve.
> 
> I would think since much of tracing ebpf is fundamentally restricted
> to the global root user.  Limiting the ebpf programs to the initial
> pid namespace should not be a problem.
> 
> So I don't understand why you need to specify the namespace in
> the ebpf call.
> 
> Can someone give me a clue what problem is being sovled by this
> new call?

We want to run the bpf program inside the namespace.
There are parallel work to introduce CAP_BPF and CAP_TRACING
(https://lore.kernel.org/bpf/20190906231053.1276792-1-ast@kernel.org/T/#t)
to facilitate this.

We have users requesting to use bcc tools inside the containers.
https://github.com/iovisor/bcc/issues/1875
https://github.com/iovisor/bcc/issues/1366
https://github.com/iovisor/bcc/issues/1329
https://github.com/iovisor/bcc/issues/1532
...

Yes, this may require granting `root` privilege to containers.
This can be done outside container as well. But it is just a
big usability improvement if people can do inside the containers.

In addition, we have requests below and internal requests as well
to filter based on containers.
https://github.com/iovisor/bcc/issues/1119
The new helper permits at root that you can filter based on
a particular container (not container id, but dev/inode should
identifying one).

Hope this clarify why this helper is useful for tracing community.

> 
> Eric
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox