Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 0/5] rxrpc: Fix use of skb_cow_data()
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel


Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
with skb_unshare() early on in the input process.  The problem that is
being seen is that skb_cow_data() indirectly requires that the maximum
usage count on an sk_buff be 1, and it may generate an assertion failure in
pskb_expand_head() if not.

This can occur because rxrpc_input_data() may be still holding a ref when
it has just attached the sk_buff to the rx ring and given that attachment
its own ref.  If recvmsg happens fast enough, skb_cow_data() can see the
ref still held by the softirq handler.

Further, a packet may contain multiple subpackets, each of which gets its
own attachment to the ring and its own ref - also making skb_cow_data() go
bang.

Fix this by:

 (1) The DATA packet is currently parsed for subpackets twice by the input
     routines.  Parse it just once instead and make notes in the sk_buff
     private data.

 (2) Use the notes from (1) when attaching the packet to the ring multiple
     times.  Once the packet is attached to the ring, recvmsg can see it
     and start modifying it, so the softirq handler is not permitted to
     look inside it from that point.

 (3) Pass the ref from the input code to the ring rather than getting an
     extra ref.  rxrpc_input_data() uses a ref on the second refcount to
     prevent the packet from evaporating under it.

 (4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
     before we take call->input_lock.  Other sorts of packets don't get
     modified and so can be left.

     A trace is emitted if skb_unshare() eats the skb.  Note that
     skb_share() for our accounting in this regard as we can't see the
     parameters in the packet to log in a trace line if it releases it.

 (5) Remove the calls to skb_cow_data().  These are then no longer
     necessary.

There are also patches to improve the rxrpc_skb tracepoint to make sure
that Tx-derived buffers are identified separately from Rx-derived buffers
in the trace.

The patches are tagged here:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-fixes-20190827

and can also be found on the following branch:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes

David
---
David Howells (5):
      rxrpc: Pass the input handler's data skb reference to the Rx ring
      rxrpc: Abstract out rxtx ring cleanup
      rxrpc: Add a private skb flag to indicate transmission-phase skbs
      rxrpc: Use the tx-phase skb flag to simplify tracing
      rxrpc: Use skb_unshare() rather than skb_cow_data()


 include/trace/events/rxrpc.h |   59 ++++++++++++++++++++----------------------
 net/rxrpc/ar-internal.h      |    3 ++
 net/rxrpc/call_event.c       |    8 +++---
 net/rxrpc/call_object.c      |   33 +++++++++++------------
 net/rxrpc/conn_event.c       |    6 ++--
 net/rxrpc/input.c            |   52 ++++++++++++++++++++++++++++---------
 net/rxrpc/local_event.c      |    4 +--
 net/rxrpc/output.c           |    6 ++--
 net/rxrpc/peer_event.c       |   10 ++++---
 net/rxrpc/recvmsg.c          |    6 ++--
 net/rxrpc/rxkad.c            |   32 ++++++-----------------
 net/rxrpc/sendmsg.c          |   13 +++++----
 net/rxrpc/skbuff.c           |   40 ++++++++++++++++++++--------
 13 files changed, 151 insertions(+), 121 deletions(-)


^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Ivan Vecera @ 2019-08-29 12:55 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: Jiri Pirko, alexandre.belloni, UNGLinuxDriver, davem, andrew,
	allan.nielsen, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829124412.nrlpz5tzx3fkdoiw@soft-dev3.microsemi.net>

On Thu, 29 Aug 2019 14:44:14 +0200
Horatiu Vultur <horatiu.vultur@microchip.com> wrote:

> When a port is added to a bridge, then the port gets in promisc mode.
> But in our case the HW has bridge capabilities so it is not required
> to set the port in promisc mode.
> But if someone else requires the port to be in promisc mode (tcpdump
> or any other application) then I would like to set the port in promisc
> mode.
> In previous emails Andrew came with the suggestion to look at
> dev->promiscuity and check if the port is a bridge port. Using this
> information I could see when to add the port in promisc mode. He also
> suggested to add a new switchdev call(maybe I missunderstood him, or I
> have done it at the wrong place) in case there are no callbacks in the
> driver to get this information.

I would use the 1st suggestion.

for/in your driver:
if (dev->promiscuity > 0) {
	if (dev->promiscuity == 1 && netif_is_bridge_port(dev)) {
		/* no need to set promisc mode because promiscuity
		 * was requested by bridge
		 */
		...
	} else {
		/* need to set promisc mode as someone else requested
		 * promiscuity
		 */
	}
}

Thanks,
Ivan

^ permalink raw reply

* Re: [PATCH net] netdevsim: Restore per-network namespace accounting for fib entries
From: David Ahern @ 2019-08-29 12:54 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David Ahern, davem, netdev
In-Reply-To: <20190829062850.GG2312@nanopsycho>

On 8/29/19 12:28 AM, Jiri Pirko wrote:
> Wed, Aug 28, 2019 at 11:26:03PM CEST, dsahern@gmail.com wrote:
>> On 8/28/19 4:37 AM, Jiri Pirko wrote:
>>> Tue, Aug 06, 2019 at 09:15:17PM CEST, dsahern@kernel.org wrote:
>>>> From: David Ahern <dsahern@gmail.com>
>>>>
>>>> Prior to the commit in the fixes tag, the resource controller in netdevsim
>>>> tracked fib entries and rules per network namespace. Restore that behavior.
>>>
>>> David, please help me understand. If the counters are per-device, not
>>> per-netns, they are both the same. If we have device (devlink instance)
>>> is in a netns and take only things happening in this netns into account,
>>> it should count exactly the same amount of fib entries, doesn't it?
>>
>> if you are only changing where the counters are stored - net_generic vs
>> devlink private - then yes, they should be equivalent.
> 
> Okay.
> 
>>
>>>
>>> I re-thinked the devlink netns patchset and currently I'm going in
>>> slightly different direction. I'm having netns as an attribute of
>>> devlink reload. So all the port netdevices and everything gets
>>> re-instantiated into new netns. Works fine with mlxsw. There we also
>>> re-register the fib notifier.
>>>
>>> I think that this can work for your usecase in netdevsim too:
>>> 1) devlink instance is registering a fib notifier to track all fib
>>>    entries in a namespace it belongs to. The counters are per-device -
>>>    counting fib entries in a namespace the device is in.
>>> 2) another devlink instance can do the same tracking in the same
>>>    namespace. No problem, it's a separate counter, but the numbers are
>>>    the same. One can set different limits to different devlink
>>>    instances, but you can have only one. That is the bahaviour you have
>>>    now.
>>> 3) on devlink reload, netdevsim re-instantiates ports and re-registers
>>>    fib notifier
>>> 4) on devlink reload with netns change, all should be fine as the
>>>    re-registered fib nofitier replays the entries. The ports are
>>>    re-instatiated in new netns.
>>>
>>> This way, we would get consistent behaviour between netdevsim and real
>>> devices (mlxsw), correct devlink-netns implementation (you also
>>> suggested to move ports to the namespace). Everyone should be happy.
>>>
>>> What do you think?
>>>
>>
>> Right now, registering the fib notifier walks all namespaces. That is
>> not a scalable solution. Are you changing that to replay only a given
>> netns? Are you changing the notifiers to be per-namespace?
> 
> Eventually, that seems like good idea. Currently I want to do
> if (net==nsim_dev->mynet)
> 	done
> check at the beginning of the notifier.
> 

The per-namespace replay should be done as part of this re-work. It
should not be that big of a change. Add 'struct net' arg to
register_fib_notifier. If set, call fib_net_dump only for that
namespace. The seq check should be made per-namespace.

You mentioned mlxsw works fine with moving ports to a new network
namespace, so that will be a 'real' example with a known scalability
problem that should be addressed now.

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Horatiu Vultur @ 2019-08-29 12:44 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexandre.belloni, UNGLinuxDriver, davem, andrew, allan.nielsen,
	ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829121811.GI2312@nanopsycho>

The 08/29/2019 14:18, Jiri Pirko wrote:
> External E-Mail
> 
> 
> Thu, Aug 29, 2019 at 12:56:54PM CEST, horatiu.vultur@microchip.com wrote:
> >The 08/29/2019 11:51, Jiri Pirko wrote:
> >> External E-Mail
> >> 
> >> 
> >> Thu, Aug 29, 2019 at 11:22:28AM CEST, horatiu.vultur@microchip.com wrote:
> >> >Add the SWITCHDEV_ATTR_ID_PORT_PROMISCUITY switchdev notification type,
> >> >used to indicate whenever the dev promiscuity counter is changed.
> >> >
> >> >The notification doesn't use any switchdev_attr attribute because in the
> >> >notifier callbacks is it possible to get the dev and read directly
> >> >the promiscuity value.
> >> >
> >> >Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
> >> >---
> >> > include/net/switchdev.h | 1 +
> >> > net/core/dev.c          | 9 +++++++++
> >> > 2 files changed, 10 insertions(+)
> >> >
> >> >diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> >> >index aee86a1..14b1617 100644
> >> >--- a/include/net/switchdev.h
> >> >+++ b/include/net/switchdev.h
> >> >@@ -40,6 +40,7 @@ enum switchdev_attr_id {
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_MROUTER,
> >> >+	SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
> >> > };
> >> > 
> >> > struct switchdev_attr {
> >> >diff --git a/net/core/dev.c b/net/core/dev.c
> >> >index 49589ed..40c74f2 100644
> >> >--- a/net/core/dev.c
> >> >+++ b/net/core/dev.c
> >> >@@ -142,6 +142,7 @@
> >> > #include <linux/net_namespace.h>
> >> > #include <linux/indirect_call_wrapper.h>
> >> > #include <net/devlink.h>
> >> >+#include <net/switchdev.h>
> >> > 
> >> > #include "net-sysfs.h"
> >> > 
> >> >@@ -7377,6 +7378,11 @@ static void dev_change_rx_flags(struct net_device *dev, int flags)
> >> > static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> >> > {
> >> > 	unsigned int old_flags = dev->flags;
> >> >+	struct switchdev_attr attr = {
> >> >+		.orig_dev = dev,
> >> >+		.id = SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
> >> >+		.flags = SWITCHDEV_F_DEFER,
> >> 
> >
> >Hi Jiri,
> >
> >> NACK
> >> 
> >> This is invalid usecase for switchdev infra. Switchdev is there for
> >> bridge offload purposes only.
> >> 
> >> For promiscuity changes, the infrastructure is already present in the
> >> code. See __dev_notify_flags(). it calls:
> >> call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info)
> >> and you can actually see the changed flag in ".flags_changed".
> >Yes, you are right. But in case the port is part of a bridge and then
> >enable promisc mode by a user space application(tpcdump) then the drivers
> 
> If the promisc is on, it is on. Why do you need to know who enabled it?
When a port is added to a bridge, then the port gets in promisc mode.
But in our case the HW has bridge capabilities so it is not required to
set the port in promisc mode.
But if someone else requires the port to be in promisc mode (tcpdump or
any other application) then I would like to set the port in promisc
mode.
In previous emails Andrew came with the suggestion to look at
dev->promiscuity and check if the port is a bridge port. Using this
information I could see when to add the port in promisc mode. He also
suggested to add a new switchdev call(maybe I missunderstood him, or I
have done it at the wrong place) in case there are no callbacks in the
driver to get this information.
> 
> Or do you want to use this to ask driver to ask hw to trap packets to
> kernel? If yes, I don't think it is correct. If you want to "steal" some
> packets from the hw forwarding datapath, use TC action "trap".
No, I just wanted to know when to update the HW to set the port in
promisc mode.

> 
> 
> >will not be notified. The reason is that the dev->flags will still be the
> >same(because IFF_PROMISC was already set) and only promiscuity will be
> >changed.
> >
> >One fix could be to call __dev_notify_flags() no matter when the
> >promisc is enable or disabled.
> >
> >> 
> >> You just have to register netdev notifier block in your driver. Grep
> >> for: register_netdevice_notifier
> >> 
> >> 
> >> >+	};
> >> > 	kuid_t uid;
> >> > 	kgid_t gid;
> >> > 
> >> >@@ -7419,6 +7425,9 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> >> > 	}
> >> > 	if (notify)
> >> > 		__dev_notify_flags(dev, old_flags, IFF_PROMISC);
> >> >+
> >> >+	switchdev_port_attr_set(dev, &attr);
> >> >+
> >> > 	return 0;
> >> > }
> >> > 
> >> >-- 
> >> >2.7.4
> >> >
> >> 
> >
> >-- 
> >/Horatiu
> 

-- 
/Horatiu

^ permalink raw reply

* Re: [PATCH] wil6210: Delete an unnecessary kfree() call in wil_tid_ampdu_rx_alloc()
From: merez @ 2019-08-28  7:44 UTC (permalink / raw)
  To: Markus Elfring
  Cc: linux-wireless, netdev, wil6210, David S. Miller, Kalle Valo,
	LKML, kernel-janitors, linux-wireless-owner
In-Reply-To: <b9620e49-618d-b392-6456-17de5807df75@web.de>

On 2019-08-27 17:44, Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Tue, 27 Aug 2019 16:39:02 +0200
> 
> A null pointer would be passed to a call of the function “kfree”
> directly after a call of the function “kcalloc” failed at one place.
> Remove this superfluous function call.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---

Reviewed-by: Maya Erez <merez@codeaurora.org>

-- 
Maya Erez
Qualcomm Israel, Inc. on behalf of Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH ipsec-next 7/7] xfrm: add espintcp (RFC 8229)
From: Sabrina Dubroca @ 2019-08-29 12:34 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Herbert Xu
In-Reply-To: <20190829070431.GA2879@gauss3.secunet.de>

2019-08-29, 09:04:31 +0200, Steffen Klassert wrote:
> On Wed, Aug 21, 2019 at 11:46:25PM +0200, Sabrina Dubroca wrote:
> > +static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
> > +{
> > +	struct xfrm_encap_tmpl *encap = x->encap;
> > +	struct esp_tcp_sk *esk;
> > +	__be16 sport, dport;
> > +	struct sock *nsk;
> > +	struct sock *sk;
> > +
> > +	sk = rcu_dereference(x->encap_sk);
> > +	if (sk && sk->sk_state == TCP_ESTABLISHED)
> > +		return sk;
> > +
> > +	spin_lock_bh(&x->lock);
> > +	sport = encap->encap_sport;
> > +	dport = encap->encap_dport;
> > +	nsk = rcu_dereference_protected(x->encap_sk,
> > +					lockdep_is_held(&x->lock));
> > +	if (sk && sk == nsk) {
> > +		esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
> > +		if (!esk) {
> > +			spin_unlock_bh(&x->lock);
> > +			return ERR_PTR(-ENOMEM);
> > +		}
> > +		RCU_INIT_POINTER(x->encap_sk, NULL);
> > +		esk->sk = sk;
> > +		call_rcu(&esk->rcu, esp_free_tcp_sk);
> 
> I don't understand this, can you please explain what you are doing here?

If we get to this block, the current encap_sk is not in ESTABLISHED
state and cannot be used to send data. We want to get rid of it (reset
x->encap_sk) and find a new usable socket. The weird kmalloc dance is
just here so that we can do the sock_put after an RCU grace period,
which I think is needed so that we don't destruct a socket that's
still used by packets in flight.

That's code I copied from Herbert's first submission, I may be missing
some details here.

> > +	}
> > +	spin_unlock_bh(&x->lock);
> > +
> > +	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
> > +				     dport, x->props.saddr.a4, sport, 0);
> > +	if (!sk)
> > +		return ERR_PTR(-ENOENT);
> > +
> > +	if (!tcp_is_ulp_esp(sk)) {
> > +		sock_put(sk);
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	spin_lock_bh(&x->lock);
> > +	nsk = rcu_dereference_protected(x->encap_sk,
> > +					lockdep_is_held(&x->lock));
> > +	if (encap->encap_sport != sport ||
> > +	    encap->encap_dport != dport) {
> > +		sock_put(sk);
> > +		sk = nsk ?: ERR_PTR(-EREMCHG);
> > +	} else if (sk == nsk) {
> > +		sock_put(sk);
> > +	} else {
> > +		rcu_assign_pointer(x->encap_sk, sk);
> > +	}
> > +	spin_unlock_bh(&x->lock);
> > +
> > +	return sk;
> > +}
> > +
> > +static int esp_output_tcp_finish(struct xfrm_state *x, struct sk_buff *skb)
> > +{
> > +	struct sock *sk;
> > +	int err;
> > +
> > +	rcu_read_lock();
> > +
> > +	sk = esp_find_tcp_sk(x);
> > +	err = PTR_ERR(sk);
> > +	if (IS_ERR(sk))
> 
> Maybe better 'if (err)'?

That will work if I replace PTR_ERR with PTR_ERR_OR_ZERO.

> > +		goto out;
> > +
> > +	bh_lock_sock(sk);
> > +	if (sock_owned_by_user(sk)) {
> > +		err = espintcp_queue_out(sk, skb);
> > +		if (err < 0)
> > +			goto unlock_sock;
> 
> This goto is not needed.

Will fix.

> > +	} else {
> > +		err = espintcp_push_skb(sk, skb);
> > +	}
> > +
> > +unlock_sock:
> > +	bh_unlock_sock(sk);
> > +out:
> > +	rcu_read_unlock();
> > +	return err;
> > +}
> > +
> > +static int esp_output_tcp_encap_cb(struct net *net, struct sock *sk,
> > +				   struct sk_buff *skb)
> > +{
> > +	struct dst_entry *dst = skb_dst(skb);
> > +	struct xfrm_state *x = dst->xfrm;
> > +
> > +	return esp_output_tcp_finish(x, skb);
> > +}
> > +
> > +static int esp_output_tail_tcp(struct xfrm_state *x, struct sk_buff *skb)
> > +{
> > +	int err;
> > +
> > +	local_bh_disable();
> > +	err = xfrm_trans_queue_net(xs_net(x), skb, esp_output_tcp_encap_cb);
> > +	local_bh_enable();
> > +
> > +	/* EINPROGRESS just happens to do the right thing.  It
> > +	 * actually means that the skb has been consumed and
> > +	 * isn't coming back.
> > +	 */
> > +	return err ?: -EINPROGRESS;
> > +}
> > +#endif
> > +
> >  static void esp_output_done(struct crypto_async_request *base, int err)
> >  {
> >  	struct sk_buff *skb = base->data;
> > @@ -147,7 +272,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
> >  		secpath_reset(skb);
> >  		xfrm_dev_resume(skb);
> >  	} else {
> > -		xfrm_output_resume(skb, err);
> > +#ifdef CONFIG_XFRM_ESPINTCP
> 
> Do we really need all these ifdef? I guess most of them could
> be avoided with some code refactorization.

If we compile espintcp out, esp_output_tail_tcp will be dead code so
it might as well not be compiled in either.

The more trivial operations (like in esp_input_done2) could be
compiled in anyway. That might be considered a bit inconsistent,
because I need to keep the ifdef in esp_init_state (so that we land in
the default case and error out when espintcp is disabled).

Then I can provide variants of some functions (esp_output_tcp_encap,
esp_output_tail_tcp) that always fail for !XFRM_ESPINTCP.

> > +		if (!err &&
> > +		    x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
> > +			esp_output_tail_tcp(x, skb);
> > +		else
> > +#endif
> > +			xfrm_output_resume(skb, err);
> >  	}
> >  }
> 
> ...
> 
> > @@ -296,7 +460,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
> >  	struct sk_buff *trailer;
> >  	int tailen = esp->tailen;
> >  
> > -	/* this is non-NULL only with UDP Encapsulation */
> > +	/* this is non-NULL only with TCP/UDP Encapsulation */
> >  	if (x->encap) {
> >  		int err = esp_output_encap(x, skb, esp);
> >  
> > @@ -491,6 +655,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
> >  	if (sg != dsg)
> >  		esp_ssg_unref(x, tmp);
> >  
> > +#ifdef CONFIG_XFRM_ESPINTCP
> > +	if (!err && x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
> > +		err = esp_output_tail_tcp(x, skb);
> > +#endif
> > +
> >  error_free:
> >  	kfree(tmp);
> >  error:
> > @@ -617,10 +786,16 @@ int esp_input_done2(struct sk_buff *skb, int err)
> >  
> >  	if (x->encap) {
> >  		struct xfrm_encap_tmpl *encap = x->encap;
> > +		struct tcphdr *th = (void *)(skb_network_header(skb) + ihl);
> 
> This gives a unused variable warning if CONFIG_XFRM_ESPINTCP
> is not set.

Oops, will fix. Or that will take care of itself if I just remove the ifdef below.

> >  		struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
> >  		__be16 source;
> >  
> >  		switch (x->encap->encap_type) {
> > +#ifdef CONFIG_XFRM_ESPINTCP

(this one)

> > +		case TCP_ENCAP_ESPINTCP:
> > +			source = th->source;
> > +			break;
> > +#endif
> >  		case UDP_ENCAP_ESPINUDP:
> >  		case UDP_ENCAP_ESPINUDP_NON_IKE:
> >  			source = uh->source;
> > @@ -1017,6 +1192,14 @@ static int esp_init_state(struct xfrm_state *x)
> >  		case UDP_ENCAP_ESPINUDP_NON_IKE:
> >  			x->props.header_len += sizeof(struct udphdr) + 2 * sizeof(u32);
> >  			break;
> > +#ifdef CONFIG_XFRM_ESPINTCP
> > +		case TCP_ENCAP_ESPINTCP:
> > +			/* only the length field, TCP encap is done by
> > +			 * the socket
> > +			 */
> > +			x->props.header_len += 2;
> > +			break;
> > +#endif
> >  		}
> >  	}
> >  
> > diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
> > index c967fc3c38c8..ccc012b3ea10 100644
> > --- a/net/xfrm/Kconfig
> > +++ b/net/xfrm/Kconfig
> > @@ -71,6 +71,15 @@ config XFRM_IPCOMP
> >  	select CRYPTO
> >  	select CRYPTO_DEFLATE
> >  
> > +config XFRM_ESPINTCP
> > +	bool "ESP in TCP encapsulation (RFC 8229)"
> > +	depends on XFRM && INET_ESP
> > +	select STREAM_PARSER
> 
> I'm getting these compile errors:
> 
> espintcp.o: In function `espintcp_close':
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:469: undefined reference to `sk_msg_free'
> net/xfrm/espintcp.o: In function `espintcp_sendmsg':
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:302: undefined reference to `sk_msg_alloc'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:316: undefined reference to `sk_msg_memcopy_from_iter'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:341: undefined reference to `sk_msg_free'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:321: undefined reference to `sk_msg_memcopy_from_iter'
> /home/klassert/git/linux-stk/Makefile:1067: recipe for target 'vmlinux' failed
> make[1]: *** [vmlinux] Error 1
> 
> I guess you need to select NET_SOCK_MSG.

Right, I'll go fix the Kconfig entry.

> Btw. I've updated the ipsec-next tree, so patch 1 is not needed anymore.

Cool, I'll drop it.

> Everything else looks good!

Thanks for the review.

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH v2 2/6] mdev: Make mdev alias unique among all mdevs
From: Yunsheng Lin @ 2019-08-29 12:31 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev
In-Reply-To: <20190829111904.16042-3-parav@mellanox.com>

On 2019/8/29 19:19, Parav Pandit wrote:
> Mdev alias should be unique among all the mdevs, so that when such alias
> is used by the mdev users to derive other objects, there is no
> collision in a given system.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> 
> ---
> Changelog:
> v1->v2:
>  - Moved alias NULL check at beginning
> v0->v1:
>  - Fixed inclusiong of alias for NULL check
>  - Added ratelimited debug print for sha1 hash collision error
> ---
>  drivers/vfio/mdev/mdev_core.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3bdff0469607..c9bf2ac362b9 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -388,6 +388,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
>  			ret = -EEXIST;
>  			goto mdev_fail;
>  		}
> +		if (alias && tmp->alias && strcmp(alias, tmp->alias) == 0) {

!strcmp(alias, tmp->alias) is a more common kernel pattern.

Also if we limit max len of the alias in patch 1, maybe use that to limit the
string compare too.

> +			mutex_unlock(&mdev_list_lock);
> +			ret = -EEXIST;
> +			dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
> +					    uuid);
> +			goto mdev_fail;
> +		}
>  	}
>  
>  	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> 


^ permalink raw reply

* Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
From: Yunsheng Lin @ 2019-08-29 12:26 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev
In-Reply-To: <20190829111904.16042-2-parav@mellanox.com>

On 2019/8/29 19:18, Parav Pandit wrote:
> Some vendor drivers want an identifier for an mdev device that is
> shorter than the UUID, due to length restrictions in the consumers of
> that identifier.
> 
> Add a callback that allows a vendor driver to request an alias of a
> specified length to be generated for an mdev device. If generated,
> that alias is checked for collisions.
> 
> It is an optional attribute.
> mdev alias is generated using sha1 from the mdev name.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> 
> ---
> Changelog:
> v1->v2:
>  - Kept mdev_device naturally aligned
>  - Added error checking for crypt_*() calls
>  - Corrected a typo from 'and' to 'an'
>  - Changed return type of generate_alias() from int to char*
> v0->v1:
>  - Moved alias length check outside of the parent lock
>  - Moved alias and digest allocation from kvzalloc to kzalloc
>  - &alias[0] changed to alias
>  - alias_length check is nested under get_alias_length callback check
>  - Changed comments to start with an empty line
>  - Fixed cleaunup of hash if mdev_bus_register() fails
>  - Added comment where alias memory ownership is handed over to mdev device
>  - Updated commit log to indicate motivation for this feature
> ---
>  drivers/vfio/mdev/mdev_core.c    | 123 ++++++++++++++++++++++++++++++-
>  drivers/vfio/mdev/mdev_private.h |   5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
>  include/linux/mdev.h             |   4 +
>  4 files changed, 135 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..3bdff0469607 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -10,9 +10,11 @@
>  #include <linux/module.h>
>  #include <linux/device.h>
>  #include <linux/slab.h>
> +#include <linux/mm.h>
>  #include <linux/uuid.h>
>  #include <linux/sysfs.h>
>  #include <linux/mdev.h>
> +#include <crypto/hash.h>
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +static struct crypto_shash *alias_hash;
> +
>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>  	return mdev->parent->dev;
> @@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>  		return -EINVAL;
>  
> +	if (ops->get_alias_length) {
> +		unsigned int digest_size;
> +		unsigned int aligned_len;
> +
> +		aligned_len = roundup(ops->get_alias_length(), 2);
> +		digest_size = crypto_shash_digestsize(alias_hash);
> +		if (aligned_len / 2 > digest_size)
> +			return -EINVAL;
> +	}
> +
>  	dev = get_device(dev);
>  	if (!dev)
>  		return -EINVAL;
> @@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
>  	mutex_unlock(&mdev_list_lock);
>  
>  	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev->alias);
>  	kfree(mdev);
>  }
>  
> @@ -269,18 +284,101 @@ static void mdev_device_release(struct device *dev)
>  	mdev_device_free(mdev);
>  }
>  
> -int mdev_device_create(struct kobject *kobj,
> -		       struct device *dev, const guid_t *uuid)
> +static const char *
> +generate_alias(const char *uuid, unsigned int max_alias_len)
> +{
> +	struct shash_desc *hash_desc;
> +	unsigned int digest_size;
> +	unsigned char *digest;
> +	unsigned int alias_len;
> +	char *alias;
> +	int ret;
> +
> +	/*
> +	 * Align to multiple of 2 as bin2hex will generate
> +	 * even number of bytes.
> +	 */
> +	alias_len = roundup(max_alias_len, 2);
> +	alias = kzalloc(alias_len + 1, GFP_KERNEL);

It seems the mtty_alias_length in mtty.c can be set from module
parameter, and user can set a very large number, maybe limit the
max of the alias_len before calling kzalloc?

> +	if (!alias)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Allocate and init descriptor */
> +	hash_desc = kvzalloc(sizeof(*hash_desc) +
> +			     crypto_shash_descsize(alias_hash),
> +			     GFP_KERNEL);
> +	if (!hash_desc) {
> +		ret = -ENOMEM;
> +		goto desc_err;
> +	}
> +
> +	hash_desc->tfm = alias_hash;
> +
> +	digest_size = crypto_shash_digestsize(alias_hash);
> +
> +	digest = kzalloc(digest_size, GFP_KERNEL);
> +	if (!digest) {
> +		ret = -ENOMEM;
> +		goto digest_err;
> +	}
> +	ret = crypto_shash_init(hash_desc);
> +	if (ret)
> +		goto hash_err;
> +
> +	ret = crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> +	if (ret)
> +		goto hash_err;
> +
> +	ret = crypto_shash_final(hash_desc, digest);
> +	if (ret)
> +		goto hash_err;
> +
> +	bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
> +	/*
> +	 * When alias length is odd, zero out an additional last byte
> +	 * that bin2hex has copied.
> +	 */
> +	if (max_alias_len % 2)
> +		alias[max_alias_len] = 0;
> +
> +	kfree(digest);
> +	kvfree(hash_desc);
> +	return alias;
> +
> +hash_err:
> +	kfree(digest);
> +digest_err:
> +	kvfree(hash_desc);
> +desc_err:
> +	kfree(alias);
> +	return ERR_PTR(ret);
> +}
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid)
>  {
>  	int ret;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	const char *alias = NULL;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	if (parent->ops->get_alias_length) {
> +		unsigned int alias_len;
> +
> +		alias_len = parent->ops->get_alias_length();
> +		if (alias_len) {
> +			alias = generate_alias(uuid_str, alias_len);
> +			if (IS_ERR(alias)) {
> +				ret = PTR_ERR(alias);
> +				goto alias_fail;
> +			}
> +		}
> +	}
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -300,6 +398,12 @@ int mdev_device_create(struct kobject *kobj,
>  	}
>  
>  	guid_copy(&mdev->uuid, uuid);
> +	mdev->alias = alias;
> +	/*
> +	 * At this point alias memory is owned by the mdev.
> +	 * Mark it NULL, so that only mdev can free it.
> +	 */
> +	alias = NULL;
>  	list_add(&mdev->next, &mdev_list);
>  	mutex_unlock(&mdev_list_lock);
>  
> @@ -346,6 +450,8 @@ int mdev_device_create(struct kobject *kobj,
>  	up_read(&parent->unreg_sem);
>  	put_device(&mdev->dev);
>  mdev_fail:
> +	kfree(alias);
> +alias_fail:
>  	mdev_put_parent(parent);
>  	return ret;
>  }
> @@ -406,7 +512,17 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
>  
>  static int __init mdev_init(void)
>  {
> -	return mdev_bus_register();
> +	int ret;
> +
> +	alias_hash = crypto_alloc_shash("sha1", 0, 0);
> +	if (!alias_hash)
> +		return -ENOMEM;
> +
> +	ret = mdev_bus_register();
> +	if (ret)
> +		crypto_free_shash(alias_hash);
> +
> +	return ret;
>  }
>  
>  static void __exit mdev_exit(void)
> @@ -415,6 +531,7 @@ static void __exit mdev_exit(void)
>  		class_compat_unregister(mdev_bus_compat_class);
>  
>  	mdev_bus_unregister();
> +	crypto_free_shash(alias_hash);
>  }
>  
>  module_init(mdev_init)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..078fdaf7836e 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -32,6 +32,7 @@ struct mdev_device {
>  	struct list_head next;
>  	struct kobject *type_kobj;
>  	struct device *iommu_device;
> +	const char *alias;
>  	bool active;
>  };
>  
> @@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
>  int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
> -int  mdev_device_create(struct kobject *kobj,
> -			struct device *dev, const guid_t *uuid);
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid);
>  int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 7570c7602ab4..43afe0e80b76 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
>  		return -ENOMEM;
>  
>  	ret = guid_parse(str, &uuid);
> -	kfree(str);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	ret = mdev_device_create(kobj, dev, &uuid);
> +	ret = mdev_device_create(kobj, dev, str, &uuid);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	return count;
> +	ret = count;
> +
> +err:
> +	kfree(str);
> +	return ret;
>  }
>  
>  MDEV_TYPE_ATTR_WO(create);
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..f036fe9854ee 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
>   * @mmap:		mmap callback
>   *			@mdev: mediated device structure
>   *			@vma: vma structure
> + * @get_alias_length:	Generate alias for the mdevs of this parent based on the
> + *			mdev device name when it returns non zero alias length.
> + *			It is optional.
>   * Parent device that support mediated device should be registered with mdev
>   * module with mdev_parent_ops structure.
>   **/
> @@ -92,6 +95,7 @@ struct mdev_parent_ops {
>  	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
>  			 unsigned long arg);
>  	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	unsigned int (*get_alias_length)(void);
>  };
>  
>  /* interface for exporting mdev supported type attributes */
> 


^ permalink raw reply

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Shmulik Ladkani @ 2019-08-29 12:22 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal
In-Reply-To: <88a3da53-fecc-0d8c-56dc-a4c3b0e11dfd@iogearbox.net>

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.
> 

Thanks Daniel.

Managed to write a reproducer which mimics the skb we see on prodction,
that hits the exact same BUG_ON.

Submitted as a separate RFC PATCH to bpf-next.
Tested on v5.0.y (and fwd ported to net-next for submission).

Daniel, please use this reproducer.

Do note that the test assigns:

+	skb_shinfo(skb[0])->gso_size = 1288;

which is the *mangled* gso_size value, to mimic the works of
bpf_skb_proto_4_to_6().

When setting 'gso_size = 1288 + 20' (the *original* gso_size of the
GROed skb prior bpf_skb_proto_4_to_6), the test passes successfully and
we don't hit the mentioned BUG_ON.

Best,
Shmulik

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Jiri Pirko @ 2019-08-29 12:18 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: alexandre.belloni, UNGLinuxDriver, davem, andrew, allan.nielsen,
	ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829105650.btgvytgja63sx6wx@soft-dev3.microsemi.net>

Thu, Aug 29, 2019 at 12:56:54PM CEST, horatiu.vultur@microchip.com wrote:
>The 08/29/2019 11:51, Jiri Pirko wrote:
>> External E-Mail
>> 
>> 
>> Thu, Aug 29, 2019 at 11:22:28AM CEST, horatiu.vultur@microchip.com wrote:
>> >Add the SWITCHDEV_ATTR_ID_PORT_PROMISCUITY switchdev notification type,
>> >used to indicate whenever the dev promiscuity counter is changed.
>> >
>> >The notification doesn't use any switchdev_attr attribute because in the
>> >notifier callbacks is it possible to get the dev and read directly
>> >the promiscuity value.
>> >
>> >Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
>> >---
>> > include/net/switchdev.h | 1 +
>> > net/core/dev.c          | 9 +++++++++
>> > 2 files changed, 10 insertions(+)
>> >
>> >diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>> >index aee86a1..14b1617 100644
>> >--- a/include/net/switchdev.h
>> >+++ b/include/net/switchdev.h
>> >@@ -40,6 +40,7 @@ enum switchdev_attr_id {
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_MROUTER,
>> >+	SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
>> > };
>> > 
>> > struct switchdev_attr {
>> >diff --git a/net/core/dev.c b/net/core/dev.c
>> >index 49589ed..40c74f2 100644
>> >--- a/net/core/dev.c
>> >+++ b/net/core/dev.c
>> >@@ -142,6 +142,7 @@
>> > #include <linux/net_namespace.h>
>> > #include <linux/indirect_call_wrapper.h>
>> > #include <net/devlink.h>
>> >+#include <net/switchdev.h>
>> > 
>> > #include "net-sysfs.h"
>> > 
>> >@@ -7377,6 +7378,11 @@ static void dev_change_rx_flags(struct net_device *dev, int flags)
>> > static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
>> > {
>> > 	unsigned int old_flags = dev->flags;
>> >+	struct switchdev_attr attr = {
>> >+		.orig_dev = dev,
>> >+		.id = SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
>> >+		.flags = SWITCHDEV_F_DEFER,
>> 
>
>Hi Jiri,
>
>> NACK
>> 
>> This is invalid usecase for switchdev infra. Switchdev is there for
>> bridge offload purposes only.
>> 
>> For promiscuity changes, the infrastructure is already present in the
>> code. See __dev_notify_flags(). it calls:
>> call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info)
>> and you can actually see the changed flag in ".flags_changed".
>Yes, you are right. But in case the port is part of a bridge and then
>enable promisc mode by a user space application(tpcdump) then the drivers

If the promisc is on, it is on. Why do you need to know who enabled it?

Or do you want to use this to ask driver to ask hw to trap packets to
kernel? If yes, I don't think it is correct. If you want to "steal" some
packets from the hw forwarding datapath, use TC action "trap".


>will not be notified. The reason is that the dev->flags will still be the
>same(because IFF_PROMISC was already set) and only promiscuity will be
>changed.
>
>One fix could be to call __dev_notify_flags() no matter when the
>promisc is enable or disabled.
>
>> 
>> You just have to register netdev notifier block in your driver. Grep
>> for: register_netdevice_notifier
>> 
>> 
>> >+	};
>> > 	kuid_t uid;
>> > 	kgid_t gid;
>> > 
>> >@@ -7419,6 +7425,9 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
>> > 	}
>> > 	if (notify)
>> > 		__dev_notify_flags(dev, old_flags, IFF_PROMISC);
>> >+
>> >+	switchdev_port_attr_set(dev, &attr);
>> >+
>> > 	return 0;
>> > }
>> > 
>> >-- 
>> >2.7.4
>> >
>> 
>
>-- 
>/Horatiu

^ permalink raw reply

* Re: KASAN: use-after-free Read in rxrpc_put_peer
From: David Howells @ 2019-08-29 12:17 UTC (permalink / raw)
  To: syzbot; +Cc: dhowells, davem, linux-afs, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000f6a13b059132aa6c@google.com>

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git 48b9e92aeb3c2b0df3454faf9024f6ca611d65a3

^ permalink raw reply

* [RFC PATCH bpf-next 2/2] test_bpf: Introduce 'gso_linear_no_head_frag' skb_segment test
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani
In-Reply-To: <20190829121344.20675-1-shmulik.ladkani@gmail.com>

Following reports of skb_segment() hitting a BUG_ON when working on
GROed skbs which have their gso_size mangled (e.g. after a
bpf_skb_change_proto call), add a reproducer test that mimics the
input skbs that lead to the mentioned BUG_ON as in [1].

[1] https://lists.openwall.net/netdev/2019/08/26/110

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
 lib/test_bpf.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 5e80cb3d3ca0..2fe1e3ab3c89 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6859,6 +6859,60 @@ static __init struct sk_buff *build_test_skb(void)
 	return NULL;
 }
 
+static __init struct sk_buff *build_test_skb_linear_no_head_frag(void)
+{
+	unsigned int alloc_size = 2000;
+	unsigned int headroom = 102, doffset = 72, data_size = 1308;
+	struct sk_buff *skb[2];
+	int i;
+
+	/* skbs linked in a frag_list, both with linear data, with head_frag=0
+	 * (data allocated by kmalloc), both have tcp data of 1308 bytes
+	 * (total payload is 2616 bytes).
+	 * Data offset is 72 bytes (40 ipv6 hdr, 32 tcp hdr). Some headroom.
+	 */
+	for (i = 0; i < 2; i++) {
+		skb[i] = alloc_skb(alloc_size, GFP_KERNEL);
+		if (!skb[i]) {
+			if (i == 0)
+				goto err_skb0;
+			else
+				goto err_skb1;
+		}
+
+		skb[i]->protocol = htons(ETH_P_IPV6);
+		skb_reserve(skb[i], headroom);
+		skb_put(skb[i], doffset + data_size);
+		skb_reset_network_header(skb[i]);
+		if (i == 0)
+			skb_reset_mac_header(skb[i]);
+		else
+			skb_set_mac_header(skb[i], -ETH_HLEN);
+		__skb_pull(skb[i], doffset);
+	}
+
+	/* setup shinfo.
+	 * mimic bpf_skb_proto_4_to_6, which resets gso_segs and assigns a
+	 * reduced gso_size.
+	 */
+	skb_shinfo(skb[0])->gso_size = 1288;
+	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV6 | SKB_GSO_DODGY;
+	skb_shinfo(skb[0])->gso_segs = 0;
+	skb_shinfo(skb[0])->frag_list = skb[1];
+
+	/* adjust skb[0]'s len */
+	skb[0]->len += skb[1]->len;
+	skb[0]->data_len += skb[1]->len;
+	skb[0]->truesize += skb[1]->truesize;
+
+	return skb[0];
+
+err_skb1:
+	kfree_skb(skb[0]);
+err_skb0:
+	return NULL;
+}
+
 struct skb_segment_test {
 	const char *descr;
 	struct sk_buff *(*build_skb)(void);
@@ -6871,6 +6925,15 @@ static struct skb_segment_test skb_segment_tests[] __initconst = {
 		.build_skb = build_test_skb,
 		.features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
 			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM
+	},
+	{
+		.descr = "gso_linear_no_head_frag",
+		.build_skb = build_test_skb_linear_no_head_frag,
+		.features = NETIF_F_SG | NETIF_F_FRAGLIST |
+			    NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_GSO |
+			    NETIF_F_LLTX_BIT | NETIF_F_GRO |
+			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM |
+			    NETIF_F_HW_VLAN_STAG_TX_BIT
 	}
 };
 
-- 
2.19.1


^ permalink raw reply related

* [PATCH bpf-next 1/2] test_bpf: Refactor test_skb_segment() to allow testing skb_segment() on numerous different skbs
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani
In-Reply-To: <20190829121344.20675-1-shmulik.ladkani@gmail.com>

Currently, test_skb_segment() builds a single test skb and runs
skb_segment() on it.

Extend test_skb_segment() so it processes an array of numerous
skb/feature pairs to test.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
 lib/test_bpf.c | 51 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index c41705835cba..5e80cb3d3ca0 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6859,34 +6859,65 @@ static __init struct sk_buff *build_test_skb(void)
 	return NULL;
 }
 
-static __init int test_skb_segment(void)
-{
+struct skb_segment_test {
+	const char *descr;
+	struct sk_buff *(*build_skb)(void);
 	netdev_features_t features;
+};
+
+static struct skb_segment_test skb_segment_tests[] __initconst = {
+	{
+		.descr = "gso_with_rx_frags",
+		.build_skb = build_test_skb,
+		.features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM
+	}
+};
+
+static __init int test_skb_segment_single(const struct skb_segment_test *test)
+{
 	struct sk_buff *skb, *segs;
 	int ret = -1;
 
-	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
-		   NETIF_F_IPV6_CSUM;
-	features |= NETIF_F_RXCSUM;
-	skb = build_test_skb();
+	skb = test->build_skb();
 	if (!skb) {
 		pr_info("%s: failed to build_test_skb", __func__);
 		goto done;
 	}
 
-	segs = skb_segment(skb, features);
+	segs = skb_segment(skb, test->features);
 	if (!IS_ERR(segs)) {
 		kfree_skb_list(segs);
 		ret = 0;
-		pr_info("%s: success in skb_segment!", __func__);
-	} else {
-		pr_info("%s: failed in skb_segment!", __func__);
 	}
 	kfree_skb(skb);
 done:
 	return ret;
 }
 
+static __init int test_skb_segment(void)
+{
+	int i, err_cnt = 0, pass_cnt = 0;
+
+	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
+		const struct skb_segment_test *test = &skb_segment_tests[i];
+
+		pr_info("#%d %s ", i, test->descr);
+
+		if (test_skb_segment_single(test)) {
+			pr_cont("FAIL\n");
+			err_cnt++;
+		} else {
+			pr_cont("PASS\n");
+			pass_cnt++;
+		}
+	}
+
+	pr_info("%s: Summary: %d PASSED, %d FAILED\n", __func__,
+		pass_cnt, err_cnt);
+	return err_cnt ? -EINVAL : 0;
+}
+
 static __init int test_bpf(void)
 {
 	int i, err_cnt = 0, pass_cnt = 0;
-- 
2.19.1


^ permalink raw reply related

* [RFC PATCH bpf-next 0/2] test_bpf: Add an skb_segment test for a linear head_frag=0 skb whose gso_size was mangled
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani

Following reports of skb_segment() hitting a BUG_ON when working on
GROed skbs which have their gso_size mangled (e.g. after a
bpf_skb_change_proto call), add a reproducer test that mimics the
input skbs that lead to the mentioned BUG_ON as in [1].

Currently, this new test *INDEED* hits the BUG_ON, therefore the final
patch is to be applied only after skb_segment is fixed.

When changing the gso_size of the test skb to its original value (+20
bytes) then the test passes successfully.

[1] https://lists.openwall.net/netdev/2019/08/26/110

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yonghong Song <yhs@fb.com> 

Shmulik Ladkani (2):
  test_bpf: Refactor test_skb_segment() to allow testing skb_segment()
    on numerous different skbs
  test_bpf: Introduce 'gso_linear_no_head_frag' skb_segment test

 lib/test_bpf.c | 112 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 103 insertions(+), 9 deletions(-)

-- 
2.19.1

^ permalink raw reply

* Re: KASAN: use-after-free Read in rxrpc_send_keepalive
From: syzbot @ 2019-08-29 12:10 UTC (permalink / raw)
  To: davem, dhowells, linux-afs, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000e695c1058fb26925@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    ed2393ca Add linux-next specific files for 20190827
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=156adb1e600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2ef5940a07ed45f4
dashboard link: https://syzkaller.appspot.com/bug?extid=d850c266e3df14da1d31
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=167ab582600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d850c266e3df14da1d31@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): hsr0: link becomes ready
==================================================================
BUG: KASAN: use-after-free in rxrpc_send_keepalive+0x8a2/0x940  
net/rxrpc/output.c:634
Read of size 8 at addr ffff888086b01218 by task kworker/0:1/12

CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.3.0-rc6-next-20190827 #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: krxrpcd rxrpc_peer_keepalive_worker
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
  kasan_report+0x12/0x20 mm/kasan/common.c:634
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
  rxrpc_send_keepalive+0x8a2/0x940 net/rxrpc/output.c:634
  rxrpc_peer_keepalive_dispatch net/rxrpc/peer_event.c:369 [inline]
  rxrpc_peer_keepalive_worker+0x7be/0xd02 net/rxrpc/peer_event.c:430
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 8741:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:510 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:483
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:524
  __do_kmalloc mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3664
  kmalloc_array include/linux/slab.h:614 [inline]
  kcalloc include/linux/slab.h:625 [inline]
  alloc_pipe_info+0x199/0x420 fs/pipe.c:676
  get_pipe_inode fs/pipe.c:738 [inline]
  create_pipe_files+0x8e/0x730 fs/pipe.c:770
  __do_pipe_flags+0x48/0x250 fs/pipe.c:807
  do_pipe2+0x84/0x160 fs/pipe.c:855
  __do_sys_pipe2 fs/pipe.c:873 [inline]
  __se_sys_pipe2 fs/pipe.c:871 [inline]
  __x64_sys_pipe2+0x54/0x80 fs/pipe.c:871
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8741:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  kasan_set_free_info mm/kasan/common.c:332 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:471
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  free_pipe_info+0x243/0x300 fs/pipe.c:709
  put_pipe_info+0xd0/0xf0 fs/pipe.c:582
  pipe_release+0x1e6/0x280 fs/pipe.c:603
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x65f/0x760 arch/x86/entry/common.c:300
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888086b01200
  which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 24 bytes inside of
  1024-byte region [ffff888086b01200, ffff888086b01600)
The buggy address belongs to the page:
page:ffffea00021ac000 refcount:1 mapcount:0 mapping:ffff8880aa400c40  
index:0xffff888086b00480 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00027b5588 ffffea00028e3808 ffff8880aa400c40
raw: ffff888086b00480 ffff888086b00000 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888086b01100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888086b01180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff888086b01200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
  ffff888086b01280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888086b01300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
From: Vladimir Oltean @ 2019-08-29 11:50 UTC (permalink / raw)
  To: Florian Fainelli, Vivien Didelot, Andrew Lunn, Ido Schimmel,
	Roopa Prabhu, nikolay, David S. Miller
  Cc: netdev
In-Reply-To: <20190825184454.14678-3-olteanv@gmail.com>

Vivien,

On Sun, 25 Aug 2019 at 21:46, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> The bridge core assumes that enabling/disabling vlan_filtering will
> translate into the simple toggling of a flag for switchdev drivers.
>
> That is clearly not the case for sja1105, which alters the VLAN table
> and the pvids in order to obtain port separation in standalone mode.
>
> There are 2 parts to the issue.
>
> First, tag_8021q changes the pvid to a unique per-port rx_vid for frame
> identification. But we need to disable tag_8021q when vlan_filtering
> kicks in, and at that point, the VLAN configured as pvid will have to be
> removed from the filtering table of the ports. With an invalid pvid, the
> ports will drop all traffic.  Since the bridge will not call any vlan
> operation through switchdev after enabling vlan_filtering, we need to
> ensure we're in a functional state ourselves. Hence read the pvid that
> the bridge is aware of, and program that into our ports.
>
> Secondly, tag_8021q uses the 1024-3071 range privately in
> vlan_filtering=0 mode. Had the user installed one of these VLANs during
> a previous vlan_filtering=1 session, then upon the next tag_8021q
> cleanup for vlan_filtering to kick in again, VLANs in that range will
> get deleted unconditionally, hence breaking user expectation. So when
> deleting the VLANs, check if the bridge had knowledge about them, and if
> it did, re-apply the settings. Wrap this logic inside a
> dsa_8021q_vid_apply helper function to reduce code duplication.
>
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> ---
>  net/dsa/tag_8021q.c | 91 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 71 insertions(+), 20 deletions(-)
>
> diff --git a/net/dsa/tag_8021q.c b/net/dsa/tag_8021q.c
> index 67a1bc635a7b..81f943e365b9 100644
> --- a/net/dsa/tag_8021q.c
> +++ b/net/dsa/tag_8021q.c
> @@ -93,6 +93,68 @@ int dsa_8021q_rx_source_port(u16 vid)
>  }
>  EXPORT_SYMBOL_GPL(dsa_8021q_rx_source_port);
>
> +static int dsa_8021q_restore_pvid(struct dsa_switch *ds, int port)
> +{
> +       struct bridge_vlan_info vinfo;
> +       struct net_device *slave;
> +       u16 pvid;
> +       int err;
> +
> +       if (!dsa_is_user_port(ds, port))
> +               return 0;
> +
> +       slave = ds->ports[port].slave;
> +
> +       err = br_vlan_get_pvid(slave, &pvid);
> +       if (err < 0) {
> +               dev_err(ds->dev, "Couldn't determine bridge PVID\n");
> +               return err;
> +       }
> +
> +       err = br_vlan_get_info(slave, pvid, &vinfo);
> +       if (err < 0) {
> +               dev_err(ds->dev, "Couldn't determine PVID attributes\n");
> +               return err;
> +       }
> +
> +       return dsa_port_vid_add(&ds->ports[port], pvid, vinfo.flags);

If the bridge had installed a dsa_8021q VLAN here, I need to use the
dsa_slave_vid_add logic to restore it. The dsa_8021q flags on the CPU
port are "ingress tagged", but that may not be the case for the bridge
VLAN.
Should I expose dsa_slave_vlan_add in dsa_priv.h, or should I just
open-code another dsa_port_vid_add for dp->cpu_dp, duplicating a bit
of code from dsa_slave_vlan_add?

> +}
> +
> +/* If @enabled is true, installs @vid with @flags into the switch port's HW
> + * filter.
> + * If @enabled is false, deletes @vid (ignores @flags) from the port. Had the
> + * user explicitly configured this @vid through the bridge core, then the @vid
> + * is installed again, but this time with the flags from the bridge layer.
> + */
> +static int dsa_8021q_vid_apply(struct dsa_switch *ds, int port, u16 vid,
> +                              u16 flags, bool enabled)
> +{
> +       struct dsa_port *dp = &ds->ports[port];
> +       struct bridge_vlan_info vinfo;
> +       int err;
> +
> +       if (enabled)
> +               return dsa_port_vid_add(dp, vid, flags);
> +
> +       err = dsa_port_vid_del(dp, vid);
> +       if (err < 0)
> +               return err;
> +
> +       /* Nothing to restore from the bridge for a non-user port */
> +       if (!dsa_is_user_port(ds, port))
> +               return 0;
> +
> +       err = br_vlan_get_info(dp->slave, vid, &vinfo);
> +       /* Couldn't determine bridge attributes for this vid,
> +        * it means the bridge had not configured it.
> +        */
> +       if (err < 0)
> +               return 0;
> +
> +       /* Restore the VID from the bridge */
> +       return dsa_port_vid_add(dp, vid, vinfo.flags);
> +}
> +
>  /* RX VLAN tagging (left) and TX VLAN tagging (right) setup shown for a single
>   * front-panel switch port (here swp0).
>   *
> @@ -148,8 +210,6 @@ EXPORT_SYMBOL_GPL(dsa_8021q_rx_source_port);
>  int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>  {
>         int upstream = dsa_upstream_port(ds, port);
> -       struct dsa_port *dp = &ds->ports[port];
> -       struct dsa_port *upstream_dp = &ds->ports[upstream];
>         u16 rx_vid = dsa_8021q_rx_vid(ds, port);
>         u16 tx_vid = dsa_8021q_tx_vid(ds, port);
>         int i, err;
> @@ -166,7 +226,6 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>          * restrictions, so there are no concerns about leaking traffic.
>          */
>         for (i = 0; i < ds->num_ports; i++) {
> -               struct dsa_port *other_dp = &ds->ports[i];
>                 u16 flags;
>
>                 if (i == upstream)
> @@ -179,10 +238,7 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>                         /* The RX VID is a regular VLAN on all others */
>                         flags = BRIDGE_VLAN_INFO_UNTAGGED;
>
> -               if (enabled)
> -                       err = dsa_port_vid_add(other_dp, rx_vid, flags);
> -               else
> -                       err = dsa_port_vid_del(other_dp, rx_vid);
> +               err = dsa_8021q_vid_apply(ds, i, rx_vid, flags, enabled);
>                 if (err) {
>                         dev_err(ds->dev, "Failed to apply RX VID %d to port %d: %d\n",
>                                 rx_vid, port, err);
> @@ -193,10 +249,7 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>         /* CPU port needs to see this port's RX VID
>          * as tagged egress.
>          */
> -       if (enabled)
> -               err = dsa_port_vid_add(upstream_dp, rx_vid, 0);
> -       else
> -               err = dsa_port_vid_del(upstream_dp, rx_vid);
> +       err = dsa_8021q_vid_apply(ds, upstream, rx_vid, 0, enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply RX VID %d to port %d: %d\n",
>                         rx_vid, port, err);
> @@ -204,26 +257,24 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>         }
>
>         /* Finally apply the TX VID on this port and on the CPU port */
> -       if (enabled)
> -               err = dsa_port_vid_add(dp, tx_vid, BRIDGE_VLAN_INFO_UNTAGGED);
> -       else
> -               err = dsa_port_vid_del(dp, tx_vid);
> +       err = dsa_8021q_vid_apply(ds, port, tx_vid, BRIDGE_VLAN_INFO_UNTAGGED,
> +                                 enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
>                         tx_vid, port, err);
>                 return err;
>         }
> -       if (enabled)
> -               err = dsa_port_vid_add(upstream_dp, tx_vid, 0);
> -       else
> -               err = dsa_port_vid_del(upstream_dp, tx_vid);
> +       err = dsa_8021q_vid_apply(ds, upstream, tx_vid, 0, enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
>                         tx_vid, upstream, err);
>                 return err;
>         }
>
> -       return 0;
> +       if (!enabled)
> +               err = dsa_8021q_restore_pvid(ds, port);
> +
> +       return err;
>  }
>  EXPORT_SYMBOL_GPL(dsa_port_setup_8021q_tagging);
>
> --
> 2.17.1
>

Thanks,
-Vladimir

^ permalink raw reply

* RE: [PATCH v2] bonding: force enable lacp port after link state recovery for 802.3ad
From: zhangsha (A) @ 2019-08-29 11:33 UTC (permalink / raw)
  To: David Miller
  Cc: j.vosburgh@gmail.com, vfalico@gmail.com, andy@greyhouse.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, yuehaibing,
	hunongda, Chenzhendong (alex)
In-Reply-To: <20190827.150456.509211205582645335.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: 2019年8月28日 6:05
> To: zhangsha (A) <zhangsha.zhang@huawei.com>
> Cc: j.vosburgh@gmail.com; vfalico@gmail.com; andy@greyhouse.net;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; yuehaibing
> <yuehaibing@huawei.com>; hunongda <hunongda@huawei.com>;
> Chenzhendong (alex) <alex.chen@huawei.com>
> Subject: Re: [PATCH v2] bonding: force enable lacp port after link state
> recovery for 802.3ad
> 
> From: <zhangsha.zhang@huawei.com>
> Date: Fri, 23 Aug 2019 11:42:09 +0800
> 
> > - If speed/duplex getting failed here, the link status
> >   will be changed to BOND_LINK_FAIL;
> 
> How does it fail at this step?  I suspect this is a driver specific problem.

Hi, David,
I'm really sorry for the delayed email and appreciated for your feedback.

I was testing in kernel 4.19 with a Huawei hinic card when the problem occurred.
I checked the dmesg and got the logs in the following order:
1) link status definitely down for interface eth6, disabling it
2) link status up again after 0 ms for interface eth6
3) the paterner's system mac becomes to "00:00:00:00:00:00".
By  reading the codes, I think that the link status of the slave should be changed
to BOND_LINK_FAIL from BOND_LINK_DOWN. 

As this problem has only occurred once only, I am not very sure about whether this is a
driver specific problem or not at the moment. But I find the commit 4d2c0cda, 
its log says " Some NIC drivers don't have correct speed/duplex settings at the
time they send NETDEV_UP notification ...",  so I prefer to believe it's not.

To my problem I think  it is not enough that link-monitoring (miimon) only set
SPEED/DUPLEX right, the lacp port should be enabled too at the same time.

^ permalink raw reply

* WARNING in __mark_chain_precision (2)
From: syzbot @ 2019-08-29 11:28 UTC (permalink / raw)
  To: ast, bpf, clang-built-linux, daniel, davem, hawk, jakub.kicinski,
	john.fastabend, kafai, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    47ee6e86 selftests/bpf: remove wrong nhoff in flow dissect..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16227fa6600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=c8d66267fd2b5955287e
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10d26ebc600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=127805ca600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com

------------[ cut here ]------------
verifier backtracking bug
WARNING: CPU: 0 PID: 8795 at kernel/bpf/verifier.c:1782  
__mark_chain_precision+0x197a/0x1ea0 kernel/bpf/verifier.c:1782
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 8795 Comm: syz-executor439 Not tainted 5.3.0-rc3+ #101
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:219
  __warn.cold+0x20/0x4c kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:__mark_chain_precision+0x197a/0x1ea0 kernel/bpf/verifier.c:1782
Code: 08 31 ff 89 de e8 a6 9e f2 ff 84 db 0f 85 07 ff ff ff e8 59 9d f2 ff  
48 c7 c7 a0 a7 91 87 c6 05 3d c6 21 08 01 e8 1e 10 c4 ff <0f> 0b 41 bc f2  
ff ff ff e9 e8 fe ff ff 48 8b bd d8 fe ff ff e8 cd
RSP: 0018:ffff88808609f380 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815c3ba6 RDI: ffffed1010c13e62
RBP: ffff88808609f4d0 R08: ffff8880a97940c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffff8880985b6cc0 R14: ffff88808e538e80 R15: ffff88808609f468
  mark_chain_precision kernel/bpf/verifier.c:1819 [inline]
  check_cond_jmp_op+0xcd8/0x3c30 kernel/bpf/verifier.c:5841
  do_check+0x63cd/0x8a30 kernel/bpf/verifier.c:7783
  bpf_check+0x6fff/0x99b2 kernel/bpf/verifier.c:9297
  bpf_prog_load+0xe68/0x1670 kernel/bpf/syscall.c:1709
  __do_sys_bpf+0xa44/0x3350 kernel/bpf/syscall.c:2860
  __se_sys_bpf kernel/bpf/syscall.c:2819 [inline]
  __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:2819
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4402a9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc52f10618 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402a9
RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401b30
R13: 0000000000401bc0 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* KASAN: use-after-free Read in nr_release (2)
From: syzbot @ 2019-08-29 11:28 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    6525771f Merge tag 'arc-5.3-rc7' of git://git.kernel.org/p..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11f44d82600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
dashboard link: https://syzkaller.appspot.com/bug?extid=fd05016a0b263a41eb33
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16c627ca600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+fd05016a0b263a41eb33@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in atomic_read  
include/asm-generic/atomic-instrumented.h:26 [inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero_checked+0x81/0x200  
lib/refcount.c:123
Read of size 4 at addr ffff888094849540 by task syz-executor.2/9817

CPU: 1 PID: 9817 Comm: syz-executor.2 Not tainted 5.3.0-rc6+ #128
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
  kasan_report+0x12/0x17 mm/kasan/common.c:618
  check_memory_region_inline mm/kasan/generic.c:185 [inline]
  check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
  __kasan_check_read+0x11/0x20 mm/kasan/common.c:92
  atomic_read include/asm-generic/atomic-instrumented.h:26 [inline]
  refcount_inc_not_zero_checked+0x81/0x200 lib/refcount.c:123
  refcount_inc_checked+0x17/0x70 lib/refcount.c:156
  sock_hold include/net/sock.h:649 [inline]
  nr_release+0x62/0x3e0 net/netrom/af_netrom.c:520
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x413561
Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48  
83 ec 08 e8 0a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48  
89 c2 e8 53 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffcd6d6ea60 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000413561
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000001 R08: ffffffffffffffff R09: ffffffffffffffff
R10: 00007ffcd6d6eb40 R11: 0000000000000293 R12: 000000000075c9a0
R13: 000000000075c9a0 R14: 0000000000761aa0 R15: ffffffffffffffff

Allocated by task 9818:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:493 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:466
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:507
  __do_kmalloc mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3664
  kmalloc include/linux/slab.h:557 [inline]
  sk_prot_alloc+0x23a/0x310 net/core/sock.c:1603
  sk_alloc+0x39/0xf70 net/core/sock.c:1657
  nr_create+0xb9/0x5e0 net/netrom/af_netrom.c:433
  __sock_create+0x3d8/0x730 net/socket.c:1418
  sock_create net/socket.c:1469 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1511
  __do_sys_socket net/socket.c:1520 [inline]
  __se_sys_socket net/socket.c:1518 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1518
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9817:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:455
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:463
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  sk_prot_free net/core/sock.c:1640 [inline]
  __sk_destruct+0x4f7/0x6e0 net/core/sock.c:1726
  sk_destruct+0x86/0xa0 net/core/sock.c:1734
  __sk_free+0xfb/0x360 net/core/sock.c:1745
  sk_free+0x42/0x50 net/core/sock.c:1756
  sock_put include/net/sock.h:1725 [inline]
  nr_release+0x356/0x3e0 net/netrom/af_netrom.c:554
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880948494c0
  which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 128 bytes inside of
  2048-byte region [ffff8880948494c0, ffff888094849cc0)
The buggy address belongs to the page:
page:ffffea0002521200 refcount:1 mapcount:0 mapping:ffff8880aa400e00  
index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00024ff508 ffffea0002560088 ffff8880aa400e00
raw: 0000000000000000 ffff8880948483c0 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888094849400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff888094849480: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff888094849500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
  ffff888094849580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888094849600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* RE: [PATCH] powerpc/kmcent2: update the ethernet devices' phy properties
From: Madalin-cristian Bucur @ 2019-08-29 11:25 UTC (permalink / raw)
  To: Scott Wood, Valentin Longchamp
  Cc: linuxppc-dev@lists.ozlabs.org, galak@kernel.crashing.org,
	netdev@vger.kernel.org
In-Reply-To: <b535dcfc3d6b06ca583dc26703d4adc958eacdd8.camel@buserror.net>

> -----Original Message-----
> From: Scott Wood <oss@buserror.net>
> Sent: Wednesday, August 28, 2019 7:19 AM
> To: Valentin Longchamp <valentin@longchamp.me>; Madalin-cristian Bucur
> <madalin.bucur@nxp.com>
> Cc: linuxppc-dev@lists.ozlabs.org; galak@kernel.crashing.org;
> netdev@vger.kernel.org
> Subject: Re: [PATCH] powerpc/kmcent2: update the ethernet devices' phy
> properties
> 
> On Thu, 2019-08-08 at 23:09 +0200, Valentin Longchamp wrote:
> > Le mar. 30 juil. 2019 à 11:44, Madalin-cristian Bucur
> > <madalin.bucur@nxp.com> a écrit :
> > >
> > > > -----Original Message-----
> > > >
> > > > > Le dim. 14 juil. 2019 à 22:05, Valentin Longchamp
> > > > > <valentin@longchamp.me> a écrit :
> > > > > >
> > > > > > Change all phy-connection-type properties to phy-mode that are
> > > > > > better
> > > > > > supported by the fman driver.
> > > > > >
> > > > > > Use the more readable fixed-link node for the 2 sgmii links.
> > > > > >
> > > > > > Change the RGMII link to rgmii-id as the clock delays are added
> by
> > > > > > the
> > > > > > phy.
> > > > > >
> > > > > > Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
> > > >
> > > > I don't see any other uses of phy-mode in arch/powerpc/boot/dts/fsl,
> and
> > > > I see
> > > > lots of phy-connection-type with fman.  Madalin, does this patch
> look
> > > > OK?
> > > >
> > > > -Scott
> > >
> > > Hi,
> > >
> > > we are using "phy-connection-type" not "phy-mode" for the NXP (former
> > > Freescale)
> > > DPAA platforms. While the two seem to be interchangeable ("phy-mode"
> seems
> > > to be
> > > more recent, looking at the device tree bindings), the driver code in
> > > Linux seems
> > > to use one or the other, not both so one should stick with the variant
> the
> > > driver
> > > is using. To make things more complex, there may be dependencies in
> > > bootloaders,
> > > I see code in u-boot using only "phy-connection-type" or only "phy-
> mode".
> > >
> > > I'd leave "phy-connection-type" as is.
> >
> > So I have finally had time to have a look and now I understand what
> > happens. You are right, there are bootloader dependencies: u-boot
> > calls fdt_fixup_phy_connection() that somehow in our case adds (or
> > changes if already in the device tree) the phy-connection-type
> > property to a wrong value ! By having a phy-mode in the device tree,
> > that is not changed by u-boot and by chance picked up by the kernel
> > fman driver (of_get_phy_mode() ) over phy-connection-mode, the below
> > patch fixes it for us.
> >
> > I agree with you, it's not correct to have both phy-connection-type
> > and phy-mode. Ideally, u-boot on the board should be reworked so that
> > it does not perform the above wrong fixup. However, in an "unfixed"
> > .dtb (I have disabled fdt_fixup_phy_connection), the device tree in
> > the end only has either phy-connection-type or phy-mode, according to
> > what was chosen in the .dts file. And the fman driver works well with
> > both (thanks to the call to of_get_phy_mode() ). I would therefore
> > argue that even if all other DPAA platforms use phy-connection-type,
> > phy-mode is valid as well. (Furthermore we already have hundreds of
> > such boards in the field and we don't really support "remote" u-boot
> > update, so the u-boot fix is going to be difficult for us to pull).
> >
> > Valentin
> 
> Madalin, are you OK with the patch given this explanation?
> 
> -Scott
> 

Yes, I understand that it's the only option they have, given the inability
to upgrade u-boot (this may prove to be an issue in the future, in other
situations).

Acked-by: Madalin Bucur <madalin.bucur@nxp.com>

^ permalink raw reply

* [PATCH v2 0/6] Introduce variable length mdev alias
From: Parav Pandit @ 2019-08-29 11:18 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>

To have consistent naming for the netdevice of a mdev and to have
consistent naming of the devlink port [1] of a mdev, which is formed using
phys_port_name of the devlink port, current UUID is not usable because
UUID is too long.

UUID in string format is 36-characters long and in binary 128-bit.
Both formats are not able to fit within 15 characters limit of netdev
name.

It is desired to have mdev device naming consistent using UUID.
So that widely used user space framework such as ovs [2] can make use
of mdev representor in similar way as PCIe SR-IOV VF and PF representors.

Hence,
(a) mdev alias is created which is derived using sha1 from the mdev name.
(b) Vendor driver describes how long an alias should be for the child mdev
created for a given parent.
(c) Mdev aliases are unique at system level.
(d) alias is created optionally whenever parent requested.
This ensures that non networking mdev parents can function without alias
creation overhead.

This design is discussed at [3].

An example systemd/udev extension will have,

1. netdev name created using mdev alias available in sysfs.

mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
mdev 12 character alias=cd5b146a80a5

netdev name of this mdev = enmcd5b146a80a5
Here en = Ethernet link
m = mediated device

2. devlink port phys_port_name created using mdev alias.
devlink phys_port_name=pcd5b146a80a5

This patchset enables mdev core to maintain unique alias for a mdev.

Patch-1 Introduces mdev alias using sha1.
Patch-2 Ensures that mdev alias is unique in a system.
Patch-3 Exposes mdev alias in a sysfs hirerchy.
Patch-4 Introduces mdev_alias() API.
Patch-5 Updated sysfs documentation
Patch-6 Extends mtty driver to optionally provide alias generation.
This also enables to test UUID based sha1 collision and trigger
error handling for duplicate sha1 results.

[1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
[2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
[3] https://patchwork.kernel.org/cover/11084231/

---
Changelog:
v1->v2:
 - Corrected a typo from 'and' to 'an'
 - Addressed comments from Alex Williamson
 - Kept mdev_device naturally aligned
 - Added error checking for crypt_*() calls
 - Moved alias NULL check at beginning
 - Added mdev_alias() API
 - Updated mtty driver to show example mdev_alias() usage
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Addressed comments from Alex Williamson, Cornelia Hunk and Mark Bloch
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Added comment where alias memory ownership is handed over to mdev device
 - Fixed cleaunup of hash if mdev_bus_register() fails
 - Updated documentation for new sysfs alias file
 - Improved commit logs to make description more clear
 - Fixed inclusiong of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error

Parav Pandit (6):
  mdev: Introduce sha1 based mdev alias
  mdev: Make mdev alias unique among all mdevs
  mdev: Expose mdev alias in sysfs tree
  mdev: Introduce an API mdev_alias
  mdev: Update sysfs documentation
  mtty: Optionally support mtty alias

 .../driver-api/vfio-mediated-device.rst       |   5 +
 drivers/vfio/mdev/mdev_core.c                 | 142 +++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h              |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c                |  26 +++-
 include/linux/mdev.h                          |   5 +
 samples/vfio-mdev/mtty.c                      |  13 ++
 6 files changed, 186 insertions(+), 10 deletions(-)

-- 
2.19.2

^ permalink raw reply

* [PATCH v2 2/6] mdev: Make mdev alias unique among all mdevs
From: Parav Pandit @ 2019-08-29 11:19 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190829111904.16042-1-parav@mellanox.com>

Mdev alias should be unique among all the mdevs, so that when such alias
is used by the mdev users to derive other objects, there is no
collision in a given system.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v1->v2:
 - Moved alias NULL check at beginning
v0->v1:
 - Fixed inclusiong of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error
---
 drivers/vfio/mdev/mdev_core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 3bdff0469607..c9bf2ac362b9 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -388,6 +388,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
 			ret = -EEXIST;
 			goto mdev_fail;
 		}
+		if (alias && tmp->alias && strcmp(alias, tmp->alias) == 0) {
+			mutex_unlock(&mdev_list_lock);
+			ret = -EEXIST;
+			dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
+					    uuid);
+			goto mdev_fail;
+		}
 	}
 
 	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
-- 
2.19.2


^ permalink raw reply related

* [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-29 11:18 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190829111904.16042-1-parav@mellanox.com>

Some vendor drivers want an identifier for an mdev device that is
shorter than the UUID, due to length restrictions in the consumers of
that identifier.

Add a callback that allows a vendor driver to request an alias of a
specified length to be generated for an mdev device. If generated,
that alias is checked for collisions.

It is an optional attribute.
mdev alias is generated using sha1 from the mdev name.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v1->v2:
 - Kept mdev_device naturally aligned
 - Added error checking for crypt_*() calls
 - Corrected a typo from 'and' to 'an'
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Fixed cleaunup of hash if mdev_bus_register() fails
 - Added comment where alias memory ownership is handed over to mdev device
 - Updated commit log to indicate motivation for this feature
---
 drivers/vfio/mdev/mdev_core.c    | 123 ++++++++++++++++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
 include/linux/mdev.h             |   4 +
 4 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b558d4cfd082..3bdff0469607 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -10,9 +10,11 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uuid.h>
 #include <linux/sysfs.h>
 #include <linux/mdev.h>
+#include <crypto/hash.h>
 
 #include "mdev_private.h"
 
@@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
 static LIST_HEAD(mdev_list);
 static DEFINE_MUTEX(mdev_list_lock);
 
+static struct crypto_shash *alias_hash;
+
 struct device *mdev_parent_dev(struct mdev_device *mdev)
 {
 	return mdev->parent->dev;
@@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
 		return -EINVAL;
 
+	if (ops->get_alias_length) {
+		unsigned int digest_size;
+		unsigned int aligned_len;
+
+		aligned_len = roundup(ops->get_alias_length(), 2);
+		digest_size = crypto_shash_digestsize(alias_hash);
+		if (aligned_len / 2 > digest_size)
+			return -EINVAL;
+	}
+
 	dev = get_device(dev);
 	if (!dev)
 		return -EINVAL;
@@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
 	mutex_unlock(&mdev_list_lock);
 
 	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev->alias);
 	kfree(mdev);
 }
 
@@ -269,18 +284,101 @@ static void mdev_device_release(struct device *dev)
 	mdev_device_free(mdev);
 }
 
-int mdev_device_create(struct kobject *kobj,
-		       struct device *dev, const guid_t *uuid)
+static const char *
+generate_alias(const char *uuid, unsigned int max_alias_len)
+{
+	struct shash_desc *hash_desc;
+	unsigned int digest_size;
+	unsigned char *digest;
+	unsigned int alias_len;
+	char *alias;
+	int ret;
+
+	/*
+	 * Align to multiple of 2 as bin2hex will generate
+	 * even number of bytes.
+	 */
+	alias_len = roundup(max_alias_len, 2);
+	alias = kzalloc(alias_len + 1, GFP_KERNEL);
+	if (!alias)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate and init descriptor */
+	hash_desc = kvzalloc(sizeof(*hash_desc) +
+			     crypto_shash_descsize(alias_hash),
+			     GFP_KERNEL);
+	if (!hash_desc) {
+		ret = -ENOMEM;
+		goto desc_err;
+	}
+
+	hash_desc->tfm = alias_hash;
+
+	digest_size = crypto_shash_digestsize(alias_hash);
+
+	digest = kzalloc(digest_size, GFP_KERNEL);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto digest_err;
+	}
+	ret = crypto_shash_init(hash_desc);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_final(hash_desc, digest);
+	if (ret)
+		goto hash_err;
+
+	bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
+	/*
+	 * When alias length is odd, zero out an additional last byte
+	 * that bin2hex has copied.
+	 */
+	if (max_alias_len % 2)
+		alias[max_alias_len] = 0;
+
+	kfree(digest);
+	kvfree(hash_desc);
+	return alias;
+
+hash_err:
+	kfree(digest);
+digest_err:
+	kvfree(hash_desc);
+desc_err:
+	kfree(alias);
+	return ERR_PTR(ret);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid)
 {
 	int ret;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	const char *alias = NULL;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	if (parent->ops->get_alias_length) {
+		unsigned int alias_len;
+
+		alias_len = parent->ops->get_alias_length();
+		if (alias_len) {
+			alias = generate_alias(uuid_str, alias_len);
+			if (IS_ERR(alias)) {
+				ret = PTR_ERR(alias);
+				goto alias_fail;
+			}
+		}
+	}
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -300,6 +398,12 @@ int mdev_device_create(struct kobject *kobj,
 	}
 
 	guid_copy(&mdev->uuid, uuid);
+	mdev->alias = alias;
+	/*
+	 * At this point alias memory is owned by the mdev.
+	 * Mark it NULL, so that only mdev can free it.
+	 */
+	alias = NULL;
 	list_add(&mdev->next, &mdev_list);
 	mutex_unlock(&mdev_list_lock);
 
@@ -346,6 +450,8 @@ int mdev_device_create(struct kobject *kobj,
 	up_read(&parent->unreg_sem);
 	put_device(&mdev->dev);
 mdev_fail:
+	kfree(alias);
+alias_fail:
 	mdev_put_parent(parent);
 	return ret;
 }
@@ -406,7 +512,17 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
 
 static int __init mdev_init(void)
 {
-	return mdev_bus_register();
+	int ret;
+
+	alias_hash = crypto_alloc_shash("sha1", 0, 0);
+	if (!alias_hash)
+		return -ENOMEM;
+
+	ret = mdev_bus_register();
+	if (ret)
+		crypto_free_shash(alias_hash);
+
+	return ret;
 }
 
 static void __exit mdev_exit(void)
@@ -415,6 +531,7 @@ static void __exit mdev_exit(void)
 		class_compat_unregister(mdev_bus_compat_class);
 
 	mdev_bus_unregister();
+	crypto_free_shash(alias_hash);
 }
 
 module_init(mdev_init)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7d922950caaf..078fdaf7836e 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -32,6 +32,7 @@ struct mdev_device {
 	struct list_head next;
 	struct kobject *type_kobj;
 	struct device *iommu_device;
+	const char *alias;
 	bool active;
 };
 
@@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
 int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
 
-int  mdev_device_create(struct kobject *kobj,
-			struct device *dev, const guid_t *uuid);
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid);
 int  mdev_device_remove(struct device *dev);
 
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 7570c7602ab4..43afe0e80b76 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
 		return -ENOMEM;
 
 	ret = guid_parse(str, &uuid);
-	kfree(str);
 	if (ret)
-		return ret;
+		goto err;
 
-	ret = mdev_device_create(kobj, dev, &uuid);
+	ret = mdev_device_create(kobj, dev, str, &uuid);
 	if (ret)
-		return ret;
+		goto err;
 
-	return count;
+	ret = count;
+
+err:
+	kfree(str);
+	return ret;
 }
 
 MDEV_TYPE_ATTR_WO(create);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0ce30ca78db0..f036fe9854ee 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
  * @mmap:		mmap callback
  *			@mdev: mediated device structure
  *			@vma: vma structure
+ * @get_alias_length:	Generate alias for the mdevs of this parent based on the
+ *			mdev device name when it returns non zero alias length.
+ *			It is optional.
  * Parent device that support mediated device should be registered with mdev
  * module with mdev_parent_ops structure.
  **/
@@ -92,6 +95,7 @@ struct mdev_parent_ops {
 	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
 			 unsigned long arg);
 	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	unsigned int (*get_alias_length)(void);
 };
 
 /* interface for exporting mdev supported type attributes */
-- 
2.19.2


^ permalink raw reply related

* [PATCH v2 3/6] mdev: Expose mdev alias in sysfs tree
From: Parav Pandit @ 2019-08-29 11:19 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190829111904.16042-1-parav@mellanox.com>

Expose the optional alias for an mdev device as a sysfs attribute.
This way, userspace tools such as udev may make use of the alias, for
example to create a netdevice name for the mdev.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v0->v1:
 - Addressed comments from Cornelia Huck
 - Updated commit description
---
 drivers/vfio/mdev/mdev_sysfs.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 43afe0e80b76..59f4e3cc5233 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR_WO(remove);
 
+static ssize_t alias_show(struct device *device,
+			  struct device_attribute *attr, char *buf)
+{
+	struct mdev_device *dev = mdev_from_dev(device);
+
+	if (!dev->alias)
+		return -EOPNOTSUPP;
+
+	return sprintf(buf, "%s\n", dev->alias);
+}
+static DEVICE_ATTR_RO(alias);
+
 static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_alias.attr,
 	&dev_attr_remove.attr,
 	NULL,
 };
-- 
2.19.2


^ permalink raw reply related

* [PATCH v2 4/6] mdev: Introduce an API mdev_alias
From: Parav Pandit @ 2019-08-29 11:19 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190829111904.16042-1-parav@mellanox.com>

Introduce an API mdev_alias() to provide access to optionally generated
alias.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 12 ++++++++++++
 include/linux/mdev.h          |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index c9bf2ac362b9..5399ed6f1612 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -517,6 +517,18 @@ struct device *mdev_get_iommu_device(struct device *dev)
 }
 EXPORT_SYMBOL(mdev_get_iommu_device);
 
+/**
+ * mdev_alias: Return alias string of a mdev device
+ * @mdev:	Pointer to the mdev device
+ * mdev_alias() returns alias string of a mdev device if alias is present,
+ * returns NULL otherwise.
+ */
+const char *mdev_alias(struct mdev_device *mdev)
+{
+	return mdev->alias;
+}
+EXPORT_SYMBOL(mdev_alias);
+
 static int __init mdev_init(void)
 {
 	int ret;
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index f036fe9854ee..6da82213bc4e 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -148,5 +148,6 @@ void mdev_unregister_driver(struct mdev_driver *drv);
 struct device *mdev_parent_dev(struct mdev_device *mdev);
 struct device *mdev_dev(struct mdev_device *mdev);
 struct mdev_device *mdev_from_dev(struct device *dev);
+const char *mdev_alias(struct mdev_device *mdev);
 
 #endif /* MDEV_H */
-- 
2.19.2


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox