Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 5/5] net: Replace ip_ra_lock with per-net mutex
From: Kirill Tkhai @ 2018-03-15 13:20 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	ktkhai, avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl
In-Reply-To: <152111892326.30586.3742839588131331286.stgit@localhost.localdomain>

Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/net/netns/ipv4.h |    1 +
 net/core/net_namespace.c |    1 +
 net/ipv4/ip_sockglue.c   |   15 ++++++---------
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 23c208fcf1a0..7295429732c6 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -50,6 +50,7 @@ struct netns_ipv4 {
 	struct ipv4_devconf	*devconf_all;
 	struct ipv4_devconf	*devconf_dflt;
 	struct ip_ra_chain	*ra_chain;
+	struct mutex		ra_mutex;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_rules_ops	*rules_ops;
 	bool			fib_has_custom_rules;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c340d5cfbdec..95ba2c53bd9a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -301,6 +301,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
 	spin_lock_init(&net->nsid_lock);
+	mutex_init(&net->ipv4.ra_mutex);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f36d35fe924b..5ad2d8ed3a3f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,9 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 	return 0;
 }
 
-static DEFINE_SPINLOCK(ip_ra_lock);
-
-
 static void ip_ra_destroy_rcu(struct rcu_head *head)
 {
 	struct ip_ra_chain *ra = container_of(head, struct ip_ra_chain, rcu);
@@ -345,21 +342,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-	spin_lock_bh(&ip_ra_lock);
+	mutex_lock(&net->ipv4.ra_mutex);
 	for (rap = &net->ipv4.ra_chain;
 	     (ra = rcu_dereference_protected(*rap,
-			lockdep_is_held(&ip_ra_lock))) != NULL;
+			lockdep_is_held(&net->ipv4.ra_mutex))) != NULL;
 	     rap = &ra->next) {
 		if (ra->sk == sk) {
 			if (on) {
-				spin_unlock_bh(&ip_ra_lock);
+				mutex_unlock(&net->ipv4.ra_mutex);
 				kfree(new_ra);
 				return -EADDRINUSE;
 			}
 			/* dont let ip_call_ra_chain() use sk again */
 			ra->sk = NULL;
 			RCU_INIT_POINTER(*rap, ra->next);
-			spin_unlock_bh(&ip_ra_lock);
+			mutex_unlock(&net->ipv4.ra_mutex);
 
 			if (ra->destructor)
 				ra->destructor(sk);
@@ -374,7 +371,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 		}
 	}
 	if (!new_ra) {
-		spin_unlock_bh(&ip_ra_lock);
+		mutex_unlock(&net->ipv4.ra_mutex);
 		return -ENOBUFS;
 	}
 	new_ra->sk = sk;
@@ -383,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 	RCU_INIT_POINTER(new_ra->next, ra);
 	rcu_assign_pointer(*rap, new_ra);
 	sock_hold(sk);
-	spin_unlock_bh(&ip_ra_lock);
+	mutex_unlock(&net->ipv4.ra_mutex);
 
 	return 0;
 }

^ permalink raw reply related

* Re: [PATCH net-next 0/5] sctp: add support for some sctp auth APIs from RFC6458
From: Neil Horman @ 2018-03-15 13:20 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem
In-Reply-To: <cover.1521025473.git.lucien.xin@gmail.com>

On Wed, Mar 14, 2018 at 07:05:29PM +0800, Xin Long wrote:
> This patchset mainly adds support for SCTP AUTH Information for sendmsg,
> described in RFC6458:
> 
>     5.3.8.  SCTP AUTH Information Structure (SCTP_AUTHINFO)
> 
> and also adds a sockopt described in RFC6458:
> 
>     8.3.4.  Deactivate a Shared Key (SCTP_AUTH_DEACTIVATE_KEY)
> 
> and two types of events for AUTHENTICATION_EVENT described in RFC6458:
> 
>     6.1.8.  SCTP_AUTHENTICATION_EVENT:
>              - SCTP_AUTH_NO_AUTH
>              - SCTP_AUTH_FREE_KEY
> 
> After this patchset, we have fully support for sctp_sendv in kernel.
> 
> Note that this patchset won't touch that sctp options merge conflict.
> 
> Xin Long (5):
>   sctp: add refcnt support for sh_key
>   sctp: add support for SCTP AUTH Information for sendmsg
>   sctp: add sockopt SCTP_AUTH_DEACTIVATE_KEY
>   sctp: add SCTP_AUTH_FREE_KEY type for AUTHENTICATION_EVENT
>   sctp: add SCTP_AUTH_NO_AUTH type for AUTHENTICATION_EVENT
> 
>  include/net/sctp/auth.h    |  21 ++++---
>  include/net/sctp/command.h |   1 +
>  include/net/sctp/sm.h      |   3 +-
>  include/net/sctp/structs.h |  10 +++-
>  include/uapi/linux/sctp.h  |  22 ++++++-
>  net/sctp/auth.c            | 146 +++++++++++++++++++++++++++++++--------------
>  net/sctp/chunk.c           |  14 +++++
>  net/sctp/output.c          |  18 +++++-
>  net/sctp/sm_make_chunk.c   |  33 +++++++++-
>  net/sctp/sm_sideeffect.c   |  13 ++++
>  net/sctp/sm_statefuns.c    |  56 ++++++++++++++---
>  net/sctp/socket.c          |  77 ++++++++++++++++++++++++
>  12 files changed, 342 insertions(+), 72 deletions(-)
> 
> -- 
> 2.1.0
> 
> 
Series
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [PATCH net-next] tuntap: XDP_TX can use native XDP
From: Michael S. Tsirkin @ 2018-03-15 13:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel
In-Reply-To: <21e53c2b-6677-fc9a-0771-e895da1c257f@redhat.com>

On Thu, Mar 15, 2018 at 04:39:25PM +0800, Jason Wang wrote:
> 
> 
> On 2018年03月14日 11:37, Michael S. Tsirkin wrote:
> > >   			return NULL;
> > >   		case XDP_TX:
> > > -			xdp_xmit = true;
> > > -			/* fall through */
> > > +			get_page(alloc_frag->page);
> > > +			alloc_frag->offset += buflen;
> > > +			if (tun_xdp_xmit(tun->dev, &xdp))
> > > +				goto err_redirect;
> > > +			tun_xdp_flush(tun->dev);
> > Why do we have to flush here though?
> > It might be a good idea to document the reason in a code comment.
> > 
> 
> ndo_xdp_xmit() does not touch doorbell, so we need a ndo_xdp_flush() here.
> It's the assumption of XDP API I think, so not sure it's worth to mention it
> here.
> 
> Thanks

Can't one flush we called after multiple xmit calls?

^ permalink raw reply

* Re: [PATCH net-next nfs 0/6] Converting pernet_operations (part #7)
From: Kirill Tkhai @ 2018-03-15 13:32 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

Hi,

Trond, Anna, Bruce, Jeff, David and other NFS and RXRPC people,
could you please provide your vision on this patches?

Thanks,
Kirill

On 13.03.2018 13:49, Kirill Tkhai wrote:
> Hi,
> 
> this series continues to review and to convert pernet_operations
> to make them possible to be executed in parallel for several
> net namespaces in the same time. There are nfs pernet_operations
> in this series. All of them look similar each other, they mostly
> create and destroy caches with small exceptions.
> 
> Also, there is rxrpc_net_ops, which is used in AFS.
> 
> Thanks,
> Kirill
> ---
> 
> Kirill Tkhai (6):
>       net: Convert rpcsec_gss_net_ops
>       net: Convert sunrpc_net_ops
>       net: Convert nfsd_net_ops
>       net: Convert nfs4_dns_resolver_ops
>       net: Convert nfs4blocklayout_net_ops
>       net: Convert rxrpc_net_ops
> 
> 
>  fs/nfs/blocklayout/rpc_pipefs.c |    1 +
>  fs/nfs/dns_resolve.c            |    1 +
>  fs/nfsd/nfsctl.c                |    1 +
>  net/rxrpc/net_ns.c              |    1 +
>  net/sunrpc/auth_gss/auth_gss.c  |    1 +
>  net/sunrpc/sunrpc_syms.c        |    1 +
>  6 files changed, 6 insertions(+)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> 

^ permalink raw reply

* Re: netns: send uevent messages
From: Christian Brauner @ 2018-03-15 13:39 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christian Brauner, ebiederm, gregkh, netdev, linux-kernel, serge,
	avagin
In-Reply-To: <cff34509-16ff-f12c-cd8c-9bcd91645d58@virtuozzo.com>

On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
> CC Andrey Vagin

Hey Kirill,

Thanks for CCing Andrey.

> 
> On 15.03.2018 03:12, Christian Brauner wrote:
> > This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> > to allow sending uevent messages into the network namespace the socket
> > belongs to.
> > 
> > Currently non-initial network namespaces are already isolated and don't
> > receive uevents. There are a number of cases where it is beneficial for a
> > sufficiently privileged userspace process to send a uevent into a network
> > namespace.
> > 
> > One such use case would be debugging and fuzzing of a piece of software
> > which listens and reacts to uevents. By running a copy of that software
> > inside a network namespace, specific uevents could then be presented to it.
> > More concretely, this would allow for easy testing of udevd/ueventd.
> > 
> > This will also allow some piece of software to run components inside a
> > separate network namespace and then effectively filter what that software
> > can receive. Some examples of software that do directly listen to uevents
> > and that we have in the past attempted to run inside a network namespace
> > are rbd (CEPH client) or the X server.
> > 
> > Implementation:
> > The implementation has been kept as simple as possible from the kernel's
> > perspective. Specifically, a simple input method uevent_net_rcv() is added
> > to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
> > af_netlink infrastructure and does neither add an additional netlink family
> > nor requires any user-visible changes.
> > 
> > For example, by using netlink_rcv_skb() we can make use of existing netlink
> > infrastructure to report back informative error messages to userspace.
> > 
> > Furthermore, this implementation does not introduce any overhead for
> > existing uevent generating codepaths. The struct netns gets a new uevent
> > socket member that records the uevent socket associated with that network
> > namespace. Since we record the uevent socket for each network namespace in
> > struct net we don't have to walk the whole uevent socket list.
> > Instead we can directly retrieve the relevant uevent socket and send the
> > message. This keeps the codepath very performant without introducing
> > needless overhead.
> > 
> > Uevent sequence numbers are kept global. When a uevent message is sent to
> > another network namespace the implementation will simply increment the
> > global uevent sequence number and append it to the received uevent. This
> > has the advantage that the kernel will never need to parse the received
> > uevent message to replace any existing uevent sequence numbers. Instead it
> > is up to the userspace process to remove any existing uevent sequence
> > numbers in case the uevent message to be sent contains any.
> > 
> > Security:
> > In order for a caller to send uevent messages to a target network namespace
> > the caller must have CAP_SYS_ADMIN in the owning user namespace of the
> > target network namespace. Additionally, any received uevent message is
> > verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
> > needed to append the uevent sequence number.
> > 
> > Testing:
> > This patch has been tested and verified to work with the following udev
> > implementations:
> > 1. CentOS 6 with udevd version 147
> > 2. Debian Sid with systemd-udevd version 237
> > 3. Android 7.1.1 with ueventd
> > 
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  include/net/net_namespace.h |  1 +
> >  lib/kobject_uevent.c        | 88 ++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 88 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index f306b2aa15a4..467bde763a9b 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -78,6 +78,7 @@ struct net {
> >  
> >  	struct sock 		*rtnl;			/* rtnetlink socket */
> >  	struct sock		*genl_sock;
> > +	struct sock		*uevent_sock;		/* uevent socket */
> 
> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
> become to look confusing. There are:
> 
>         mutex_lock(&uevent_sock_mutex);
>         list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>                 if (sock_net(ue_sk->sk) == net)
>                         goto found;
>         }
> 
> Can't we make a small cleanup in lib/kobject_uevent.c after this change
> and before the main part of the patch goes?

Hm, not sure. It seems it makes sense to maintain them in a separate
list. Looks like this lets us keep the locking simpler. Otherwise we
would have to do something like for_each_net() and it seems that this
would force us to use rntl_{un}lock().

> 
> >  
> >  	struct list_head 	dev_base_head;
> >  	struct hlist_head 	*dev_name_head;
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 9fe6ec8fda28..10b2144b9fc3 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/uuid.h>
> >  #include <linux/ctype.h>
> >  #include <net/sock.h>
> > +#include <net/netlink.h>
> >  #include <net/net_namespace.h>
> >  
> >  
> > @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> >  EXPORT_SYMBOL_GPL(add_uevent_var);
> >  
> >  #if defined(CONFIG_NET)
> > +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
> > +			   struct netlink_ext_ack *extack)
> > +{
> > +	int ret;
> > +	u64 seqnum;
> > +	/* u64 to chars: 2^64 - 1 = 21 chars */
> 
> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
> we forgot +1 in buf declaration.

sizeof("SEQNUM=") will include the '\0' pointer in contrast to
strlen("SEQNUM=") so this is correct if I'm not completely mistaken.

> 
> > +	char buf[sizeof("SEQNUM=") + 21];
> > +	struct sk_buff *skbc;
> > +
> > +	/* bump sequence number */
> > +	mutex_lock(&uevent_sock_mutex);
> > +	seqnum = ++uevent_seqnum;
> > +	mutex_unlock(&uevent_sock_mutex);
> 
> Commit 7b60a18da393 from Andrew Vagin says:
> 
>     uevent: send events in correct order according to seqnum (v3)
>     
>     The queue handling in the udev daemon assumes that the events are
>     ordered.
>     
>     Before this patch uevent_seqnum is incremented under sequence_lock,
>     than an event is send uner uevent_sock_mutex. I want to say that code
>     contained a window between incrementing seqnum and sending an event.
>     
>     This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
>     Signed-off-by: Andrew Vagin <avagin@openvz.org>
> 
> After this change the order will be lost. Also the rest of namespaces
> will have holes in uevent numbers, as they won't receive a number sent
> to specific namespace.

Afaict from looking into udevd when I wrote the patch it only cares
about numbers being ordered (which is also what Andrey's patch states)
not that they are sequential so holes should be fine. udevd will use
the DEVPATH to determine whether the sequence number of the current
uevent should be used as "even->delaying_seqnum" number. All that
matters is that it is greater than the previous one. About the ordering,
if that's an issue then we should simply do what Andrey has been doing
for kobject_uevent_env() and extend the lock being held until after we
sent out the uevent. Since we're not serving all listeners but only
ever one this is also way more lightweight then kobject_uevent_env().

> 
> It seems we should made uevent_seqnum per-net to solve this problem.

Yes, Eric and I have been discussing this already. The idea was to do
this in a follow-up patch to keep this patch as simple as possible. If
we agree that it would make sense right away then I will dig out the
corresponding patch.
It would basically just involve giving each struct net a uevent_seqnum
member.

Christian

> 
> > +	/* prepare sequence number */
> > +	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", seqnum);
> > +	if (ret < 0 || (size_t)ret >= sizeof(buf))
> > +		return -ENOMEM;> +	ret++;
> > +
> > +	/* verify message does not overflow */
> > +	if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
> > +		NL_SET_ERR_MSG(extack, "uevent message too big");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* copy skb and extend to accommodate sequence number */
> > +	skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
> > +	if (!skbc)
> > +		return -ENOMEM;
> > +
> > +	/* append sequence number */
> > +	skb_put_data(skbc, buf, ret);
> > +
> > +	/* remove msg header */
> > +	skb_pull(skbc, NLMSG_HDRLEN);
> > +
> > +	/* set portid 0 to inform userspace message comes from kernel */
> > +	NETLINK_CB(skbc).portid = 0;
> > +	NETLINK_CB(skbc).dst_group = 1;
> > +
> > +	ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
> > +	/* ENOBUFS should be handled in userspace */
> > +	if (ret == -ENOBUFS || ret == -ESRCH)
> > +		ret = 0;
> > +
> > +	return ret;
> > +}
> > +
> > +static int uevent_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
> > +			  struct netlink_ext_ack *extack)
> > +{
> > +	void *msg;
> > +	struct net *target_net;
> > +	struct sock *target_sk;
> > +
> > +	msg = nlmsg_data(nlh);
> > +	if (!msg)
> > +		return -EINVAL;
> > +
> > +	target_net = sock_net(NETLINK_CB(skb).sk);
> > +	target_sk = target_net->uevent_sock;
> > +
> > +	/*
> > +	 * Verify that we are allowed to send messages to the target network
> > +	 * namespace. The caller must have CAP_SYS_ADMIN in the owning user
> > +	 * namespace of the target network namespace.
> > +	 */
> > +	if (!netlink_ns_capable(skb, target_net->user_ns, CAP_SYS_ADMIN)) {
> > +		NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
> > +		return -EPERM;
> > +	}
> > +
> > +	return uevent_net_send(target_sk, skb, extack);
> > +}
> > +
> > +static void uevent_net_rcv(struct sk_buff *skb)
> > +{
> > +	netlink_rcv_skb(skb, &uevent_rcv_msg);
> > +}
> > +
> >  static int uevent_net_init(struct net *net)
> >  {
> >  	struct uevent_sock *ue_sk;
> >  	struct netlink_kernel_cfg cfg = {
> >  		.groups	= 1,
> > -		.flags	= NL_CFG_F_NONROOT_RECV,
> > +		.input = uevent_net_rcv,
> > +		.flags	= NL_CFG_F_NONROOT_RECV
> >  	};
> >  
> >  	ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
> > @@ -621,6 +704,9 @@ static int uevent_net_init(struct net *net)
> >  		kfree(ue_sk);
> >  		return -ENODEV;
> >  	}
> > +
> > +	net->uevent_sock = ue_sk->sk;
> > +
> >  	mutex_lock(&uevent_sock_mutex);
> >  	list_add_tail(&ue_sk->list, &uevent_sock_list);
> >  	mutex_unlock(&uevent_sock_mutex);
> 
> Kirill

^ permalink raw reply

* Re: [PATCH v2] vhost: add vsock compat ioctl
From: Stefan Hajnoczi @ 2018-03-15 13:51 UTC (permalink / raw)
  To: Sonny Rao
  Cc: kvm, Michael S . Tsirkin, Jason Wang, virtualization, netdev,
	linux-kernel
In-Reply-To: <20180314213625.119211-1-sonnyrao@chromium.org>

[-- Attachment #1: Type: text/plain, Size: 335 bytes --]

On Wed, Mar 14, 2018 at 02:36:25PM -0700, Sonny Rao wrote:
> This will allow usage of vsock from 32-bit binaries on a 64-bit
> kernel.
> 
> Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
> ---
>  drivers/vhost/vsock.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply

* [PATCH] cxgb4/cudbg_lib: Use common error handling code in cudbg_collect_tid()
From: SF Markus Elfring @ 2018-03-15 14:05 UTC (permalink / raw)
  To: netdev, Ganesh Goudar; +Cc: LKML, kernel-janitors, Rahul Lakkireddy

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 15 Mar 2018 14:56:10 +0100

Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
index 9da6f57901a9..d5b5b6221d1f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
@@ -1660,11 +1660,9 @@ int cudbg_collect_tid(struct cudbg_init *pdbg_init,
 	para[0] = FW_PARAM_PFVF_A(ETHOFLD_START);
 	para[1] = FW_PARAM_PFVF_A(ETHOFLD_END);
 	rc = t4_query_params(padap, padap->mbox, padap->pf, 0, 2, para, val);
-	if (rc <  0) {
-		cudbg_err->sys_err = rc;
-		cudbg_put_buff(pdbg_init, &temp_buff);
-		return rc;
-	}
+	if (rc < 0)
+		goto put_buffer;
+
 	tid->uotid_base = val[0];
 	tid->nuotids = val[1] - val[0] + 1;
 
@@ -1679,11 +1677,9 @@ int cudbg_collect_tid(struct cudbg_init *pdbg_init,
 		para[1] = FW_PARAM_PFVF_A(HPFILTER_END);
 		rc = t4_query_params(padap, padap->mbox, padap->pf, 0, 2,
 				     para, val);
-		if (rc < 0) {
-			cudbg_err->sys_err = rc;
-			cudbg_put_buff(pdbg_init, &temp_buff);
-			return rc;
-		}
+		if (rc < 0)
+			goto put_buffer;
+
 		tid->hpftid_base = val[0];
 		tid->nhpftids = val[1] - val[0] + 1;
 	}
@@ -1711,6 +1707,11 @@ int cudbg_collect_tid(struct cudbg_init *pdbg_init,
 	tid->ipv6_users = t4_read_reg(padap, LE_DB_ACT_CNT_IPV6_A);
 
 	return cudbg_write_and_release_buff(pdbg_init, &temp_buff, dbg_buff);
+
+put_buffer:
+	cudbg_put_buff(pdbg_init, &temp_buff);
+	cudbg_err->sys_err = rc;
+	return rc;
 }
 
 int cudbg_collect_pcie_config(struct cudbg_init *pdbg_init,
-- 
2.16.2

^ permalink raw reply related

* [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Davide Caratti @ 2018-03-15 14:06 UTC (permalink / raw)
  To: Cong Wang, Manish Kurup, Jiri Pirko, David S. Miller; +Cc: netdev

when the following command

 # tc actions replace action vlan pop index 100

is run for the first time, and tcf_vlan_init() fails allocating struct
tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 IP: __call_rcu+0x23/0x2b0
 PGD 80000000760a2067 P4D 80000000760a2067 PUD 742c1067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_vlan(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel mbcache snd_hda_codec jbd2 snd_hda_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev soundcore virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_blk virtio_net ata_piix crc32c_intel libata virtio_pci i2c_core virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 3 PID: 3119 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffaac3005fb798 EFLAGS: 00010246
 RAX: ffffffffc0704080 RBX: ffff97f2b4bbe900 RCX: 00000000ffffffff
 RDX: ffffffffabca5f00 RSI: 0000000000000010 RDI: 0000000000000010
 RBP: 0000000000000010 R08: 0000000000000001 R09: 0000000000000044
 R10: 00000000fd003000 R11: ffff97f2faab5b91 R12: 0000000000000000
 R13: ffffffffabca5f00 R14: ffff97f2fb80202c R15: 00000000fffffff4
 FS:  00007f68f75b4740(0000) GS:ffff97f2ffd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000018 CR3: 0000000072b52001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_vlan_init+0x168/0x270 [act_vlan]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f68f69c5ba0
 RSP: 002b:00007fffd79c1118 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fffd79c1240 RCX: 00007f68f69c5ba0
 RDX: 0000000000000000 RSI: 00007fffd79c1190 RDI: 0000000000000003
 RBP: 000000005aaa708e R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fffd79c0ba0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fffd79c1254 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffaac3005fb798
 CR2: 0000000000000018

fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.

Fixes: 4c5b9d9642c8 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_vlan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index e1a1b3f3983a..c2914e9a4a6f 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -225,7 +225,8 @@ static void tcf_vlan_cleanup(struct tc_action *a)
 	struct tcf_vlan_params *p;
 
 	p = rcu_dereference_protected(v->vlan_p, 1);
-	kfree_rcu(p, rcu);
+	if (p)
+		kfree_rcu(p, rcu);
 }
 
 static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.14.3

^ permalink raw reply related

* Re: netns: send uevent messages
From: Kirill Tkhai @ 2018-03-15 14:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, ebiederm, gregkh, netdev, linux-kernel, serge,
	avagin
In-Reply-To: <20180315133920.GA25733@gmail.com>

On 15.03.2018 16:39, Christian Brauner wrote:
> On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
>> CC Andrey Vagin
> 
> Hey Kirill,
> 
> Thanks for CCing Andrey.
> 
>>
>> On 15.03.2018 03:12, Christian Brauner wrote:
>>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
>>> to allow sending uevent messages into the network namespace the socket
>>> belongs to.
>>>
>>> Currently non-initial network namespaces are already isolated and don't
>>> receive uevents. There are a number of cases where it is beneficial for a
>>> sufficiently privileged userspace process to send a uevent into a network
>>> namespace.
>>>
>>> One such use case would be debugging and fuzzing of a piece of software
>>> which listens and reacts to uevents. By running a copy of that software
>>> inside a network namespace, specific uevents could then be presented to it.
>>> More concretely, this would allow for easy testing of udevd/ueventd.
>>>
>>> This will also allow some piece of software to run components inside a
>>> separate network namespace and then effectively filter what that software
>>> can receive. Some examples of software that do directly listen to uevents
>>> and that we have in the past attempted to run inside a network namespace
>>> are rbd (CEPH client) or the X server.
>>>
>>> Implementation:
>>> The implementation has been kept as simple as possible from the kernel's
>>> perspective. Specifically, a simple input method uevent_net_rcv() is added
>>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
>>> af_netlink infrastructure and does neither add an additional netlink family
>>> nor requires any user-visible changes.
>>>
>>> For example, by using netlink_rcv_skb() we can make use of existing netlink
>>> infrastructure to report back informative error messages to userspace.
>>>
>>> Furthermore, this implementation does not introduce any overhead for
>>> existing uevent generating codepaths. The struct netns gets a new uevent
>>> socket member that records the uevent socket associated with that network
>>> namespace. Since we record the uevent socket for each network namespace in
>>> struct net we don't have to walk the whole uevent socket list.
>>> Instead we can directly retrieve the relevant uevent socket and send the
>>> message. This keeps the codepath very performant without introducing
>>> needless overhead.
>>>
>>> Uevent sequence numbers are kept global. When a uevent message is sent to
>>> another network namespace the implementation will simply increment the
>>> global uevent sequence number and append it to the received uevent. This
>>> has the advantage that the kernel will never need to parse the received
>>> uevent message to replace any existing uevent sequence numbers. Instead it
>>> is up to the userspace process to remove any existing uevent sequence
>>> numbers in case the uevent message to be sent contains any.
>>>
>>> Security:
>>> In order for a caller to send uevent messages to a target network namespace
>>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
>>> target network namespace. Additionally, any received uevent message is
>>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
>>> needed to append the uevent sequence number.
>>>
>>> Testing:
>>> This patch has been tested and verified to work with the following udev
>>> implementations:
>>> 1. CentOS 6 with udevd version 147
>>> 2. Debian Sid with systemd-udevd version 237
>>> 3. Android 7.1.1 with ueventd
>>>
>>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>>> ---
>>>  include/net/net_namespace.h |  1 +
>>>  lib/kobject_uevent.c        | 88 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  2 files changed, 88 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>>> index f306b2aa15a4..467bde763a9b 100644
>>> --- a/include/net/net_namespace.h
>>> +++ b/include/net/net_namespace.h
>>> @@ -78,6 +78,7 @@ struct net {
>>>  
>>>  	struct sock 		*rtnl;			/* rtnetlink socket */
>>>  	struct sock		*genl_sock;
>>> +	struct sock		*uevent_sock;		/* uevent socket */
>>
>> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
>> become to look confusing. There are:
>>
>>         mutex_lock(&uevent_sock_mutex);
>>         list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>>                 if (sock_net(ue_sk->sk) == net)
>>                         goto found;
>>         }
>>
>> Can't we make a small cleanup in lib/kobject_uevent.c after this change
>> and before the main part of the patch goes?
> 
> Hm, not sure. It seems it makes sense to maintain them in a separate
> list. Looks like this lets us keep the locking simpler. Otherwise we
> would have to do something like for_each_net() and it seems that this
> would force us to use rntl_{un}lock().

I'm about:

mutex_lock();
list_del(net->ue_sk->list);
mutex_unlock();
kfree();

Thus we avoid iterations in uevent_net_exit().

>>
>>>  
>>>  	struct list_head 	dev_base_head;
>>>  	struct hlist_head 	*dev_name_head;
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 9fe6ec8fda28..10b2144b9fc3 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/uuid.h>
>>>  #include <linux/ctype.h>
>>>  #include <net/sock.h>
>>> +#include <net/netlink.h>
>>>  #include <net/net_namespace.h>
>>>  
>>>  
>>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>>>  EXPORT_SYMBOL_GPL(add_uevent_var);
>>>  
>>>  #if defined(CONFIG_NET)
>>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
>>> +			   struct netlink_ext_ack *extack)
>>> +{
>>> +	int ret;
>>> +	u64 seqnum;
>>> +	/* u64 to chars: 2^64 - 1 = 21 chars */
>>
>> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
>> we forgot +1 in buf declaration.
> 
> sizeof("SEQNUM=") will include the '\0' pointer in contrast to
> strlen("SEQNUM=") so this is correct if I'm not completely mistaken.

The code is OK, I'm worrying about comment. But I've missed this sizeof().
So, there is only 1 bytes excess allocated as 0xFFFFFFFFFFFFFFFF=18446744073709551615
Not so important...

>>
>>> +	char buf[sizeof("SEQNUM=") + 21];
>>> +	struct sk_buff *skbc;
>>> +
>>> +	/* bump sequence number */
>>> +	mutex_lock(&uevent_sock_mutex);
>>> +	seqnum = ++uevent_seqnum;
>>> +	mutex_unlock(&uevent_sock_mutex);
>>
>> Commit 7b60a18da393 from Andrew Vagin says:
>>
>>     uevent: send events in correct order according to seqnum (v3)
>>     
>>     The queue handling in the udev daemon assumes that the events are
>>     ordered.
>>     
>>     Before this patch uevent_seqnum is incremented under sequence_lock,
>>     than an event is send uner uevent_sock_mutex. I want to say that code
>>     contained a window between incrementing seqnum and sending an event.
>>     
>>     This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
>>     Signed-off-by: Andrew Vagin <avagin@openvz.org>
>>
>> After this change the order will be lost. Also the rest of namespaces
>> will have holes in uevent numbers, as they won't receive a number sent
>> to specific namespace.
> 
> Afaict from looking into udevd when I wrote the patch it only cares
> about numbers being ordered (which is also what Andrey's patch states)
> not that they are sequential so holes should be fine. udevd will use
> the DEVPATH to determine whether the sequence number of the current
> uevent should be used as "even->delaying_seqnum" number. All that
> matters is that it is greater than the previous one. About the ordering,
> if that's an issue then we should simply do what Andrey has been doing
> for kobject_uevent_env() and extend the lock being held until after we
> sent out the uevent. Since we're not serving all listeners but only
> ever one this is also way more lightweight then kobject_uevent_env().

Yes, extending the lock to netlink_broadcast() should fix the problem.

>>
>> It seems we should made uevent_seqnum per-net to solve this problem.
> 
> Yes, Eric and I have been discussing this already. The idea was to do
> this in a follow-up patch to keep this patch as simple as possible. If
> we agree that it would make sense right away then I will dig out the
> corresponding patch.
> It would basically just involve giving each struct net a uevent_seqnum
> member.

pernet seqnum may have a sense if we introduce per-net locks to protect it.
If there is single mutex, it does not matter either they are pernet or not.
Per-net mutex may be useful only if we have many single-net messages like
you introduce in this patch, or if we are going to filter net in existing
kobject_uevent_net_broadcast() (by user_ns?!) in the future.

Kirill

^ permalink raw reply

* Re: [PATCH 00/16] remove eight obsolete architectures
From: Christoph Hellwig @ 2018-03-15 14:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arch, linux-block, linux-fbdev, linux-watchdog,
	open list:DOCUMENTATION, Networking, dri-devel, linux-usb,
	linux-wireless, Linux Kernel Mailing List, linux-pwm, linux-spi,
	David Howells, IDE-ML, Hannes Reinecke, open list:HID CORE LAYER,
	Linux FS-devel Mailing List, Linux-MM, linux-rtc
In-Reply-To: <CAK8P3a01pfvsdM1mR8raU9dA7p4H-jRJz2Y8-+KEY76W_Mukpg@mail.gmail.com>

On Thu, Mar 15, 2018 at 11:42:25AM +0100, Arnd Bergmann wrote:
> Is anyone producing a chip that includes enough of the Privileged ISA spec
> to have things like system calls, but not the MMU parts?

Various SiFive SOCs seem to support M and U mode, but no S mode or
iommu.  That should be enough for nommu Linux running in M mode if
someone cares enough to actually port it.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply

* Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Jiri Pirko @ 2018-03-15 14:21 UTC (permalink / raw)
  To: Davide Caratti; +Cc: Cong Wang, Manish Kurup, David S. Miller, netdev
In-Reply-To: <1d301fb86becf863283ee1be107f17e837ef4255.1521122595.git.dcaratti@redhat.com>

Thu, Mar 15, 2018 at 03:06:30PM CET, dcaratti@redhat.com wrote:
>when the following command
>
> # tc actions replace action vlan pop index 100
>
>is run for the first time, and tcf_vlan_init() fails allocating struct
>tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
>the following error:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> IP: __call_rcu+0x23/0x2b0
> PGD 80000000760a2067 P4D 80000000760a2067 PUD 742c1067 PMD 0
> Oops: 0002 [#1] SMP PTI
> Modules linked in: act_vlan(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel mbcache snd_hda_codec jbd2 snd_hda_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev soundcore virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_blk virtio_net ata_piix crc32c_intel libata virtio_pci i2c_core virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
> CPU: 3 PID: 3119 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> RIP: 0010:__call_rcu+0x23/0x2b0
> RSP: 0018:ffffaac3005fb798 EFLAGS: 00010246
> RAX: ffffffffc0704080 RBX: ffff97f2b4bbe900 RCX: 00000000ffffffff
> RDX: ffffffffabca5f00 RSI: 0000000000000010 RDI: 0000000000000010
> RBP: 0000000000000010 R08: 0000000000000001 R09: 0000000000000044
> R10: 00000000fd003000 R11: ffff97f2faab5b91 R12: 0000000000000000
> R13: ffffffffabca5f00 R14: ffff97f2fb80202c R15: 00000000fffffff4
> FS:  00007f68f75b4740(0000) GS:ffff97f2ffd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000018 CR3: 0000000072b52001 CR4: 00000000001606e0
> Call Trace:
>  __tcf_idr_release+0x79/0xf0
>  tcf_vlan_init+0x168/0x270 [act_vlan]
>  tcf_action_init_1+0x2cc/0x430
>  tcf_action_init+0xd3/0x1b0
>  tc_ctl_action+0x18b/0x240
>  rtnetlink_rcv_msg+0x29c/0x310
>  ? _cond_resched+0x15/0x30
>  ? __kmalloc_node_track_caller+0x1b9/0x270
>  ? rtnl_calcit.isra.28+0x100/0x100
>  netlink_rcv_skb+0xd2/0x110
>  netlink_unicast+0x17c/0x230
>  netlink_sendmsg+0x2cd/0x3c0
>  sock_sendmsg+0x30/0x40
>  ___sys_sendmsg+0x27a/0x290
>  ? filemap_map_pages+0x34a/0x3a0
>  ? __handle_mm_fault+0xbfd/0xe20
>  __sys_sendmsg+0x51/0x90
>  do_syscall_64+0x6e/0x1a0
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f68f69c5ba0
> RSP: 002b:00007fffd79c1118 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 00007fffd79c1240 RCX: 00007f68f69c5ba0
> RDX: 0000000000000000 RSI: 00007fffd79c1190 RDI: 0000000000000003
> RBP: 000000005aaa708e R08: 0000000000000002 R09: 0000000000000000
> R10: 00007fffd79c0ba0 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007fffd79c1254 R14: 0000000000000001 R15: 0000000000669f60
> Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
> RIP: __call_rcu+0x23/0x2b0 RSP: ffffaac3005fb798
> CR2: 0000000000000018
>
>fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
>only when p is not NULL.
>
>Fixes: 4c5b9d9642c8 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
>Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Jiri Benc @ 2018-03-15 14:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Alexei Starovoitov, David S. Miller,
	Daniel Borkmann, Network Development, Kernel Team
In-Reply-To: <20180315033657.jj7ozjx66p27h3ar@ast-mbp>

On Wed, 14 Mar 2018 20:37:00 -0700, Alexei Starovoitov wrote:
> Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.
> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html

ipvlan improved a lot since that time :-) And Paolo has recently fixed
the remaining issues we were aware of.

 Jiri

^ permalink raw reply

* Re: [PATCH net-next nfs 0/6] Converting pernet_operations (part #7)
From: J. Bruce Fields @ 2018-03-15 14:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: davem, trond.myklebust, anna.schumaker, jlayton, dhowells,
	keescook, dwindsor, ishkamiel, elena.reshetova, linux-nfs,
	linux-afs, netdev
In-Reply-To: <e9612f81-0306-6f87-2ad5-c79b848eafae@virtuozzo.com>

On Thu, Mar 15, 2018 at 04:32:30PM +0300, Kirill Tkhai wrote:
> Trond, Anna, Bruce, Jeff, David and other NFS and RXRPC people,
> could you please provide your vision on this patches?

Whoops, sorry, I haven't been paying attention.  Do you have a pointer
to documentation?  I'm unclear what the actual concurrency change
is--sounds like it becomes possible that e.g. multiple ->init methods
(from the same pernet_operations but for different namespaces) could run
in parallel?

Sounds likely to be safe, and I don't actually care too much who merges
them as they look very unlikely to conflict with anything pending.  But
unless anyone tells me otherwise I'll take the one nfsd_net_ops patch
and leave the rest to Anna or Trond.

--b.


> 
> Thanks,
> Kirill
> 
> On 13.03.2018 13:49, Kirill Tkhai wrote:
> > Hi,
> > 
> > this series continues to review and to convert pernet_operations
> > to make them possible to be executed in parallel for several
> > net namespaces in the same time. There are nfs pernet_operations
> > in this series. All of them look similar each other, they mostly
> > create and destroy caches with small exceptions.
> > 
> > Also, there is rxrpc_net_ops, which is used in AFS.
> > 
> > Thanks,
> > Kirill
> > ---
> > 
> > Kirill Tkhai (6):
> >       net: Convert rpcsec_gss_net_ops
> >       net: Convert sunrpc_net_ops
> >       net: Convert nfsd_net_ops
> >       net: Convert nfs4_dns_resolver_ops
> >       net: Convert nfs4blocklayout_net_ops
> >       net: Convert rxrpc_net_ops
> > 
> > 
> >  fs/nfs/blocklayout/rpc_pipefs.c |    1 +
> >  fs/nfs/dns_resolve.c            |    1 +
> >  fs/nfsd/nfsctl.c                |    1 +
> >  net/rxrpc/net_ns.c              |    1 +
> >  net/sunrpc/auth_gss/auth_gss.c  |    1 +
> >  net/sunrpc/sunrpc_syms.c        |    1 +
> >  6 files changed, 6 insertions(+)
> > 
> > --
> > Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> > 

^ permalink raw reply

* Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Roman Mashak @ 2018-03-15 14:28 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Cong Wang, Manish Kurup, Jiri Pirko, David S. Miller, netdev
In-Reply-To: <1d301fb86becf863283ee1be107f17e837ef4255.1521122595.git.dcaratti@redhat.com>

Davide Caratti <dcaratti@redhat.com> writes:

> when the following command
>
>  # tc actions replace action vlan pop index 100
>
> is run for the first time, and tcf_vlan_init() fails allocating struct
> tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
> the following error:
>

[...]

> fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
> only when p is not NULL.
>
> Fixes: 4c5b9d9642c8 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>  net/sched/act_vlan.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
> index e1a1b3f3983a..c2914e9a4a6f 100644
> --- a/net/sched/act_vlan.c
> +++ b/net/sched/act_vlan.c
> @@ -225,7 +225,8 @@ static void tcf_vlan_cleanup(struct tc_action *a)
>  	struct tcf_vlan_params *p;
>  
>  	p = rcu_dereference_protected(v->vlan_p, 1);
> -	kfree_rcu(p, rcu);
> +	if (p)
> +		kfree_rcu(p, rcu);
>  }
>  
>  static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,

Good catch. I think you can propagate the fix on the other actions
->cleanup(), where private parameters structure may not be present at
cleanup time, e.g. csum, ife.

^ permalink raw reply

* Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Davide Caratti @ 2018-03-15 14:29 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Cong Wang, David S. Miller, netdev
In-Reply-To: <20180315142157.GG2130@nanopsycho>

On Thu, 2018-03-15 at 15:21 +0100, Jiri Pirko wrote:
...

> Acked-by: Jiri Pirko <jiri@mellanox.com>

thank you for reviewing!

apparently, also act_tunnel_key seem and act_csum have a similar problem.
I will check and eventually do a followup series this afternoon.

thank you,
regards
-- 
davide

^ permalink raw reply

* Re: [PATCH] net: ethernet: ti: cpsw: add check for in-band mode setting with RGMII PHY interface
From: Andrew Lunn @ 2018-03-15 14:31 UTC (permalink / raw)
  To: SZ Lin (?????????)
  Cc: Schuyler Patton, Grygorii Strashko, David S. Miller,
	Ivan Khoronzhuk, Keerthy, Sekhar Nori, linux-omap, netdev,
	linux-kernel
In-Reply-To: <20180315064145.7022-1-sz.lin@moxa.com>

> @@ -1014,7 +1014,13 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
>  		/* set speed_in input in case RMII mode is used in 100Mbps */
>  		if (phy->speed == 100)
>  			mac_control |= BIT(15);
> -		else if (phy->speed == 10)
> +
> +		/* in band mode only works in 10Mbps RGMII mode */
> +		else if ((phy->speed == 10) &&
> +			 ((phy->interface == PHY_INTERFACE_MODE_RGMII) ||
> +			 (phy->interface == PHY_INTERFACE_MODE_RGMII_ID) ||
> +			 (phy->interface == PHY_INTERFACE_MODE_RGMII_RXID) ||
> +			 (phy->interface == PHY_INTERFACE_MODE_RGMII_TXID)))

Please use phy_interface_mode_is_rgmii()

	Andrew

^ permalink raw reply

* Re: [PATCH 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Alexander Duyck @ 2018-03-15 14:32 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Timur Tabi, Netdev, sulrich, linux-arm-msm, linux-arm-kernel,
	Jeff Kirsher, intel-wired-lan, LKML
In-Reply-To: <53bf7dfe-32ee-1861-e6ea-81f667590a43@codeaurora.org>

On Wed, Mar 14, 2018 at 7:17 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 3/14/2018 9:44 PM, Alexander Duyck wrote:
>> On Wed, Mar 14, 2018 at 3:57 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>>> Hi Alexander,
>>>
>>> On 3/14/2018 5:49 PM, Alexander Duyck wrote:
>>>> On Wed, Mar 14, 2018 at 5:13 AM,  <okaya@codeaurora.org> wrote:
>>>>> On 2018-03-14 01:08, Timur Tabi wrote:
>>>>>>
>>>>>> On 3/13/18 10:20 PM, Sinan Kaya wrote:
>>>
>>>> Actually I would argue this whole patch set is pointless. For starters
>>>> why is it we are only updating the Intel Ethernet drivers here?
>>>
>>> I did a regex search for wmb() followed by writel() in each drivers directory.
>>> I scrubbed the ones I care about and posted this series. Note also that
>>> I have one Infiniband patch in the series.
>>
>> I didn't see it as it I was only looking at the patches that had ended
>> up in intel-wired-lan. Also was there a cover page, I couldn't seem to
>> find that on LKML.
>
> Yeah, I didn't have a cover page. These patches were sitting on my branch
> for a while. I wanted to get them out without putting too much effort into
> it. I'll add it on the next version.
>
>>
>>> I considered "ease of change", "popular usage" and "performance critical
>>> path" as the determining criteria for my filtering.
>>
>> It might be advisable to break things up to subsystem or family. So
>> for example if you are going to update the Intel Ethernet drivers I
>> would focus on that and maybe spin the infiniband patch of into a
>> separate set that can be applied to a separate tree. This is something
>> I would consider more of a driver optimization than a fix. In our case
>> it makes it easier for us to maintain the patches to the Intel drivers
>> if you could submit just those to Jeff and Intel-wired-lan so that we
>> can take care of test and review as well as figure out what other
>> drivers will would still need to update in order to handle all the
>> cases involved in this.
>>
>>>> This
>>>> seems like something that is going to impact the whole kernel tree
>>>> since many of us have been writing drivers for some time assuming x86
>>>> style behavior.
>>>
>>> That's true. We used relaxed API heavily on ARM for a long time but
>>> it did not exist on other architectures. For this reason, relaxed
>>> architectures have been paying double penalty in order to use the common
>>> drivers.
>>>
>>> Now that relaxed API is present on all architectures, we can go and scrub
>>> all drivers to see what needs to change and what can remain.
>>
>> My only real objection is that you are going to be having to scrub
>> pretty much ALL the drivers. It seems a little like trying to fix a
>> bad tire on your car by paving the road to match the shape of the
>> tire.
>
> Or we start with mostly used ones and hope to increase the coverage over time.
> It will take a while to cover all drivers.
>
> We could do several iterations like you are suggesting for each subsystem.
>
>>
>>>>
>>>> It doesn't make sense to be optimizing the drivers for one subset of
>>>> architectures. The scope of the work needed to update the drivers for
>>>> this would be ridiculous. Also I don't see how this could be expected
>>>> to work on any other architecture when we pretty much need to have a
>>>> wmb() before calling the writel on x86 to deal with accesses between
>>>> coherent and non-coherent memory. It seems to me more like somebody
>>>> added what they considered to be an optimization somewhere that is a
>>>> workaround for a poorly written driver. Either that or the barrier is
>>>> serving a different purpose then the one we were using.
>>>
>>> Is there a semantic problem with the definition of wmb() vs. writel() vs.
>>> writel_relaxed()? I thought everything is well described in barriers
>>> document about what to expect from these APIs.
>>>
>>> AFAIK, writel() is equal to writel_relaxed() on x86 architecture.
>>> It doesn't really change anything for x86 but it saves barriers on
>>> other architectures.
>>
>> Yeah. I had to go through and do some review since my concerns have
>> been PowerPC, IA64, and x86 historically. From what I can tell all
>> those architectures are setup the same way so that shouldn't be an
>> issue.
>
> OK, glad that we are in common understanding.
>
>>
>>>>
>>>> It would make more sense to put in the effort making writel and
>>>> writel_relaxed consistent between architectures before we go through
>>>> and start modifying driver code to support different architectures.
>>>>
>>>
>>> Is there an arch problem that I'm not seeing?
>>>
>>> Sinan
>>
>> It isn't really an arch problem I have as a logistical one. It just
>> seems like this is really backwards in terms of how this has been
>> handled. For the x86 we have historically had to deal with the
>> barriers for this kind of stuff ourselves, now for ARM and a couple
>> other architectures they seem to have incorporated the barriers into
>> writel and are expecting everyone to move over to writel_relaxed.
>
> You want to move to writel_relaxed() only if you know that your register
> accesses won't have any side effects.
>
> If you require some memory update to be observable to the HW before
> doing a register write, right thing is to do
>
> wmb() + writel_relaxed()
>
> wmb() + writel() is clearly the wrong choice and that's what the goal of
> this change.
>
> If we know that all writel() adaptations in all architectures guarantee
> observability, another option is to get rid of wmb() and just keep writel().
>
> I'm not so convinced about this and hoping that someone will correct me.
>
> wmb() on x86 seems to have an sfence but writel() seems to have a compiler
> barrier in it. So, the type of barrier wmb() is using is different.
>
> we can't say
>
> (wmb()+ writel_relaxed()) == writel()
>
> for all architectures, but maybe I'm wrong.
>
> We really don't want to convert all writel() to writel_relaxed() blindly
> without giving much thought into it.
>
> This was also another reason why I limited the changes to wmb() + writel()
> combinations only.
>
> If there is wmb() + code + writel() and if we convert this to wmb() + code +
> writel_relaxed(), code will not be observed by the HW and this might break
> the driver.

So that is where things will be a bit trickier to understand from the
perspective of someone who hasn't worked in our drivers for a while.

We tend to do something like:
  update tx_buffer_info
  update tx_desc
  wmb()
  point first tx_buffer_info next_to_watch value at last tx_desc
  update next_to_use
  notify device via writel

We do it this way because we have to synchronize between the Tx
cleanup path and the hardware so we basically lump the two barriers
together. instead of invoking both a smp_wmb and a wmb. Now that I
look at the pseudocode though I wonder if we shouldn't move the
next_to_use update before the wmb, but that might be material for
another patch. Anyway, in the Tx cleanup path we should have an
smp_rmb() after we read the next_to_watch values so that we avoid
reading any of the other fields in the buffer_info if either the field
is NULL or the descriptor pointed to has not been written back.

>> It
>> seems like instead of going that route they should have probably just
>> looked at pushing the ARM drivers to something like a "writel_strict"
>> and adopted the behavior of the other architectures for writel.
>>
>> I'll go back through and review. It looks like a number of items were missed.
>>
>
> OK, I'll take a look at each one. Saw the code concern on the e1000e suggestion.
> I wanted to raise it here first before inspecting the rest.

In the case of the Intel drivers it is pretty straightforward as most
of the drivers wrap the writel usage in a macro with the exception of
hot-path areas. In my review feedback I only focused on the tail
updates for the Tx and Rx rings. We have a few other spots where
writel is used to update the interrupt registers but I figure we can
leave those as-is for now as we would want to guarantee ordering of
the tail writes versus the register writes to re-enable the interrupt
for the queue.

^ permalink raw reply

* Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
From: Roman Mashak @ 2018-03-15 14:35 UTC (permalink / raw)
  To: Liran Alon
  Cc: daniel, netdev, shmulik.ladkani, davem, linux-kernel, yuval.shaia,
	idan.brown
In-Reply-To: <be0d17cf-12d5-440c-adee-e943ccb199c9@default>

Liran Alon <liran.alon@oracle.com> writes:


[...]

>> Overall I think it might be nice to not need scrubbing skb in such
>> cases,
>> although my concern would be that this has potential to break
>> existing
>> setups when they would expect mark being zero on other veth peer in
>> any
>> case since it's the behavior for a long time already. The safer
>> option
>> would be to have some sort of explicit opt-in e.g. on link creation to
>> let
>> the skb->mark pass through unscrubbed. This would definitely be a
>> useful
>> option e.g. when mark is set in the netns facing veth via
>> clsact/egress
>> on xmit and when the container is unprivileged anyway.
>> 
>> Thanks,
>> Daniel
>
> I see your point in regards to backwards comparability.
> However, not scrubbing skb when it cross netns via some kernel functions compared to
> others is basically a bug which could easily break with a little bit of more refactoring.
> Therefore, it seems a bit weird to me to from now on, we will force
> every user on link creation to consider that once there was a bug leading
> to this weird behavior on specific netdevs.
> Thus, I suggest to maybe control this via a global /proc/sys/net file instead.

One valid use case could be preserving a source namespace nsid in
skb->mark when a packet crosses netns.

^ permalink raw reply

* Re: Inphi IN112525 PHY
From: Andrew Lunn @ 2018-03-15 14:39 UTC (permalink / raw)
  To: Vicen?iu Galanopulo; +Cc: netdev@vger.kernel.org
In-Reply-To: <HE1PR0402MB35781CC11AAC5DB372D194CFEED00@HE1PR0402MB3578.eurprd04.prod.outlook.com>

> I have a patch ready for this, but we would like to know if this is right way to go about this.

Please post the patch, so we can review it.

What you are proposing sounds O.K, but we should also try to put some
pressure on vendors to conform to the standard. Have you asked them if
a firmware update can fix this?

       Andrew

^ permalink raw reply

* Re: [PATCH net-next nfs 0/6] Converting pernet_operations (part #7)
From: Kirill Tkhai @ 2018-03-15 14:46 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: davem, trond.myklebust, anna.schumaker, jlayton, dhowells,
	keescook, dwindsor, ishkamiel, elena.reshetova, linux-nfs,
	linux-afs, netdev
In-Reply-To: <20180315142452.GA17336@fieldses.org>

On 15.03.2018 17:24, J. Bruce Fields wrote:
> On Thu, Mar 15, 2018 at 04:32:30PM +0300, Kirill Tkhai wrote:
>> Trond, Anna, Bruce, Jeff, David and other NFS and RXRPC people,
>> could you please provide your vision on this patches?
> 
> Whoops, sorry, I haven't been paying attention.  Do you have a pointer
> to documentation?  I'm unclear what the actual concurrency change
> is--sounds like it becomes possible that e.g. multiple ->init methods
> (from the same pernet_operations but for different namespaces) could run
> in parallel?

There is a commentary near struct pernet_operations in fresh net-next.git:

struct pernet_operations {
        struct list_head list;
        /*
         * Below methods are called without any exclusive locks.
         * More than one net may be constructed and destructed
         * in parallel on several cpus. Every pernet_operations
         * have to keep in mind all other pernet_operations and
         * to introduce a locking, if they share common resources.
         *
         * Exit methods using blocking RCU primitives, such as
         * synchronize_rcu(), should be implemented via exit_batch.
         * Then, destruction of a group of net requires single
         * synchronize_rcu() related to these pernet_operations,
         * instead of separate synchronize_rcu() for every net.
         * Please, avoid synchronize_rcu() at all, where it's possible.
         */

I hope this is enough. Please tell me, if there is unclear thing, I'll
extend it.
 
> Sounds likely to be safe, and I don't actually care too much who merges
> them as they look very unlikely to conflict with anything pending.  But
> unless anyone tells me otherwise I'll take the one nfsd_net_ops patch
> and leave the rest to Anna or Trond.

Sounds great. Thanks!

Kirill

>> Thanks,
>> Kirill
>>
>> On 13.03.2018 13:49, Kirill Tkhai wrote:
>>> Hi,
>>>
>>> this series continues to review and to convert pernet_operations
>>> to make them possible to be executed in parallel for several
>>> net namespaces in the same time. There are nfs pernet_operations
>>> in this series. All of them look similar each other, they mostly
>>> create and destroy caches with small exceptions.
>>>
>>> Also, there is rxrpc_net_ops, which is used in AFS.
>>>
>>> Thanks,
>>> Kirill
>>> ---
>>>
>>> Kirill Tkhai (6):
>>>       net: Convert rpcsec_gss_net_ops
>>>       net: Convert sunrpc_net_ops
>>>       net: Convert nfsd_net_ops
>>>       net: Convert nfs4_dns_resolver_ops
>>>       net: Convert nfs4blocklayout_net_ops
>>>       net: Convert rxrpc_net_ops
>>>
>>>
>>>  fs/nfs/blocklayout/rpc_pipefs.c |    1 +
>>>  fs/nfs/dns_resolve.c            |    1 +
>>>  fs/nfsd/nfsctl.c                |    1 +
>>>  net/rxrpc/net_ns.c              |    1 +
>>>  net/sunrpc/auth_gss/auth_gss.c  |    1 +
>>>  net/sunrpc/sunrpc_syms.c        |    1 +
>>>  6 files changed, 6 insertions(+)
>>>
>>> --
>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>

^ permalink raw reply

* Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
From: Daniel Borkmann @ 2018-03-15 14:53 UTC (permalink / raw)
  To: Roman Mashak, Liran Alon
  Cc: netdev, shmulik.ladkani, davem, linux-kernel, yuval.shaia,
	idan.brown
In-Reply-To: <85woydh0e9.fsf@mojatatu.com>

On 03/15/2018 03:35 PM, Roman Mashak wrote:
> Liran Alon <liran.alon@oracle.com> writes:
> [...]
>>> Overall I think it might be nice to not need scrubbing skb in such
>>> cases,
>>> although my concern would be that this has potential to break
>>> existing
>>> setups when they would expect mark being zero on other veth peer in
>>> any
>>> case since it's the behavior for a long time already. The safer
>>> option
>>> would be to have some sort of explicit opt-in e.g. on link creation to
>>> let
>>> the skb->mark pass through unscrubbed. This would definitely be a
>>> useful
>>> option e.g. when mark is set in the netns facing veth via
>>> clsact/egress
>>> on xmit and when the container is unprivileged anyway.
>>>
>>> Thanks,
>>> Daniel
>>
>> I see your point in regards to backwards comparability.
>> However, not scrubbing skb when it cross netns via some kernel functions compared to
>> others is basically a bug which could easily break with a little bit of more refactoring.
>> Therefore, it seems a bit weird to me to from now on, we will force
>> every user on link creation to consider that once there was a bug leading
>> to this weird behavior on specific netdevs.

Why bug specifically? It could well be that for some unpriv containers
it would be fine to do e.g. in cases where orchestrator sets up clsact/
egress on veth/ipvlan/etc in the container to set the mark and where app
cannot mess with this while for others you need to act out of host facing
veth; thus, explicit opt-in per dev could provide such more fine grained
control.

> One valid use case could be preserving a source namespace nsid in
> skb->mark when a packet crosses netns.

Right, was thinking about something similar.

^ permalink raw reply

* Fw: [Bug 199121] New: Packet header is incorrect when following through an IPsec tunnel after upgrade kernel to 4.15
From: Stephen Hemminger @ 2018-03-15 14:59 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller; +Cc: netdev



Begin forwarded message:

Date: Thu, 15 Mar 2018 06:37:27 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 199121] New: Packet header is incorrect when following through an IPsec tunnel after upgrade kernel to 4.15


https://bugzilla.kernel.org/show_bug.cgi?id=199121

            Bug ID: 199121
           Summary: Packet header is incorrect when following through an
                    IPsec tunnel after upgrade kernel to 4.15
           Product: Networking
           Version: 2.5
    Kernel Version: 4.15.9
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: posonsky@yandex.ru
        Regression: No

I have been using IPsec tunnel for a while. StrongSwan is used for management:
```
# swanctl -l
pfsense2: #1, ESTABLISHED, IKEv2, cc04d3c5b34b4bda_i* f150c78e4fc042ef_r
  local  '90.188.239.175' @ 90.188.239.175[500]
  remote '62.152.54.102' @ 62.152.54.102[500]
  3DES_CBC/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
  established 649s ago, reauth in 2746s
  pfsense2: #1, reqid 1, INSTALLED, TUNNEL, ESP:AES_CBC-256/HMAC_SHA1_96
    installed 649s ago, rekeying in 286s, expires in 551s
    in  c41e18d6,    588 bytes,     7 packets,   643s ago
    out cfad3c32,    588 bytes,     7 packets,   643s ago
    local  192.168.8.0/24
    remote 10.10.1.0/24
```
And everything worked fine. But after updating to 4.15 traffic stopped passing.

I created [issue](https://wiki.strongswan.org/issues/2571) on
wiki.strongswan.org. During the analysis of the situation it was found, when I
try to send ICMP request to 10.10.1.248, for example, 
```
$ ping 10.10.1.248
PING 10.10.1.248 (10.10.1.248) 56(84) bytes of data.
^C
--- 10.10.1.248 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3068ms
```
the response is returned as if from 8.0.1.248.
```
# tcpdump -n -vv -i ppp0 icmp
dropped privs to tcpdump
tcpdump: listening on ppp0, link-type LINUX_SLL (Linux cooked), capture size
262144 bytes
01:42:37.767964 IP (tos 0x0, ttl 63, id 42527, offset 0, flags [none], proto
ICMP (1), length 84)
    10.10.1.248 > 192.168.8.1: ICMP echo reply, id 12345, seq 1, length 64
01:42:38.767950 IP (tos 0x0, ttl 63, id 42736, offset 0, flags [none], proto
ICMP (1), length 84)
    10.10.1.248 > 192.168.8.1: ICMP echo reply, id 12345, seq 2, length 64
01:42:39.771778 IP (tos 0x0, ttl 63, id 42807, offset 0, flags [none], proto
ICMP (1), length 84)
    10.10.1.248 > 192.168.8.1: ICMP echo reply, id 12345, seq 3, length 64
01:42:40.768358 IP (tos 0x0, ttl 63, id 42816, offset 0, flags [none], proto
ICMP (1), length 84)
    10.10.1.248 > 192.168.8.1: ICMP echo reply, id 12345, seq 4, length 64
```
I have tested on all versions of 4.15 since 4.15.1.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
From: Liran Alon @ 2018-03-15 15:01 UTC (permalink / raw)
  To: mrv
  Cc: netdev, shmulik.ladkani, daniel, davem, linux-kernel, yuval.shaia,
	idan.brown


----- mrv@mojatatu.com wrote:

> Liran Alon <liran.alon@oracle.com> writes:
> 
> 
> [...]
> 
> >> Overall I think it might be nice to not need scrubbing skb in such
> >> cases,
> >> although my concern would be that this has potential to break
> >> existing
> >> setups when they would expect mark being zero on other veth peer
> in
> >> any
> >> case since it's the behavior for a long time already. The safer
> >> option
> >> would be to have some sort of explicit opt-in e.g. on link creation
> to
> >> let
> >> the skb->mark pass through unscrubbed. This would definitely be a
> >> useful
> >> option e.g. when mark is set in the netns facing veth via
> >> clsact/egress
> >> on xmit and when the container is unprivileged anyway.
> >> 
> >> Thanks,
> >> Daniel
> >
> > I see your point in regards to backwards comparability.
> > However, not scrubbing skb when it cross netns via some kernel
> functions compared to
> > others is basically a bug which could easily break with a little bit
> of more refactoring.
> > Therefore, it seems a bit weird to me to from now on, we will force
> > every user on link creation to consider that once there was a bug
> leading
> > to this weird behavior on specific netdevs.
> > Thus, I suggest to maybe control this via a global /proc/sys/net
> file instead.
> 
> One valid use case could be preserving a source namespace nsid in
> skb->mark when a packet crosses netns.

Before and after this commit, veth peers that crosses netns zero skb->mark.
Therefore, what you suggest is a new feature unrelated to the
issue fixed by this commit.

I still think that default behavior should be to zero skb->mark only when skb
cross netdevs in different netns. And for backwards comparability, we can
consider adding a /proc/sys/net/core file to let dev_forward_skb() to always
scrub packet regardless if it crosses netdevs or not.

^ permalink raw reply

* Re: [PATCH] net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
From: Liran Alon @ 2018-03-15 15:05 UTC (permalink / raw)
  To: daniel
  Cc: netdev, shmulik.ladkani, mrv, davem, linux-kernel, yuval.shaia,
	idan.brown


----- daniel@iogearbox.net wrote:

> On 03/15/2018 03:35 PM, Roman Mashak wrote:
> > Liran Alon <liran.alon@oracle.com> writes:
> > [...]
> >>> Overall I think it might be nice to not need scrubbing skb in
> such
> >>> cases,
> >>> although my concern would be that this has potential to break
> >>> existing
> >>> setups when they would expect mark being zero on other veth peer
> in
> >>> any
> >>> case since it's the behavior for a long time already. The safer
> >>> option
> >>> would be to have some sort of explicit opt-in e.g. on link
> creation to
> >>> let
> >>> the skb->mark pass through unscrubbed. This would definitely be a
> >>> useful
> >>> option e.g. when mark is set in the netns facing veth via
> >>> clsact/egress
> >>> on xmit and when the container is unprivileged anyway.
> >>>
> >>> Thanks,
> >>> Daniel
> >>
> >> I see your point in regards to backwards comparability.
> >> However, not scrubbing skb when it cross netns via some kernel
> functions compared to
> >> others is basically a bug which could easily break with a little
> bit of more refactoring.
> >> Therefore, it seems a bit weird to me to from now on, we will
> force
> >> every user on link creation to consider that once there was a bug
> leading
> >> to this weird behavior on specific netdevs.
> 
> Why bug specifically? It could well be that for some unpriv
> containers
> it would be fine to do e.g. in cases where orchestrator sets up
> clsact/
> egress on veth/ipvlan/etc in the container to set the mark and where
> app
> cannot mess with this while for others you need to act out of host
> facing
> veth; thus, explicit opt-in per dev could provide such more fine
> grained
> control.
> 
> > One valid use case could be preserving a source namespace nsid in
> > skb->mark when a packet crosses netns.
> 
> Right, was thinking about something similar.

I agree with all the above but this behavior was not supported both
before and after this commit. skb->mark is always zeroed when crossing netns.
This commit only changes behavior for skb crossing netdevs on *same* netns
via dev_forward_skb().

Therefore, I believe we should discuss here what we want default behavior to be
and how it should be controlled for backwards comparability.
Only after we should discuss about adding an extra feature of controlling skb scrub
per netdev or something similar.

^ permalink raw reply

* [RFC PATCH 0/2] net:setup XPS mapping for each online CPU
From: Paolo Abeni @ 2018-03-15 15:08 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jeff Kirsher, Eric Dumazet, intel-wired-lan,
	Alexander Duyck

Currently, most MQ netdevice setup the default XPS configuration mapping 1-1
the first real_num_tx_queues queues and CPUs and no mapping is created for
the CPUs with id greater then real_num_tx_queues, if any.

As a result, the xmit path for unconnected sockets on such cores experiences a
relevant overhead in netdev_pick_tx(), which needs to dissect each packet and
compute its hash.

Such scenario is easily triggered. e.g. from DNS server under relevant load, as
the user-space process is moved away from the CPUs serving the softirqs (note:
this is beneficial for the overall DNS server performances).

This series introduces an helper to easily setup up XPS mapping for all the
online CPUs, and use it in the ixgbe driver, demonstrating a relevant
performance improvement in the above scenario.

Paolo Abeni (2):
  net: introduce netif_set_xps()
  ixgbe: setup XPS via netif_set_xps()

 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    | 13 ++----
 include/linux/netdevice.h                        |  6 +++
 net/core/dev.c                                   | 58 ++++++++++++++++++------
 5 files changed, 55 insertions(+), 25 deletions(-)

-- 
2.14.3

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox