Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-04-11 20:16 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <e758b90b-508b-d1e8-bd1b-41e7b40c357b@intel.com>

On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc doing both and let the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based on.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for the
> >> offlaod case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

	tglx

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: Sasha Levin @ 2018-04-11 20:04 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: David Miller, jiri@resnulli.us, jiri@mellanox.com,
	netdev@vger.kernel.org, idosch@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20180411084600.GA26298@kroah.com>

On Wed, Apr 11, 2018 at 10:46:00AM +0200, gregkh@linuxfoundation.org wrote:
>On Tue, Apr 10, 2018 at 08:43:31PM +0000, Sasha Levin wrote:
>> >Bots are starting to overwhelm actual content from human beings
>> >on this list, and I want to put my foot on the brake right now
>> >before it gets even more out of control.
>>
>> I think we're just hitting the limitations of using a mailing list. Bots
>> aren't around to spam everyone for fun, right?
>>
>> 0day bot is spammy because patches fail to compile, syzbot is spammy
>> because we have tons of bugs users can hit and the stable bot is
>> spammy because we miss lots of commits that should be in stable.
>>
>> As the kernel grows, not only the codebase is expanding but also the
>> tooling around it. While spammy, bots provide valuable input that in the
>> past would take blood, sweat and tears to acquire. We just need a better
>> way to consume it rather than blocking off these inputs.
>
>As one of the primary abusers of the "I send too much email" flood of
>mess, I'm going to start cutting down on what type of message I respond
>to a public vger list from now on.  I'll write more on the stable@ list
>about this, but for your individual patches like this, how about just
>responding to the developers themselves and not a public list?  I do
>that when I commit a patch to my tree, stripping out lists, as it's not
>a useful message for anyone other than the person who wrote the patch
>and the reviewers.

Sure.

I think we should have a discussion as to the best way to "spam" people.
There are a few alternatives I can think of (a "digest" mail once a
day/week? Web UI? different mailing list?) but we keep using the one
size fits all approach for all our bots.

I don't see a reason why we can't tailor bot output based on
maintainer/author preference? Maybe David would be less upset if he'd
see just one mail per week with commits we ask to review?

We're stuck with mailing lists for kernel development, might as well
make the best out of it.

^ permalink raw reply

* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: Neil Horman @ 2018-04-11 19:59 UTC (permalink / raw)
  To: Xin Long; +Cc: Marcelo Ricardo Leitner, network dev, linux-sctp, davem
In-Reply-To: <CADvbK_egknPZCHRkZoLZ2MuGhRyaYKcMCaqG9Apt=QJ3-o-D6A@mail.gmail.com>

On Thu, Apr 12, 2018 at 12:16:58AM +0800, Xin Long wrote:
> On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
> >> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
> >> > pf->cmp_addr() is called before binding a v6 address to the sock. It
> >> > should not check ports, like in sctp_inet_cmp_addr.
> >> >
> >> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> >> > sctp_v6_cmp_addr where it also compares the ports.
> >> >
> >> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
> >> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> >> > lack the check for ports in sctp_v6_cmp_addr").
> >> >
> >> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
> >> > but do the proper check for both v6 addrs and v4mapped addrs.
> >> >
> >> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
> >> > Reported-by: Jianwen Ji <jiji@redhat.com>
> >> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> >> > ---
> >> >  net/sctp/ipv6.c | 27 ++++++++++++++++++++++++---
> >> >  1 file changed, 24 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> >> > index f1fc48e..be4b72c 100644
> >> > --- a/net/sctp/ipv6.c
> >> > +++ b/net/sctp/ipv6.c
> >> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >> >                            const union sctp_addr *addr2,
> >> >                            struct sctp_sock *opt)
> >> >  {
> >> > -   struct sctp_af *af1, *af2;
> >> >     struct sock *sk = sctp_opt2sk(opt);
> >> > +   struct sctp_af *af1, *af2;
> >> >
> >> >     af1 = sctp_get_af_specific(addr1->sa.sa_family);
> >> >     af2 = sctp_get_af_specific(addr2->sa.sa_family);
> >> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >> >     if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
> >> >             return 1;
> >> >
> >> > -   if (addr1->sa.sa_family != addr2->sa.sa_family)
> >> > +   if (addr1->sa.sa_family != addr2->sa.sa_family) {
> >> > +           if (addr1->sa.sa_family == AF_INET &&
> >> > +               addr2->sa.sa_family == AF_INET6 &&
> >> > +               ipv6_addr_v4mapped(&addr2->v6.sin6_addr))
> >> > +                   if (addr2->v6.sin6_addr.s6_addr32[3] ==
> >> > +                       addr1->v4.sin_addr.s_addr)
> >> > +                           return 1;
> >> > +           if (addr2->sa.sa_family == AF_INET &&
> >> > +               addr1->sa.sa_family == AF_INET6 &&
> >> > +               ipv6_addr_v4mapped(&addr1->v6.sin6_addr))
> >> > +                   if (addr1->v6.sin6_addr.s6_addr32[3] ==
> >> > +                       addr2->v4.sin_addr.s_addr)
> >> > +                           return 1;
> >> > +           return 0;
> >> > +   }
> >> > +
> >> > +   if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
> >> > +           return 0;
> >> > +
> >> > +   if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
> >> > +       addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> >> > +       addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
> >> >             return 0;
> >> >
> >> > -   return af1->cmp_addr(addr1, addr2);
> >> > +   return 1;
> >> >  }
> >> >
> >> >  /* Verify that the provided sockaddr looks bindable.   Common verification,
> >> > --
> >> > 2.1.0
> >> >
> >> This looks correct to me, but is it worth duplicating the comparison code like
> >> this from the cmp_addr function?  It might be more worthwhile to add a flag to
> >> the cmp_addr method to direct weather it needs to check port values or not.
> >> That way you could continue to use the cmp_addr function here.
> >
> > Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> > read. It's already not one of the best looking part of it. Maybe
> > still duplicate part of it it, but at 'af' level? As in:
> > - af->cmp_addr
> > - af->cmp_addr_port
> >
> What do you think of:
> 
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
>                             const union sctp_addr *addr2)
> {
>         return __sctp_v6_cmp_addr(addr1, addr2) &&
>                addr1->v6.sin_port == addr2->v6.sin_port;
> }
> 
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
>  we've exploited this in many places in SCTP)
Yes, I'd be ok with that
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 19:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <871sflk0zc.fsf@xmission.com>

On Wed, Apr 11, 2018 at 02:16:23PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> > Yeah, agreed.
> >> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> >> > user namespace actually receives uevents we need to:
> >> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >> >    namespaces that are owned by init_user_ns
> >> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >> >    uevent socket that is owned by init_user_ns *from* a
> >> >> >    non-init_user_ns
> >> >> >
> >> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> >> > socket list who are owned by init_user_ns.
> >> >> > But to account for 2. we need to filter by the user namespace who owns
> >> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >> >
> >> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> >> > --- a/lib/kobject_uevent.c
> >> >> > +++ b/lib/kobject_uevent.c
> >> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >  		return sock_ns != ns;
> >> >> >  	}
> >> >> >  
> >> >> > -	return 0;
> >> >> > + 	/* Check if socket was opened from non-initial user namespace. */
> >> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >  }
> >> >> >  #endif
> >> >> >  
> >> >> >
> >> >> > But correct me if I'm wrong.
> >> >> 
> >> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> >> permissions and an explicit opt-in to receiving packets from multiple
> >> >> network namespaces.
> >> >
> >> > I don't think that's what I'm talking about unless that is somehow the
> >> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> >> > doing
> >> >
> >> > unshare -U --map-root
> >> >
> >> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> >> > uevents. Imho, this should not be possible because I'm in a
> >> > non-init_user_ns. But currently I'm able to - even with the patch to
> >> > come - since the uevent socket in the kernel was created when init_net
> >> > was created and hence is *owned* by the init_user_ns which means it is
> >> > in the list of uevent sockets. Here's a demo of what I mean:
> >> >
> >> > https://asciinema.org/a/175632
> >> 
> >> Why do you care about this case?
> >
> > It's not so much that I care about this case since any workload that
> > wants to run a separate udevd will have to unshare the network namespace
> > and the user namespace for it to make complete sense.
> > What I do care about is that the two of us are absolutely in the clear
> > about what semantics we are going to expose to userspace and it seems
> > that we were talking past each other wrt to this "corner case".
> > For userspace, it needs to be very clear that the intention is to filter
> > by *owning user namespace of the network namespace a given task resides
> > in* and not by user namespace of the task per se. This is what this
> > corner case basically shows, I think.
> 
> If this is just a clarification of semantics then yes this is a
> productive question.  I almost agree with your definition above.
> 
> I would make the definition very simple.  Uevents will not be broadcast
> via netlink in a network namespace where net->user_ns != &init_user_ns,
> with the exception of uevents for network devices in that network
> namespace.

Well, for the sake of posterity :) I should add that I'd prefer we'd add
what I suggested above:

-	return 0;
+ 	/* Check if socket was opened from non-initial user namespace. */
+ 	return sk_user_ns(dsk) != &init_user_ns;
 }

to slam the door shut once and for all for all non-init_user_ns
namespaces because it *seems* like the cleanest solution: uevents are
owned by init_user_ns; period. Because it is the only user namespace
that can do anything interesting with them *by default*.
But what we have now right now with my upcoming patch is at least
sufficient and safe.

Christian

> 
> The existing filtering by the sending uid and verifying that it is uid 0
> gives a little more room to filter if we want (as udev & friends will
> ignore the uevent), but I don't see the point.
> 
> Eric

^ permalink raw reply

* Re: WARNING: possible recursive locking detected
From: Julian Anastasov @ 2018-04-11 19:45 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, Kirill Tkhai, LKML, Xin Long, netdev,
	syzkaller-bugs
In-Reply-To: <CACT4Y+Z3Ktnio1B0TAZpx8HOEfd2evwyQqbMeWGvEkzh0yfhKQ@mail.gmail.com>


	Hello,

On Wed, 11 Apr 2018, Dmitry Vyukov wrote:

> On Wed, Apr 11, 2018 at 4:02 PM, syzbot
> <syzbot+3c43eecd7745a5ce1640@syzkaller.appspotmail.com> wrote:
> > Hello,
> >
> > syzbot hit the following crash on upstream commit
> > b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
> > Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=3c43eecd7745a5ce1640
> >
> > So far this crash happened 3 times on upstream.
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5103706542440448
> > syzkaller reproducer:
> > https://syzkaller.appspot.com/x/repro.syz?id=5641659786199040
> > Raw console output:
> > https://syzkaller.appspot.com/x/log.txt?id=5099510896263168
> > Kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
> > compiler: gcc (GCC) 8.0.1 20180301 (experimental)
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+3c43eecd7745a5ce1640@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> #syz dup: possible deadlock in rtnl_lock (5)

	Yes, patch is now in the "nf" tree, so all these
lockups around start_sync_thread should be resolved soon...

> > IPVS: sync thread started: state = BACKUP, mcast_ifn = lo, syncid = 0, id =
> > 0
> > IPVS: stopping backup sync thread 4546 ...
> >
> > ============================================
> > IPVS: stopping backup sync thread 4559 ...
> > WARNING: possible recursive locking detected

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Julian Anastasov @ 2018-04-11 19:37 UTC (permalink / raw)
  To: Stephen Suryaputra; +Cc: netdev
In-Reply-To: <1523415335-17154-1-git-send-email-ssuryaextr@gmail.com>


	Hello,

On Tue, 10 Apr 2018, Stephen Suryaputra wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
> ---

...

> diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
> index b54b948..bb9be11 100644
> --- a/net/ipv4/ip_forward.c
> +++ b/net/ipv4/ip_forward.c
> @@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
>  {
>  	struct ip_options *opt	= &(IPCB(skb)->opt);
>  
> -	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
> -	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
> +	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
> +	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
>  
>  	if (unlikely(opt->optlen))
>  		ip_forward_options(skb);
> @@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
>  	IPCB(skb)->flags |= IPSKB_FORWARDED;
>  	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
>  	if (ip_exceeds_mtu(skb, mtu)) {
> -		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
> +		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
>  		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  			  htonl(mtu));
>  		goto drop;
> @@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
>  
>  too_many_hops:
>  	/* Tell the sender its packet died... */
> -	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> +	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);

	May be skb->dev if we want to account it to the
input device.

>  	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>  drop:
>  	kfree_skb(skb);

...

> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..32bd3af 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>  	{
>  		if (ip_hdr(skb)->ttl <= 1) {
>  			/* Tell the sender its packet died... */
> -			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> +			__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);

	At this point, skb_dst(skb) can be:

- input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device
- output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL

	We should see this error on LOCAL_IN but better to be
safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just
'skb_dst(skb)->dev'.

>  			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>  			return false;
>  		}

	The patch probably has other errors, for example,
using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error,
may be 'dev' should be used instead...

Regards

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 19:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411185715.GA15862@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> > Yeah, agreed.
>> >> > But I think the patch is not complete. To guarantee that no non-initial
>> >> > user namespace actually receives uevents we need to:
>> >> > 1. only sent uevents to uevent sockets that are located in network
>> >> >    namespaces that are owned by init_user_ns
>> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >> >    uevent socket that is owned by init_user_ns *from* a
>> >> >    non-init_user_ns
>> >> >
>> >> > We account for 1. by only recording uevent sockets in the global uevent
>> >> > socket list who are owned by init_user_ns.
>> >> > But to account for 2. we need to filter by the user namespace who owns
>> >> > the socket in mc_list. So in addition to that we also need to slightly
>> >> > change the filter logic in kobj_bcast_filter() I think:
>> >> >
>> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> > index 22a2c1a98b8f..064d7d29ace5 100644
>> >> > --- a/lib/kobject_uevent.c
>> >> > +++ b/lib/kobject_uevent.c
>> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >  		return sock_ns != ns;
>> >> >  	}
>> >> >  
>> >> > -	return 0;
>> >> > + 	/* Check if socket was opened from non-initial user namespace. */
>> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
>> >> >  }
>> >> >  #endif
>> >> >  
>> >> >
>> >> > But correct me if I'm wrong.
>> >> 
>> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> >> permissions and an explicit opt-in to receiving packets from multiple
>> >> network namespaces.
>> >
>> > I don't think that's what I'm talking about unless that is somehow the
>> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
>> > doing
>> >
>> > unshare -U --map-root
>> >
>> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
>> > uevents. Imho, this should not be possible because I'm in a
>> > non-init_user_ns. But currently I'm able to - even with the patch to
>> > come - since the uevent socket in the kernel was created when init_net
>> > was created and hence is *owned* by the init_user_ns which means it is
>> > in the list of uevent sockets. Here's a demo of what I mean:
>> >
>> > https://asciinema.org/a/175632
>> 
>> Why do you care about this case?
>
> It's not so much that I care about this case since any workload that
> wants to run a separate udevd will have to unshare the network namespace
> and the user namespace for it to make complete sense.
> What I do care about is that the two of us are absolutely in the clear
> about what semantics we are going to expose to userspace and it seems
> that we were talking past each other wrt to this "corner case".
> For userspace, it needs to be very clear that the intention is to filter
> by *owning user namespace of the network namespace a given task resides
> in* and not by user namespace of the task per se. This is what this
> corner case basically shows, I think.

If this is just a clarification of semantics then yes this is a
productive question.  I almost agree with your definition above.

I would make the definition very simple.  Uevents will not be broadcast
via netlink in a network namespace where net->user_ns != &init_user_ns,
with the exception of uevents for network devices in that network
namespace.

The existing filtering by the sending uid and verifying that it is uid 0
gives a little more room to filter if we want (as udev & friends will
ignore the uevent), but I don't see the point.

Eric

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-11 19:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <20180411155127.GQ2028@nanopsycho>

On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each bypass
>> instance along with a set of ops to manage the slave events.
>>      bypass_master_register()
>>      bypass_master_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the bypass module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>       bypass_master_create()
>>       bypass_master_destroy()
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/linux/netdevice.h |  14 +
>> include/net/bypass.h      |  96 ++++++
>> net/Kconfig               |  18 +
>> net/core/Makefile         |   1 +
>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 973 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..587293728f70 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> 	IFF_PHONY_HEADROOM		= 1<<24,
>> 	IFF_MACSEC			= 1<<25,
>> 	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_BYPASS			= 1 << 27,
>> +	IFF_BYPASS_SLAVE		= 1 << 28,
> I wonder, why you don't follow the existing coding style... Also, please
> add these to into the comment above.

To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
to the existing coding style to be consistent.

>
>
>> };
>>
>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC			IFF_MACSEC
>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_BYPASS			IFF_BYPASS
>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>
>> /**
>>   *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS;
>> +}
>> +
>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..86b02cb894cf
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,96 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> +				  struct net_device *bypass_netdev);
>> +	int (*slave_join)(struct net_device *slave_netdev,
>> +			  struct net_device *bypass_netdev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> +				    struct net_device *bypass_netdev);
>> +	int (*slave_release)(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev);
>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>> +				 struct net_device *bypass_netdev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_master {
>> +	struct list_head list;
>> +	struct net_device __rcu *bypass_netdev;
>> +	struct bypass_ops __rcu *ops;
>> +};
>> +
>> +/* bypass state */
>> +struct bypass_info {
>> +	/* passthru netdev with same MAC */
>> +	struct net_device __rcu *active_netdev;
> You still use "active"/"backup" names which is highly misleading as
> it has completely different meaning that in bond for example.
> I noted that in my previous review already. Please change it.

I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
matches with the BACKUP feature bit we are adding to virtio_net.

With regards to alternate names for 'active', you suggested 'stolen', but i
am not too happy with it.
netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'



>
>
>> +
>> +	/* virtio_net netdev */
>> +	struct net_device __rcu *backup_netdev;
>> +
>> +	/* active netdev stats */
>> +	struct rtnl_link_stats64 active_stats;
>> +
>> +	/* backup netdev stats */
>> +	struct rtnl_link_stats64 backup_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 bypass_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +void bypass_master_destroy(struct bypass_master *bypass_master);
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master);
>> +void bypass_master_unregister(struct bypass_master *bypass_master);
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct pbypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> 	  devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> +	tristate "Bypass interface"
>> +	---help---
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> +	tristate
>> +	default m if NET_BYPASS=m
>> +	default y if NET_BYPASS=y || NET_BYPASS=n
>> +	help
>> +	  Drivers using the bypass infrastructure should have a dependency
>> +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> +	  bypass is a loadable module and the driver using it is built-in.
>> +
>> endif   # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..b5b9cb554c3f
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_master_list);
>> +static DEFINE_SPINLOCK(bypass_lock);
>> +
>> +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> +				     struct net_device *bypass_netdev,
>> +				     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_register(slave_netdev,
>> +						      bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> +			rtnl_dereference(bi->active_netdev)) {
>> +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_netdev->name, backup ? "backup" : "active");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!slave_netdev->dev.parent ||
>> +			!dev_is_pci(slave_netdev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_join(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev,
>> +			     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +
>> +	dev_hold(slave_netdev);
>> +
>> +	if (backup) {
>> +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> +	}
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> As I wrote the last time, you don't need this list, spinlock.
> You can do just something like:
>          for_each_net(net) {
>                  for_each_netdev(net, dev) {
> 			if (netif_is_bypass_master(dev)) {

This function returns the upper netdev as well as the ops associated
with that netdev.
bypass_master_list is a list of 'struct bypass_master' that associates
'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
We need 'ops' only to support the 2 netdev model of netvsc. ops will be
NULL for 3-netdev model.


>
>
>
>
>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> +			*ops = rcu_dereference(bypass_master->ops);
> I don't see how rcu_dereference is ok here.
> 1) I don't see rcu_read_lock taken
> 2) Looks like bypass_master->ops has the same value across the whole
>     existence.

We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
Yes. ops doesn't change.

>
>
>> +			spin_unlock(&bypass_lock);
>> +			return bypass_netdev;
>> +		}
>> +	}
>> +	spin_unlock(&bypass_lock);
>> +	return NULL;
>> +}
>> +
>> +static int bypass_slave_register(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret, orig_mtu;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
> For master, could you use word "master" in the variables so it is clear?
> Also, "dev" is fine instead of "netdev".
> Something like "bpmaster_dev"

bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
I can change all _netdev suffixes to _dev to make the names shorter.


>
>
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> +					bypass_ops);
>> +	if (ret != 0)
> 	Just "if (ret)" will do. You have this on more places.

OK.


>
>
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_netdev,
>> +					 bypass_ops ? bypass_ops->handle_frame :
>> +					 bypass_handle_frame, bypass_netdev);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> +			   bypass_netdev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> +
>> +	if (netif_running(bypass_netdev)) {
>> +		ret = dev_open(slave_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> +				   slave_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with master */
>> +	orig_mtu = slave_netdev->mtu;
>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> +	if (ret != 0) {
>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_netdev->name, bypass_netdev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> +	if (ret != 0)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> +		    slave_netdev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	dev_set_mtu(slave_netdev, orig_mtu);
>> +err_set_mtu:
>> +	dev_close(slave_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> +				       struct net_device *bypass_netdev,
>> +				       struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> +							bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_release(struct net_device *slave_netdev,
>> +				struct net_device *bypass_netdev,
>> +				struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_release)
>> +			return -EINVAL;
> I think it would be good to make the API to the driver more strict and
> have a separate set of ops for "active" and "backup" netdevices.
> That should stop people thinking about extending this to more slaves in
> the future.

We have checks in slave_pre_register() that allows only 1 'backup' and 1
'active' slave.


>
>
>
>> +
>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev == backup_netdev) {
>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> +		if (backup_netdev) {
>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> +					  bypass_ops);
>> +	if (ret != 0)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +
>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> +		    slave_netdev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> +
>> +static bool bypass_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	struct bypass_info *bi;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return bypass_ops->slave_link_change(slave_netdev,
>> +						     bypass_netdev);
>> +	}
>> +
>> +	if (!netif_running(bypass_netdev))
>> +		return 0;
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		goto done;
> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> above is enough.

I think we need this check to not allow events from a slave that is not
attached to this master but has the same MAC.

>
>
>> +
>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> +		netif_carrier_on(bypass_netdev);
>> +		netif_tx_wake_all_queues(bypass_netdev);
>> +	} else {
>> +		netif_carrier_off(bypass_netdev);
>> +		netif_tx_stop_all_queues(bypass_netdev);
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_bypass_master(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as VF */
>> +	if (is_vlan_dev(dev))
>> +		return false;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> Yeah, this is certainly incorrect. One thing is, you should be using the
> helpers netif_is_bond_master().
> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>
> You need to do it not by blacklisting, but with whitelisting. You need
> to whitelist VF devices. My port flavours patchset might help with this.

May be i can use netdev_has_lower_dev() helper to make sure that the slave
device is not an upper dev.
Can you point to your port flavours patchset? Is it upstream?

>
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!bypass_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return bypass_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return bypass_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return bypass_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> +	.notifier_call = bypass_event,
>> +};
>> +
>> +int bypass_open(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		err = dev_open(active_netdev);
>> +		if (err)
>> +			goto err_active_open;
>> +	}
>> +
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		err = dev_open(backup_netdev);
>> +		if (err)
>> +			goto err_backup_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_backup_open:
>> +	dev_close(active_netdev);
>> +err_active_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_open);
>> +
>> +int bypass_close(struct net_device *dev)
>> +{
>> +	struct bypass_info *vi = netdev_priv(dev);
> This should be probably "bi"

Yes.


>
>
>> +	struct net_device *slave_netdev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_close);
>> +
>> +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> If you rename the other variable to "bpmaster_dev", it would be nice to
> rename this to bpinfo or something more descriptive. "bi" is too short
> to know what that is right away.

Will rename bypass_netdev to bypass_dev. bypass indicates that it is
an upper master dev.


>
>
>> +	struct net_device *xmit_dev;
> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.

OK.


>
>
>
>> +
>> +	/* Try xmit via active netdev followed by backup netdev */
>> +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> +			return bypass_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> +
>> +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +			void *accel_priv, select_queue_fallback_t fallback)
>> +{
>> +	/* This helper function exists to help dev_pick_tx get the correct
>> +	 * destination queue.  Using a helper function skips a call to
>> +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> +	 * way down to the bonding driver.
>> +	 */
>> +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +			      const struct rtnl_link_stats64 *_new,
>> +			      const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> You can WARN_ON and return in case the dev is not bypass master, just
> to catch buggy drivers. Same with other helpers.

I can make this static and not export this helper as well as all
bypass_netdev ops.

>
>
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_netdev;
>> +
>> +	spin_lock(&bi->stats_lock);
>> +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->active_stats);
>> +		memcpy(&bi->active_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> +	spin_unlock(&bi->stats_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> +
>> +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int ret = 0;
> Pointless initialization.
>
>
>> +
>> +	active_netdev = rcu_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		ret = dev_set_mtu(active_netdev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(active_netdev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> +
>> +void bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> +
>> +static const struct net_device_ops bypass_netdev_ops = {
>> +	.ndo_open		= bypass_open,
>> +	.ndo_stop		= bypass_close,
>> +	.ndo_start_xmit		= bypass_start_xmit,
>> +	.ndo_select_queue	= bypass_select_queue,
>> +	.ndo_get_stats64	= bypass_get_stats,
>> +	.ndo_change_mtu		= bypass_change_mtu,
>> +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
>> +};
>> +
>> +#define BYPASS_DRV_NAME "bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> +				       struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> +				      struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops bypass_ethtool_ops = {
>> +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> +{
>> +	struct net *net = dev_net(bypass_netdev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == bypass_netdev)
>> +			continue;
>> +		if (!bypass_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> +			bypass_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master)
>> +{
>> +	struct bypass_master *bypass_master;
>> +
>> +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> +	if (!bypass_master)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(bypass_master->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_BYPASS;
>> +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	bypass_register_existing_slave(dev);
>> +
>> +	*pbypass_master = bypass_master;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_register);
>> +
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +
>> +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> +	dev_put(bypass_netdev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_del(&bypass_master->list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	kfree(bypass_master);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master)
>> +{
>> +	struct device *dev = backup_netdev->dev.parent;
>> +	struct net_device *bypass_netdev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> +	bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow bypass devices to change network namespaces. */
>> +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> +	       bypass_netdev->addr_len);
>> +
>> +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> +	err = register_netdev(bypass_netdev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(bypass_netdev);
>> +
>> +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> +	if (err < 0)
> just "if (err)" would do.

OK

>
>
>> +		goto err_bypass;
>> +
>> +	return 0;
>> +
>> +err_bypass:
>> +	unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> +	free_netdev(bypass_netdev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_create);
>> +
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct net_device *slave_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (!bypass_master)
>> +		return;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	netif_device_detach(bypass_netdev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	bypass_master_unregister(bypass_master);
>> +
>> +	unregister_netdevice(bypass_netdev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(bypass_netdev);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> +	register_netdevice_notifier(&bypass_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3
>>

^ permalink raw reply

* [PULL] virtio: feature pull
From: Michael S. Tsirkin @ 2018-04-11 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, virtualization, netdev, linux-kernel, jonathan.helman, mst

The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:

  vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:

  virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 +0300)

----------------------------------------------------------------
virtio: feature

This adds reporting hugepage stats to virtio-balloon.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Jonathan Helman (1):
      virtio_balloon: export hugetlb page allocation counts

 drivers/virtio/virtio_balloon.c     | 6 ++++++
 include/uapi/linux/virtio_balloon.h | 4 +++-
 2 files changed, 9 insertions(+), 1 deletion(-)

^ permalink raw reply

* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Saeed Mahameed @ 2018-04-11 18:59 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <75137dbf-4608-127e-1601-10a3c13e3a32@gmail.com>

On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> 
> On 04/10/2018 10:16 AM, David Miller wrote:
> > 
> > The tradeoff here is that now you are doing two unnecessary atomic
> > operations per stats dump.
> > 
> > That is what the RCU lock allows us to avoid.
> > 
> 
> dev_hold() and dev_put() are actually per cpu increment and
> decrements,
> pretty cheap these days.
> 

Yes, i am adding only 2 cpu instructions here.
I think the trade-off here is too small and the price to finally have
get_stats64 called from non atomic context is really worth it.

It  looks really odd to me that the device chain locks are held for
such long periods, while we already have the means to avoid this, same
goes for rtnl_lock, same trick can work here for many use cases and
many ndos, we are just over protective for no reasons.

> Problem here is that any preemption of the process holding device
> reference
> might trigger warnings in device unregister.
> 

This is true for any other place dev_hold is used,
as explained in the commit message dev_hold is used for a very brief
moment before calling the stats ndo and released directly after.

looking at 

netdev_wait_allrefs(...)
[...]
	msleep(250);

	refcnt = netdev_refcnt_read(dev);

	if (time_after(jiffies, warning_time + 10 * HZ)) {
		pr_emerg("unregister_netdevice: waiting for %s to
become free. Usage count = %d\n",
			 dev->name, refcnt);
		warning_time = jiffies;
	}

The holder will get enough time to release the netdev way before the
warning is triggered.

The warning is triggered only if someone holds the dev for more than 10
seconds which is impossible for the stats ndo to take more than this,
in fact i just did a quick measurement and it seems that in average
get_stats64 ndo takes 0.5us !

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 18:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <87efjllhcx.fsf@xmission.com>

On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> > Yeah, agreed.
> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> > user namespace actually receives uevents we need to:
> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >    namespaces that are owned by init_user_ns
> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >    uevent socket that is owned by init_user_ns *from* a
> >> >    non-init_user_ns
> >> >
> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> > socket list who are owned by init_user_ns.
> >> > But to account for 2. we need to filter by the user namespace who owns
> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >
> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> > --- a/lib/kobject_uevent.c
> >> > +++ b/lib/kobject_uevent.c
> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >  		return sock_ns != ns;
> >> >  	}
> >> >  
> >> > -	return 0;
> >> > + 	/* Check if socket was opened from non-initial user namespace. */
> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >> >  }
> >> >  #endif
> >> >  
> >> >
> >> > But correct me if I'm wrong.
> >> 
> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> permissions and an explicit opt-in to receiving packets from multiple
> >> network namespaces.
> >
> > I don't think that's what I'm talking about unless that is somehow the
> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> > doing
> >
> > unshare -U --map-root
> >
> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> > uevents. Imho, this should not be possible because I'm in a
> > non-init_user_ns. But currently I'm able to - even with the patch to
> > come - since the uevent socket in the kernel was created when init_net
> > was created and hence is *owned* by the init_user_ns which means it is
> > in the list of uevent sockets. Here's a demo of what I mean:
> >
> > https://asciinema.org/a/175632
> 
> Why do you care about this case?

It's not so much that I care about this case since any workload that
wants to run a separate udevd will have to unshare the network namespace
and the user namespace for it to make complete sense.
What I do care about is that the two of us are absolutely in the clear
about what semantics we are going to expose to userspace and it seems
that we were talking past each other wrt to this "corner case".
For userspace, it needs to be very clear that the intention is to filter
by *owning user namespace of the network namespace a given task resides
in* and not by user namespace of the task per se. This is what this
corner case basically shows, I think.

Christian

> 
> Everyone is allowed to listen to the uevent netlink sockets with or
> without user namespaces.  So there are no permission issues, and
> this is not even a data information leak.
> 
> If you don't want programs in your user namespace to have access you
> will be able to unshare the network namespace.
> 
> Eric

^ permalink raw reply

* Re: [PATCH v1 net 2/3] lan78xx: Add support to dump lan78xx registers
From: kbuild test robot @ 2018-04-11 18:51 UTC (permalink / raw)
  To: Raghuram Chary J
  Cc: kbuild-all, davem, netdev, unglinuxdriver, woojung.huh,
	raghuramchary.jallipalli
In-Reply-To: <20180411072450.9809-3-raghuramchary.jallipalli@microchip.com>

Hi Raghuram,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.16 next-20180411]
[cannot apply to net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Raghuram-Chary-J/lan78xx-Fixes-and-enhancements/20180411-231134
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/usb/lan78xx.c:281:5: sparse: symbol 'lan78xx_regs' was not declared. Should it be static?
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
   drivers/net/usb/lan78xx.c:2992:27: sparse: incorrect type in assignment (different base types) @@    expected restricted __wsum [usertype] csum @@    got e] csum @@
   drivers/net/usb/lan78xx.c:2992:27:    expected restricted __wsum [usertype] csum
   drivers/net/usb/lan78xx.c:2992:27:    got int

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [RFC PATCH] lan78xx: lan78xx_regs[] can be static
From: kbuild test robot @ 2018-04-11 18:51 UTC (permalink / raw)
  To: Raghuram Chary J
  Cc: kbuild-all, davem, netdev, unglinuxdriver, woojung.huh,
	raghuramchary.jallipalli
In-Reply-To: <20180411072450.9809-3-raghuramchary.jallipalli@microchip.com>


Fixes: 1373bfc60bd8 ("lan78xx: Add support to dump lan78xx registers")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 lan78xx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index cbeec78..df5c3eb 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -278,7 +278,7 @@ struct lan78xx_statstage64 {
 	u64 eee_tx_lpi_time;
 };
 
-u32 lan78xx_regs[] = {
+static u32 lan78xx_regs[] = {
 	ID_REV,
 	INT_STS,
 	HW_CFG,

^ permalink raw reply related

* Re: [PATCH net 2/2] tun: send netlink notification when the device is modified
From: David Miller @ 2018-04-11 18:48 UTC (permalink / raw)
  To: sd; +Cc: netdev, thaller, sbrivio
In-Reply-To: <d88d26513396435befee50bdc740549a4ea546a2.1523370272.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 10 Apr 2018 16:28:56 +0200

> I added dumping of link information about tun devices over netlink in
> commit 1ec010e70593 ("tun: export flags, uid, gid, queue information
> over netlink"), but didn't add the missing netlink notifications when
> the device's exported properties change.
> 
> This patch adds notifications when owner/group or flags are modified,
> when queues are attached/detached, and when a tun fd is closed.
> 
> Reported-by: Thomas Haller <thaller@redhat.com>
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net 1/2] tun: set the flags before registering the netdevice
From: David Miller @ 2018-04-11 18:48 UTC (permalink / raw)
  To: sd; +Cc: netdev, thaller, sbrivio
In-Reply-To: <0b61a3298ffa3a3fbc4ae0cae1573b368301a211.1523370272.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 10 Apr 2018 16:28:55 +0200

> Otherwise, register_netdevice advertises the creation of the device with
> the default flags, instead of what the user requested.
> 
> Reported-by: Thomas Haller <thaller@redhat.com>
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH] lan78xx: Don't reset the interface on open
From: David Miller @ 2018-04-11 18:45 UTC (permalink / raw)
  To: phil
  Cc: woojung.huh, UNGLinuxDriver, agraf, tbogendoerfer, netdev,
	linux-usb, linux-kernel
In-Reply-To: <1523362705-30032-1-git-send-email-phil@raspberrypi.org>

From: Phil Elwell <phil@raspberrypi.org>
Date: Tue, 10 Apr 2018 13:18:25 +0100

> Commit 92571a1aae40 ("lan78xx: Connect phy early") moves the PHY
> initialisation into lan78xx_probe, but lan78xx_open subsequently calls
> lan78xx_reset. As well as forcing a second round of link negotiation,
> this reset frequently prevents the phy interrupt from being generated
> (even though the link is up), rendering the interface unusable.
> 
> Fix this issue by removing the lan78xx_reset call from lan78xx_open.
> 
> Fixes: 92571a1aae40 ("lan78xx: Connect phy early")
> Signed-off-by: Phil Elwell <phil@raspberrypi.org>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
From: Alexei Starovoitov @ 2018-04-11 18:43 UTC (permalink / raw)
  To: Björn Töpel, William Tu
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Jesper Dangaard Brouer, Willem de Bruijn,
	Daniel Borkmann, Linux Kernel Network Developers,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, Zhang, Qi Z, ravineet.singh
In-Reply-To: <CAJ+HfNiJ+tS6cG4_cCj1syQC8m8OZWM44-NX-hbdRYWQj=98TA@mail.gmail.com>

On 4/11/18 5:17 AM, Björn Töpel wrote:
>
> In the current RFC you are required to create both an Rx and Tx queue
> to bind the socket, which is just weird for your "Rx on one device, Tx
> to another" scenario. I'll fix that in the next RFC.

I would defer on adding new features until the key functionality lands.
imo it's in good shape and I would submit it without RFC tag as soon as
net-next reopens.

^ permalink raw reply

* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: David Miller @ 2018-04-11 18:43 UTC (permalink / raw)
  To: saeedm
  Cc: colin.king, jgg, matanb, netdev, linux-rdma, linux-kernel, leon,
	kernel-janitors
In-Reply-To: <1523471018.3402.42.camel@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 11 Apr 2018 18:23:41 +0000

> Dave, if it is ok with you I would like to apply this into mlx5-next
> branch so it won't cause any issues with our next pull requests to rdma
> and netdev.

Ok.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Jakub Kicinski @ 2018-04-11 18:43 UTC (permalink / raw)
  To: Michael Chan, Andy Gospodarek; +Cc: davem, netdev
In-Reply-To: <1523461818-15774-3-git-send-email-michael.chan@broadcom.com>

On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>  		return false;
>  	}
>  
> +	/* Currently source/dest MAC cannot be partial wildcard  */
> +	if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> +	    !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> +		netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");

This wouldn't be something to do in net, but how do you feel about
using extack for messages like this?

> +		return false;
> +	}

^ permalink raw reply

* Re: [PATCH net v2 0/6] bnxt_en: Fixes for net.
From: David Miller @ 2018-04-11 18:42 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Wed, 11 Apr 2018 11:50:12 -0400

> This bug fix series include NULL pointer fixes in ethtool -x code path
> and in the error clean up path when freeing IRQs, a ring accounting bug
> that missed rings used by the RDMA driver, and 3 bug fixes related to TC
> Flower and VF-reps.
> 
> v2: Fixed commit message of patch 4.  Changed the pound sign to $ sign
> in front of the ip command.

Yep, that looks better.

Series applied, thanks Michael.

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 18:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411170333.GA4319@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> > Yeah, agreed.
>> > But I think the patch is not complete. To guarantee that no non-initial
>> > user namespace actually receives uevents we need to:
>> > 1. only sent uevents to uevent sockets that are located in network
>> >    namespaces that are owned by init_user_ns
>> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >    uevent socket that is owned by init_user_ns *from* a
>> >    non-init_user_ns
>> >
>> > We account for 1. by only recording uevent sockets in the global uevent
>> > socket list who are owned by init_user_ns.
>> > But to account for 2. we need to filter by the user namespace who owns
>> > the socket in mc_list. So in addition to that we also need to slightly
>> > change the filter logic in kobj_bcast_filter() I think:
>> >
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index 22a2c1a98b8f..064d7d29ace5 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >  		return sock_ns != ns;
>> >  	}
>> >  
>> > -	return 0;
>> > + 	/* Check if socket was opened from non-initial user namespace. */
>> > + 	return sk_user_ns(dsk) != &init_user_ns;
>> >  }
>> >  #endif
>> >  
>> >
>> > But correct me if I'm wrong.
>> 
>> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> permissions and an explicit opt-in to receiving packets from multiple
>> network namespaces.
>
> I don't think that's what I'm talking about unless that is somehow the
> default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> doing
>
> unshare -U --map-root
>
> then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> uevents. Imho, this should not be possible because I'm in a
> non-init_user_ns. But currently I'm able to - even with the patch to
> come - since the uevent socket in the kernel was created when init_net
> was created and hence is *owned* by the init_user_ns which means it is
> in the list of uevent sockets. Here's a demo of what I mean:
>
> https://asciinema.org/a/175632

Why do you care about this case?

Everyone is allowed to listen to the uevent netlink sockets with or
without user namespaces.  So there are no permission issues, and
this is not even a data information leak.

If you don't want programs in your user namespace to have access you
will be able to unshare the network namespace.

Eric

^ permalink raw reply

* [PATCH 4.4 004/190] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()
From: Greg Kroah-Hartman @ 2018-04-11 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Andrey Konovalov, Josh Poimboeuf,
	Cong Wang, David S . Miller, Dmitry Vyukov, Eric Dumazet,
	Kostya Serebryany, Linus Torvalds, Marcelo Ricardo Leitner,
	Neil Horman, Peter Zijlstra, Thomas Gleixner, Vlad Yasevich,
	linux-sctp, netdev, syzkaller, Ingo Molnar, Sasha Levin
In-Reply-To: <20180411183550.114495991@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Josh Poimboeuf <jpoimboe@redhat.com>


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/lib/csum-copy_64.S |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
 	movq  %r12, 3*8(%rsp)
 	movq  %r14, 4*8(%rsp)
 	movq  %r13, 5*8(%rsp)
-	movq  %rbp, 6*8(%rsp)
+	movq  %r15, 6*8(%rsp)
 
 	movq  %r8, (%rsp)
 	movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
 	/* main loop. clear in 64 byte blocks */
 	/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
 	/* r11:	temp3, rdx: temp4, r12 loopcnt */
-	/* r10:	temp5, rbp: temp6, r14 temp7, r13 temp8 */
+	/* r10:	temp5, r15: temp6, r14 temp7, r13 temp8 */
 	.p2align 4
 .Lloop:
 	source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
 	source
 	movq  32(%rdi), %r10
 	source
-	movq  40(%rdi), %rbp
+	movq  40(%rdi), %r15
 	source
 	movq  48(%rdi), %r14
 	source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
 	adcq  %r11, %rax
 	adcq  %rdx, %rax
 	adcq  %r10, %rax
-	adcq  %rbp, %rax
+	adcq  %r15, %rax
 	adcq  %r14, %rax
 	adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
 	dest
 	movq %r10, 32(%rsi)
 	dest
-	movq %rbp, 40(%rsi)
+	movq %r15, 40(%rsi)
 	dest
 	movq %r14, 48(%rsi)
 	dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
 	movq 3*8(%rsp), %r12
 	movq 4*8(%rsp), %r14
 	movq 5*8(%rsp), %r13
-	movq 6*8(%rsp), %rbp
+	movq 6*8(%rsp), %r15
 	addq $7*8, %rsp
 	ret
 

^ permalink raw reply

* [PATCH 4.9 005/310] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()
From: Greg Kroah-Hartman @ 2018-04-11 18:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Andrey Konovalov, Josh Poimboeuf,
	Cong Wang, David S . Miller, Dmitry Vyukov, Eric Dumazet,
	Kostya Serebryany, Linus Torvalds, Marcelo Ricardo Leitner,
	Neil Horman, Peter Zijlstra, Thomas Gleixner, Vlad Yasevich,
	linux-sctp, netdev, syzkaller, Ingo Molnar, Sasha Levin
In-Reply-To: <20180411183622.305902791@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Josh Poimboeuf <jpoimboe@redhat.com>


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/lib/csum-copy_64.S |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
 	movq  %r12, 3*8(%rsp)
 	movq  %r14, 4*8(%rsp)
 	movq  %r13, 5*8(%rsp)
-	movq  %rbp, 6*8(%rsp)
+	movq  %r15, 6*8(%rsp)
 
 	movq  %r8, (%rsp)
 	movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
 	/* main loop. clear in 64 byte blocks */
 	/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
 	/* r11:	temp3, rdx: temp4, r12 loopcnt */
-	/* r10:	temp5, rbp: temp6, r14 temp7, r13 temp8 */
+	/* r10:	temp5, r15: temp6, r14 temp7, r13 temp8 */
 	.p2align 4
 .Lloop:
 	source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
 	source
 	movq  32(%rdi), %r10
 	source
-	movq  40(%rdi), %rbp
+	movq  40(%rdi), %r15
 	source
 	movq  48(%rdi), %r14
 	source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
 	adcq  %r11, %rax
 	adcq  %rdx, %rax
 	adcq  %r10, %rax
-	adcq  %rbp, %rax
+	adcq  %r15, %rax
 	adcq  %r14, %rax
 	adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
 	dest
 	movq %r10, 32(%rsi)
 	dest
-	movq %rbp, 40(%rsi)
+	movq %r15, 40(%rsi)
 	dest
 	movq %r14, 48(%rsi)
 	dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
 	movq 3*8(%rsp), %r12
 	movq 4*8(%rsp), %r14
 	movq 5*8(%rsp), %r13
-	movq 6*8(%rsp), %rbp
+	movq 6*8(%rsp), %r15
 	addq $7*8, %rsp
 	ret
 

^ permalink raw reply

* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: Saeed Mahameed @ 2018-04-11 18:23 UTC (permalink / raw)
  To: colin.king@canonical.com, jgg@ziepe.ca
  Cc: Matan Barak, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, leon@kernel.org,
	kernel-janitors@vger.kernel.org
In-Reply-To: <20180410232719.qzaltu7hu57aqps3@ziepe.ca>

On Tue, 2018-04-10 at 17:27 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 10, 2018 at 05:22:54PM -0600, Jason Gunthorpe wrote:
> > On Mon, Apr 09, 2018 at 01:43:36PM +0100, Colin Ian King wrote:
> > > From: Colin Ian King <colin.king@canonical.com>
> > > 
> > > There are several lines where there is an extraneous space
> > > causing
> > > indentation misalignment. Remove them.
> > > 
> > > Cleans up Cocconelle warnings:
> > > 
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code
> > > aligned
> > > with following code on line 410
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code
> > > aligned
> > > with following code on line 416
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code
> > > aligned
> > > with following code on line 422
> > > 
> > > Signed-off-by: Colin Ian King <colin.king@canonical.com>
> > >  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +++++++++-----
> > > ----
> > >  1 file changed, 9 insertions(+), 9 deletions(-)
> > 
> > applied to for-next thanks
> 
> Oh wait, this is for netdev, not rdma sorry. Never mind, DaveM should
> pick it up.
> 

net-next is currently closed.

Dave, if it is ok with you I would like to apply this into mlx5-next
branch so it won't cause any issues with our next pull requests to rdma
and netdev.

> Jason

^ permalink raw reply

* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 18:01 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <7e388b10-ccea-a2b0-e776-5420c8e7f521@netronome.com>

2018-04-11 16:44 UTC+0100 ~ Quentin Monnet <quentin.monnet@netronome.com>
> 2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
>> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions:
>>>
>>> Helpers from Lawrence:
>>> - bpf_setsockopt()
>>> - bpf_getsockopt()
>>> - bpf_sock_ops_cb_flags_set()
>>>
>>> Helpers from Yonghong:
>>> - bpf_perf_event_read_value()
>>> - bpf_perf_prog_read_value()
>>>
>>> Helper from Josef:
>>> - bpf_override_return()
>>>
>>> Helper from Andrey:
>>> - bpf_bind()
>>>
>>> Cc: Lawrence Brakmo <brakmo@fb.com>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Cc: Josef Bacik <jbacik@fb.com>
>>> Cc: Andrey Ignatov <rdna@fb.com>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---

> [...]
> 
> Thanks Yonghong for the review!
> 
> I have a favor to ask of you. I got a bounce for Kaixu Xia's email
> address, and I don't know what alternative email address I could use. I
> CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
> this series), which is rather close to bpf_perf_event_read_value().
> Would you mind having a look at that one too, please? The description is
> not long.

Well I read again the description I wrote, and actually the one for
bpf_perf_evnet_read() is nearly a subset of the one for
perf_event_read_value(). So the same comments that you raised earlier
apply, there's probably nothing more to review. But if you notice that
some important info is missing for bpf_perf_event_read(), I'm interested
too!

Quentin

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox