Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Roman Gushchin @ 2020-06-20  0:51 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <4f17229e-1843-5bfc-ea2f-67ebaa9056da@huawei.com>

On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd invovled.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, ut seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> > 
> > Yeah, I will update the Fixes tag and send V2.
> > 
> 
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
> 
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> 
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> that needs to be fixed.

Hm, does it mean that the problem always happens with the root cgroup?

From the stacktrace provided by Peter it looks like that the problem
is bpf-related, but the original patch says nothing about it.

So from the test above it sounds like the problem is that we're trying
to release root's cgroup_bpf, which is a bad idea, I totally agree.
Is this the problem? If so, we might wanna fix it in a different way,
just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
like in cgroup_put(). It feels more reliable to me.

Thanks!


^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zefan Li @ 2020-06-20  0:45 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo
In-Reply-To: <CAM_iQpVKqFi00ohqPARxaDw2UN1m6CtjqsmBAP-pcK0GT2p_fQ@mail.gmail.com>

On 2020/6/20 3:51, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd invovled.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> But skcd->val can be a pointer to a non-root cgroup:

It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
when cgroup_sk_alloc is disabled.

> 
> static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> {
> #if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
>         unsigned long v;
> 
>         /*
>          * @skcd->val is 64bit but the following is safe on 32bit too as we
>          * just need the lower ulong to be written and read atomically.
>          */
>         v = READ_ONCE(skcd->val);
> 
>         if (v & 1)
>                 return &cgrp_dfl_root.cgrp;
> 
>         return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> #else
>         return (struct cgroup *)(unsigned long)skcd->val;
> #endif
> }
> .
> 


^ permalink raw reply

* Re: [PATCH v2 net-next] ipv6: icmp6: avoid indirect call for icmpv6_send()
From: kernel test robot @ 2020-06-20  0:43 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: kbuild-all, clang-built-linux, netdev, Eric Dumazet,
	Jakub Kicinski
In-Reply-To: <20200619190259.170189-1-edumazet@google.com>

[-- Attachment #1: Type: text/plain, Size: 7810 bytes --]

Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/ipv6-icmp6-avoid-indirect-call-for-icmpv6_send/20200620-030444
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 0fb9fbab405351aa0c18973881c4103e4da886b6
config: riscv-randconfig-r033-20200619 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 487ca07fcc75d52755c9fe2ee05bcb3b6eeeec44)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> net/ipv6/icmp.c:442:6: warning: no previous prototype for function 'icmp6_send' [-Wmissing-prototypes]
void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
^
net/ipv6/icmp.c:442:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
^
static
1 warning generated.

vim +/icmp6_send +442 net/ipv6/icmp.c

   438	
   439	/*
   440	 *	Send an ICMP message in response to a packet in error
   441	 */
 > 442	void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
   443			const struct in6_addr *force_saddr)
   444	{
   445		struct inet6_dev *idev = NULL;
   446		struct ipv6hdr *hdr = ipv6_hdr(skb);
   447		struct sock *sk;
   448		struct net *net;
   449		struct ipv6_pinfo *np;
   450		const struct in6_addr *saddr = NULL;
   451		struct dst_entry *dst;
   452		struct icmp6hdr tmp_hdr;
   453		struct flowi6 fl6;
   454		struct icmpv6_msg msg;
   455		struct ipcm6_cookie ipc6;
   456		int iif = 0;
   457		int addr_type = 0;
   458		int len;
   459		u32 mark;
   460	
   461		if ((u8 *)hdr < skb->head ||
   462		    (skb_network_header(skb) + sizeof(*hdr)) > skb_tail_pointer(skb))
   463			return;
   464	
   465		if (!skb->dev)
   466			return;
   467		net = dev_net(skb->dev);
   468		mark = IP6_REPLY_MARK(net, skb->mark);
   469		/*
   470		 *	Make sure we respect the rules
   471		 *	i.e. RFC 1885 2.4(e)
   472		 *	Rule (e.1) is enforced by not using icmp6_send
   473		 *	in any code that processes icmp errors.
   474		 */
   475		addr_type = ipv6_addr_type(&hdr->daddr);
   476	
   477		if (ipv6_chk_addr(net, &hdr->daddr, skb->dev, 0) ||
   478		    ipv6_chk_acast_addr_src(net, skb->dev, &hdr->daddr))
   479			saddr = &hdr->daddr;
   480	
   481		/*
   482		 *	Dest addr check
   483		 */
   484	
   485		if (addr_type & IPV6_ADDR_MULTICAST || skb->pkt_type != PACKET_HOST) {
   486			if (type != ICMPV6_PKT_TOOBIG &&
   487			    !(type == ICMPV6_PARAMPROB &&
   488			      code == ICMPV6_UNK_OPTION &&
   489			      (opt_unrec(skb, info))))
   490				return;
   491	
   492			saddr = NULL;
   493		}
   494	
   495		addr_type = ipv6_addr_type(&hdr->saddr);
   496	
   497		/*
   498		 *	Source addr check
   499		 */
   500	
   501		if (__ipv6_addr_needs_scope_id(addr_type)) {
   502			iif = icmp6_iif(skb);
   503		} else {
   504			dst = skb_dst(skb);
   505			iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
   506		}
   507	
   508		/*
   509		 *	Must not send error if the source does not uniquely
   510		 *	identify a single node (RFC2463 Section 2.4).
   511		 *	We check unspecified / multicast addresses here,
   512		 *	and anycast addresses will be checked later.
   513		 */
   514		if ((addr_type == IPV6_ADDR_ANY) || (addr_type & IPV6_ADDR_MULTICAST)) {
   515			net_dbg_ratelimited("icmp6_send: addr_any/mcast source [%pI6c > %pI6c]\n",
   516					    &hdr->saddr, &hdr->daddr);
   517			return;
   518		}
   519	
   520		/*
   521		 *	Never answer to a ICMP packet.
   522		 */
   523		if (is_ineligible(skb)) {
   524			net_dbg_ratelimited("icmp6_send: no reply to icmp error [%pI6c > %pI6c]\n",
   525					    &hdr->saddr, &hdr->daddr);
   526			return;
   527		}
   528	
   529		/* Needed by both icmp_global_allow and icmpv6_xmit_lock */
   530		local_bh_disable();
   531	
   532		/* Check global sysctl_icmp_msgs_per_sec ratelimit */
   533		if (!(skb->dev->flags & IFF_LOOPBACK) && !icmpv6_global_allow(net, type))
   534			goto out_bh_enable;
   535	
   536		mip6_addr_swap(skb);
   537	
   538		sk = icmpv6_xmit_lock(net);
   539		if (!sk)
   540			goto out_bh_enable;
   541	
   542		memset(&fl6, 0, sizeof(fl6));
   543		fl6.flowi6_proto = IPPROTO_ICMPV6;
   544		fl6.daddr = hdr->saddr;
   545		if (force_saddr)
   546			saddr = force_saddr;
   547		if (saddr) {
   548			fl6.saddr = *saddr;
   549		} else if (!icmpv6_rt_has_prefsrc(sk, type, &fl6)) {
   550			/* select a more meaningful saddr from input if */
   551			struct net_device *in_netdev;
   552	
   553			in_netdev = dev_get_by_index(net, IP6CB(skb)->iif);
   554			if (in_netdev) {
   555				ipv6_dev_get_saddr(net, in_netdev, &fl6.daddr,
   556						   inet6_sk(sk)->srcprefs,
   557						   &fl6.saddr);
   558				dev_put(in_netdev);
   559			}
   560		}
   561		fl6.flowi6_mark = mark;
   562		fl6.flowi6_oif = iif;
   563		fl6.fl6_icmp_type = type;
   564		fl6.fl6_icmp_code = code;
   565		fl6.flowi6_uid = sock_net_uid(net, NULL);
   566		fl6.mp_hash = rt6_multipath_hash(net, &fl6, skb, NULL);
   567		security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
   568	
   569		sk->sk_mark = mark;
   570		np = inet6_sk(sk);
   571	
   572		if (!icmpv6_xrlim_allow(sk, type, &fl6))
   573			goto out;
   574	
   575		tmp_hdr.icmp6_type = type;
   576		tmp_hdr.icmp6_code = code;
   577		tmp_hdr.icmp6_cksum = 0;
   578		tmp_hdr.icmp6_pointer = htonl(info);
   579	
   580		if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
   581			fl6.flowi6_oif = np->mcast_oif;
   582		else if (!fl6.flowi6_oif)
   583			fl6.flowi6_oif = np->ucast_oif;
   584	
   585		ipcm6_init_sk(&ipc6, np);
   586		fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
   587	
   588		dst = icmpv6_route_lookup(net, skb, sk, &fl6);
   589		if (IS_ERR(dst))
   590			goto out;
   591	
   592		ipc6.hlimit = ip6_sk_dst_hoplimit(np, &fl6, dst);
   593	
   594		msg.skb = skb;
   595		msg.offset = skb_network_offset(skb);
   596		msg.type = type;
   597	
   598		len = skb->len - msg.offset;
   599		len = min_t(unsigned int, len, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr));
   600		if (len < 0) {
   601			net_dbg_ratelimited("icmp: len problem [%pI6c > %pI6c]\n",
   602					    &hdr->saddr, &hdr->daddr);
   603			goto out_dst_release;
   604		}
   605	
   606		rcu_read_lock();
   607		idev = __in6_dev_get(skb->dev);
   608	
   609		if (ip6_append_data(sk, icmpv6_getfrag, &msg,
   610				    len + sizeof(struct icmp6hdr),
   611				    sizeof(struct icmp6hdr),
   612				    &ipc6, &fl6, (struct rt6_info *)dst,
   613				    MSG_DONTWAIT)) {
   614			ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
   615			ip6_flush_pending_frames(sk);
   616		} else {
   617			icmpv6_push_pending_frames(sk, &fl6, &tmp_hdr,
   618						   len + sizeof(struct icmp6hdr));
   619		}
   620		rcu_read_unlock();
   621	out_dst_release:
   622		dst_release(dst);
   623	out:
   624		icmpv6_xmit_unlock(sk);
   625	out_bh_enable:
   626		local_bh_enable();
   627	}
   628	EXPORT_SYMBOL(icmp6_send);
   629	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31496 bytes --]

^ permalink raw reply

* [PATCH] Bluetooth: Add hci_dev_lock to get/set device flags
From: Abhishek Pandit-Subedi @ 2020-06-20  0:10 UTC (permalink / raw)
  To: marcel, linux-bluetooth, mcchou
  Cc: chromeos-bluetooth-upstreaming, Abhishek Pandit-Subedi,
	David S. Miller, Johan Hedberg, netdev, linux-kernel,
	Jakub Kicinski

Adding hci_dev_lock since hci_conn_params_(lookup|add) require this
lock.

Suggested-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
---

 net/bluetooth/mgmt.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
index 2a732cab1dc99d..5e9b9728eeac03 100644
--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -3895,6 +3895,8 @@ static int get_device_flags(struct sock *sk, struct hci_dev *hdev, void *data,
 	bt_dev_dbg(hdev, "Get device flags %pMR (type 0x%x)\n",
 		   &cp->addr.bdaddr, cp->addr.type);
 
+	hci_dev_lock(hdev);
+
 	if (cp->addr.type == BDADDR_BREDR) {
 		br_params = hci_bdaddr_list_lookup_with_flags(&hdev->whitelist,
 							      &cp->addr.bdaddr,
@@ -3921,6 +3923,8 @@ static int get_device_flags(struct sock *sk, struct hci_dev *hdev, void *data,
 	status = MGMT_STATUS_SUCCESS;
 
 done:
+	hci_dev_unlock(hdev);
+
 	return mgmt_cmd_complete(sk, hdev->id, MGMT_OP_GET_DEVICE_FLAGS, status,
 				&rp, sizeof(rp));
 }
@@ -3959,6 +3963,8 @@ static int set_device_flags(struct sock *sk, struct hci_dev *hdev, void *data,
 		goto done;
 	}
 
+	hci_dev_lock(hdev);
+
 	if (cp->addr.type == BDADDR_BREDR) {
 		br_params = hci_bdaddr_list_lookup_with_flags(&hdev->whitelist,
 							      &cp->addr.bdaddr,
@@ -3985,6 +3991,8 @@ static int set_device_flags(struct sock *sk, struct hci_dev *hdev, void *data,
 	}
 
 done:
+	hci_dev_unlock(hdev);
+
 	if (status == MGMT_STATUS_SUCCESS)
 		device_flags_changed(sk, hdev, &cp->addr.bdaddr, cp->addr.type,
 				     supported_flags, current_flags);
-- 
2.27.0.111.gc72c7da667-goog


^ permalink raw reply related

* Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]
From: Anchal Agarwal @ 2020-06-19 23:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Boris Ostrovsky, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, hpa@zytor.com, x86@kernel.org, jgross@suse.com,
	linux-pm@vger.kernel.org, linux-mm@kvack.org, Kamata, Munehisa,
	sstabellini@kernel.org, konrad.wilk@oracle.com, axboe@kernel.dk,
	davem@davemloft.net, rjw@rjwysocki.net, len.brown@intel.com,
	pavel@ucw.cz, peterz@infradead.org, Valentin, Eduardo,
	Singh, Balbir, xen-devel@lists.xenproject.org,
	vkuznets@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Woodhouse, David,
	benh@kernel.crashing.org
In-Reply-To: <20200617083528.GW735@Air-de-Roger>

On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Jun 16, 2020 at 09:49:25PM +0000, Anchal Agarwal wrote:
> > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > On Wed, Jun 03, 2020 at 11:33:52PM +0000, Agarwal, Anchal wrote:
> > > >  CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > >     > +             xenbus_dev_error(dev, err, "Freezing timed out;"
> > > >     > +                              "the device may become inconsistent state");
> > > >
> > > >     Leaving the device in this state is quite bad, as it's in a closed
> > > >     state and with the queues frozen. You should make an attempt to
> > > >     restore things to a working state.
> > > >
> > > > You mean if backend closed after timeout? Is there a way to know that? I understand it's not good to
> > > > leave it in this state however, I am still trying to find if there is a good way to know if backend is still connected after timeout.
> > > > Hence the message " the device may become inconsistent state".  I didn't see a timeout not even once on my end so that's why
> > > > I may be looking for an alternate perspective here. may be need to thaw everything back intentionally is one thing I could think of.
> > >
> > > You can manually force this state, and then check that it will behave
> > > correctly. I would expect that on a failure to disconnect from the
> > > backend you should switch the frontend to the 'Init' state in order to
> > > try to reconnect to the backend when possible.
> > >
> > From what I understand forcing manually is, failing the freeze without
> > disconnect and try to revive the connection by unfreezing the
> > queues->reconnecting to backend [which never got diconnected]. May be even
> > tearing down things manually because I am not sure what state will frontend
> > see if backend fails to to disconnect at any point in time. I assumed connected.
> > Then again if its "CONNECTED" I may not need to tear down everything and start
> > from Initialising state because that may not work.
> >
> > So I am not so sure about backend's state so much, lets say if  xen_blkif_disconnect fail,
> > I don't see it getting handled in the backend then what will be backend's state?
> > Will it still switch xenbus state to 'Closed'? If not what will frontend see,
> > if it tries to read backend's state through xenbus_read_driver_state ?
> >
> > So the flow be like:
> > Front end marks XenbusStateClosing
> > Backend marks its state as XenbusStateClosing
> >     Frontend marks XenbusStateClosed
> >     Backend disconnects calls xen_blkif_disconnect
> >        Backend fails to disconnect, the above function returns EBUSY
> >        What will be state of backend here?
> 
> Backend should stay in state 'Closing' then, until it can finish
> tearing down.
> 
It disconnects the ring after switching to connected state too. 
> >        Frontend did not tear down the rings if backend does not switches the
> >        state to 'Closed' in case of failure.
> >
> > If backend stays in CONNECTED state, then even if we mark it Initialised in frontend, backend
> 
> Backend will stay in state 'Closing' I think.
> 
> > won't be calling connect(). {From reading code in frontend_changed}
> > IMU, Initialising will fail since backend dev->state != XenbusStateClosed plus
> > we did not tear down anything so calling talk_to_blkback may not be needed
> >
> > Does that sound correct?
> 
> I think switching to the initial state in order to try to attempt a
> reconnection would be our best bet here.
>
It does not seems to work correctly, I get hung tasks all over and all the
requests to filesystem gets stuck. Backend does shows the state as connected
after xenbus_dev_suspend fails but I think there may be something missing.
I don't seem to get IO interrupts thereafter i.e hitting the function blkif_interrupts.
I think just marking it initialised may not be the only thing.
Here is a short description of what I am trying to do:
So, on timeout:
    Switch XenBusState to "Initialized"
    unquiesce/unfreeze the queues and return
    mark info->connected = BLKIF_STATE_CONNECTED
    return EBUSY

I even allowed blkfront_connect to switch state to "CONNECTED" rather me doing
it explicitly as mentioned above without re-allocating/re-registering the device
just to make sure bklfront_info object has all the right values.
Do you see anythign missing here?

Also, while wrapping my brain around this recovery, one of the reasons I see
backend may not disconnct is if there are inflight I/O requests. There cannot be
pending I/O on shared ring because that check is already there before we switch
bus state to Closing. Also, queues are frozen so there will be no new I/O.
The only situation I can think of is since there too much of memory state to be
written and modified  that may not get completed within the timeout provided and
disconnect may fail. In that case, the time out needs to be configurable by the 
user since the hibernation may always fail depending on infrastructure or workload
running during hibernation.

Thanks,
Anchal

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
From: Andrii Nakryiko @ 2020-06-19 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Kernel Team
In-Reply-To: <24ac4e42-5831-f698-02f4-5f63d4620f1c@iogearbox.net>

On Fri, Jun 19, 2020 at 3:21 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/19/20 8:41 PM, Andrii Nakryiko wrote:
> > On Fri, Jun 19, 2020 at 6:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 6/19/20 2:39 AM, John Fastabend wrote:
> >>> John Fastabend wrote:
> >>>> Andrii Nakryiko wrote:
> >>>>> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> >>>>> <john.fastabend@gmail.com> wrote:
> >>>
> >>> [...]
> >>>
> >>>>> That would be great. Self-tests do work, but having more testing with
> >>>>> real-world application would certainly help as well.
> >>>>
> >>>> Thanks for all the follow up.
> >>>>
> >>>> I ran the change through some CI on my side and it passed so I can
> >>>> complain about a few shifts here and there or just update my code or
> >>>> just not change the return types on my side but I'm convinced its OK
> >>>> in most cases and helps in some so...
> >>>>
> >>>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> >>>
> >>> I'll follow this up with a few more selftests to capture a couple of our
> >>> patterns. These changes are subtle and I worry a bit that additional
> >>> <<,s>> pattern could have the potential to break something.
> >>>
> >>> Another one we didn't discuss that I found in our code base is feeding
> >>> the output of a probe_* helper back into the size field (after some
> >>> alu ops) of subsequent probe_* call. Unfortunately, the tests I ran
> >>> today didn't cover that case.
> >>>
> >>> I'll put it on the list tomorrow and encode these in selftests. I'll
> >>> let the mainainers decide if they want to wait for those or not.
> >>
> >> Given potential fragility on verifier side, my preference would be that we
> >> have the known variations all covered in selftests before moving forward in
> >> order to make sure they don't break in any way. Back in [0] I've seen mostly
> >> similar cases in the way John mentioned in other projects, iirc, sysdig was
> >> another one. If both of you could hack up the remaining cases we need to
> >> cover and then submit a combined series, that would be great. I don't think
> >> we need to rush this optimization w/o necessary selftests.
> >
> > There is no rush, but there is also no reason to delay it. I'd rather
> > land it early in the libbpf release cycle and let people try it in
> > their prod environments, for those concerned about such code patterns.
>
> Andrii, define 'delay'. John mentioned above to put together few more
> selftests today so that there is better coverage at least, why is that
> an 'issue'? I'm not sure how you read 'late in release cycle' out of it,
> it's still as early. The unsigned optimization for len <= MAX_LEN is
> reasonable and makes sense, but it's still one [specific] variant only.

I'm totally fine waiting for John's tests, but I read your reply as a
request to go dig up some more examples from sysdig and other
projects, which I don't think I can commit to. So if it's just about
waiting for John's examples, that's fine and sorry for
misunderstanding.

>
> > I don't have a list of all the patterns that we might need to test.
> > Going through all open-source BPF source code to identify possible
> > patterns and then coding them up in minimal selftests is a bit too
> > much for me, honestly.
>
> I think we're probably talking past each other. John wrote above:

Yep, sorry, I assumed more general context, not specifically John's reply.

>
>  >>> I'll follow this up with a few more selftests to capture a couple of our
>  >>> patterns. These changes are subtle and I worry a bit that additional
>  >>> <<,s>> pattern could have the potential to break something.
>
> So submitting this as a full series together makes absolutely sense to me,
> so there's maybe not perfect but certainly more confidence that also other
> patterns where the shifts optimized out in one case are then appearing in
> another are tested on a best effort and run our kselftest suite.
>
> Thanks,
> Daniel

^ permalink raw reply

* [PATCH bpf] libbpf: fix CO-RE relocs against .text section
From: Andrii Nakryiko @ 2020-06-19 23:04 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Yonghong Song

bpf_object__find_program_by_title(), used by CO-RE relocation code, doesn't
return .text "BPF program", if it is a function storage for sub-programs.
Because of that, any CO-RE relocation in helper non-inlined functions will
fail. Fix this by searching for .text-corresponding BPF program manually.

Adjust one of bpf_iter selftest to exhibit this pattern.

Reported-by: Yonghong Song <yhs@fb.com>
Fixes: ddc7c3042614 ("libbpf: implement BPF CO-RE offset relocation algorithm")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/libbpf.c                               | 8 +++++++-
 tools/testing/selftests/bpf/progs/bpf_iter_netlink.c | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 477c679ed945..f17151d866e6 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4818,7 +4818,13 @@ bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path)
 			err = -EINVAL;
 			goto out;
 		}
-		prog = bpf_object__find_program_by_title(obj, sec_name);
+		prog = NULL;
+		for (i = 0; i < obj->nr_programs; i++) {
+			if (!strcmp(obj->programs[i].section_name, sec_name)) {
+				prog = &obj->programs[i];
+				break;
+			}
+		}
 		if (!prog) {
 			pr_warn("failed to find program '%s' for CO-RE offset relocation\n",
 				sec_name);
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
index e7b8753eac0b..75ecf956a2df 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
@@ -25,7 +25,7 @@ struct bpf_iter__netlink {
 	struct netlink_sock *sk;
 } __attribute__((preserve_access_index));
 
-static inline struct inode *SOCK_INODE(struct socket *socket)
+static __attribute__((noinline)) struct inode *SOCK_INODE(struct socket *socket)
 {
 	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
 }
-- 
2.24.1


^ permalink raw reply related

* [PATCH v2 bpf-next 5/8] bpf: cpumap: implement XDP_REDIRECT for eBPF programs attached to map entries
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

Introduce XDP_REDIRECT support for eBPF programs attached to cpumap
entries.
This patch has been tested on Marvell ESPRESSObin using a modified
version of xdp_redirect_cpu sample in order to attach a XDP program
to CPUMAP entries to perform a redirect on the mvneta interface.
In particular the following scenario has been tested:

rq (cpu0) --> mvneta - XDP_REDIRECT (cpu0) --> CPUMAP - XDP_REDIRECT (cpu1) --> mvneta

$./xdp_redirect_cpu -p xdp_cpu_map0 -d eth0 -c 1 -e xdp_redirect \
	-f xdp_redirect_kern.o -m tx_port -r eth0

tx: 285.2 Kpps rx: 285.2 Kpps

Attacching a simple XDP program on eth0 to perform XDP_TX gives
comparable results:

tx: 288.4 Kpps rx: 288.4 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h          |  1 +
 include/trace/events/xdp.h |  6 ++++--
 kernel/bpf/cpumap.c        | 17 +++++++++++++++--
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 441716a0c0a4..71a4e30003e5 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -99,6 +99,7 @@ struct xdp_frame {
 };
 
 struct xdp_cpumap_stats {
+	unsigned int redirect;
 	unsigned int pass;
 	unsigned int drop;
 };
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index e2c99f5bee39..cd24e8a59529 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -190,6 +190,7 @@ TRACE_EVENT(xdp_cpumap_kthread,
 		__field(int, sched)
 		__field(unsigned int, xdp_pass)
 		__field(unsigned int, xdp_drop)
+		__field(unsigned int, xdp_redirect)
 	),
 
 	TP_fast_assign(
@@ -201,18 +202,19 @@ TRACE_EVENT(xdp_cpumap_kthread,
 		__entry->sched	= sched;
 		__entry->xdp_pass	= xdp_stats->pass;
 		__entry->xdp_drop	= xdp_stats->drop;
+		__entry->xdp_redirect	= xdp_stats->redirect;
 	),
 
 	TP_printk("kthread"
 		  " cpu=%d map_id=%d action=%s"
 		  " processed=%u drops=%u"
 		  " sched=%d"
-		  " xdp_pass=%u xdp_drop=%u",
+		  " xdp_pass=%u xdp_drop=%u xdp_redirect=%u",
 		  __entry->cpu, __entry->map_id,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->processed, __entry->drops,
 		  __entry->sched,
-		  __entry->xdp_pass, __entry->xdp_drop)
+		  __entry->xdp_pass, __entry->xdp_drop, __entry->xdp_redirect)
 );
 
 TRACE_EVENT(xdp_cpumap_enqueue,
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index dedf33d8c8d0..e0160e24be81 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,7 +240,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
-	rcu_read_lock();
+	rcu_read_lock_bh();
 
 	prog = READ_ONCE(rcpu->prog);
 	for (i = 0; i < n; i++) {
@@ -266,6 +266,16 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 				stats->pass++;
 			}
 			break;
+		case XDP_REDIRECT:
+			err = xdp_do_redirect(xdpf->dev_rx, &xdp,
+					      prog);
+			if (unlikely(err)) {
+				xdp_return_frame(xdpf);
+				stats->drop++;
+			} else {
+				stats->redirect++;
+			}
+			break;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 			/* fallthrough */
@@ -276,7 +286,10 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 		}
 	}
 
-	rcu_read_unlock();
+	if (stats->redirect)
+		xdp_do_flush_map();
+
+	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 	xdp_clear_return_frame_no_direct();
 
 	return nframes;
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 8/8] selftest: add tests for XDP programs in CPUMAP entries
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

Similar to what have been done for DEVMAP, introduce tests to verify
ability to add a XDP program to an entry in a CPUMAP.
Verify CPUMAP programs can not be attached to devices as a normal
XDP program, and only programs with BPF_XDP_CPUMAP attach type can
be loaded in a CPUMAP.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 .../bpf/prog_tests/xdp_cpumap_attach.c        | 70 +++++++++++++++++++
 .../bpf/progs/test_xdp_with_cpumap_helpers.c  | 38 ++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_cpumap_attach.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_with_cpumap_helpers.c

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_cpumap_attach.c b/tools/testing/selftests/bpf/prog_tests/xdp_cpumap_attach.c
new file mode 100644
index 000000000000..2baa41689f40
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_cpumap_attach.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <uapi/linux/bpf.h>
+#include <linux/if_link.h>
+#include <test_progs.h>
+
+#include "test_xdp_with_cpumap_helpers.skel.h"
+
+#define IFINDEX_LO	1
+
+void test_xdp_with_cpumap_helpers(void)
+{
+	struct test_xdp_with_cpumap_helpers *skel;
+	struct bpf_prog_info info = {};
+	struct bpf_cpumap_val val = {
+		.qsize = 192,
+	};
+	__u32 duration = 0, idx = 0;
+	__u32 len = sizeof(info);
+	int err, prog_fd, map_fd;
+
+	skel = test_xdp_with_cpumap_helpers__open_and_load();
+	if (CHECK_FAIL(!skel)) {
+		perror("test_xdp_with_cpumap_helpers__open_and_load");
+		return;
+	}
+
+	/* can not attach program with cpumaps that allow programs
+	 * as xdp generic
+	 */
+	prog_fd = bpf_program__fd(skel->progs.xdp_redir_prog);
+	err = bpf_set_link_xdp_fd(IFINDEX_LO, prog_fd, XDP_FLAGS_SKB_MODE);
+	CHECK(err == 0, "Generic attach of program with 8-byte CPUMAP",
+	      "should have failed\n");
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_dummy_cm);
+	map_fd = bpf_map__fd(skel->maps.cpu_map);
+	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
+	if (CHECK_FAIL(err))
+		goto out_close;
+
+	val.bpf_prog.fd = prog_fd;
+	err = bpf_map_update_elem(map_fd, &idx, &val, 0);
+	CHECK(err, "Add program to cpumap entry", "err %d errno %d\n",
+	      err, errno);
+
+	err = bpf_map_lookup_elem(map_fd, &idx, &val);
+	CHECK(err, "Read cpumap entry", "err %d errno %d\n", err, errno);
+	CHECK(info.id != val.bpf_prog.id, "Expected program id in cpumap entry",
+	      "expected %u read %u\n", info.id, val.bpf_prog.id);
+
+	/* can not attach BPF_XDP_CPUMAP program to a device */
+	err = bpf_set_link_xdp_fd(IFINDEX_LO, prog_fd, XDP_FLAGS_SKB_MODE);
+	CHECK(err == 0, "Attach of BPF_XDP_CPUMAP program",
+	      "should have failed\n");
+
+	val.qsize = 192;
+	val.bpf_prog.fd = bpf_program__fd(skel->progs.xdp_dummy_prog);
+	err = bpf_map_update_elem(map_fd, &idx, &val, 0);
+	CHECK(err == 0, "Add non-BPF_XDP_CPUMAP program to cpumap entry",
+	      "should have failed\n");
+
+out_close:
+	test_xdp_with_cpumap_helpers__destroy(skel);
+}
+
+void test_xdp_cpumap_attach(void)
+{
+	if (test__start_subtest("CPUMAP with programs in entries"))
+		test_xdp_with_cpumap_helpers();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_with_cpumap_helpers.c b/tools/testing/selftests/bpf/progs/test_xdp_with_cpumap_helpers.c
new file mode 100644
index 000000000000..acbbc62efa55
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_with_cpumap_helpers.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CPUMAP);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_cpumap_val));
+	__uint(max_entries, 4);
+} cpu_map SEC(".maps");
+
+SEC("xdp_redir")
+int xdp_redir_prog(struct xdp_md *ctx)
+{
+	return bpf_redirect_map(&cpu_map, 1, 0);
+}
+
+SEC("xdp_dummy")
+int xdp_dummy_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+SEC("xdp_cpumap")
+int xdp_dummy_cm(struct xdp_md *ctx)
+{
+	char fmt[] = "devmap redirect: dev %u len %u\n";
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	unsigned int len = data_end - data;
+
+	bpf_trace_printk(fmt, sizeof(fmt), ctx->ingress_ifindex, len);
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 7/8] samples/bpf: xdp_redirect_cpu: load a eBPF program on cpumap
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

Extend xdp_redirect_cpu_{usr,kern}.c adding the possibility to load
a XDP program on cpumap entries. The following options have been added:
- mprog-name: cpumap entry program name
- mprog-filename: cpumap entry program filename
- redirect-device: output interface if the cpumap program performs a
  XDP_REDIRECT to an egress interface
- redirect-map: bpf map used to perform XDP_REDIRECT to an egress
  interface
- mprog-disable: disable loading XDP program on cpumap entries

Add xdp_pass, xdp_drop, xdp_redirect stats accounting

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 samples/bpf/xdp_redirect_cpu_kern.c |  25 ++--
 samples/bpf/xdp_redirect_cpu_user.c | 174 +++++++++++++++++++++++++---
 2 files changed, 177 insertions(+), 22 deletions(-)

diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
index 2baf8db1f7e7..8255025dea97 100644
--- a/samples/bpf/xdp_redirect_cpu_kern.c
+++ b/samples/bpf/xdp_redirect_cpu_kern.c
@@ -21,7 +21,7 @@
 struct {
 	__uint(type, BPF_MAP_TYPE_CPUMAP);
 	__uint(key_size, sizeof(u32));
-	__uint(value_size, sizeof(u32));
+	__uint(value_size, sizeof(struct bpf_cpumap_val));
 	__uint(max_entries, MAX_CPUS);
 } cpu_map SEC(".maps");
 
@@ -30,6 +30,9 @@ struct datarec {
 	__u64 processed;
 	__u64 dropped;
 	__u64 issue;
+	__u64 xdp_pass;
+	__u64 xdp_drop;
+	__u64 xdp_redirect;
 };
 
 /* Count RX packets, as XDP bpf_prog doesn't get direct TX-success
@@ -692,13 +695,16 @@ int trace_xdp_cpumap_enqueue(struct cpumap_enqueue_ctx *ctx)
  * Code in:         kernel/include/trace/events/xdp.h
  */
 struct cpumap_kthread_ctx {
-	u64 __pad;		// First 8 bytes are not accessible by bpf code
-	int map_id;		//	offset:8;  size:4; signed:1;
-	u32 act;		//	offset:12; size:4; signed:0;
-	int cpu;		//	offset:16; size:4; signed:1;
-	unsigned int drops;	//	offset:20; size:4; signed:0;
-	unsigned int processed;	//	offset:24; size:4; signed:0;
-	int sched;		//	offset:28; size:4; signed:1;
+	u64 __pad;			// First 8 bytes are not accessible
+	int map_id;			//	offset:8;  size:4; signed:1;
+	u32 act;			//	offset:12; size:4; signed:0;
+	int cpu;			//	offset:16; size:4; signed:1;
+	unsigned int drops;		//	offset:20; size:4; signed:0;
+	unsigned int processed;		//	offset:24; size:4; signed:0;
+	int sched;			//	offset:28; size:4; signed:1;
+	unsigned int xdp_pass;		//	offset:32; size:4; signed:0;
+	unsigned int xdp_drop;		//	offset:36; size:4; signed:0;
+	unsigned int xdp_redirect;	//	offset:40; size:4; signed:0;
 };
 
 SEC("tracepoint/xdp/xdp_cpumap_kthread")
@@ -712,6 +718,9 @@ int trace_xdp_cpumap_kthread(struct cpumap_kthread_ctx *ctx)
 		return 0;
 	rec->processed += ctx->processed;
 	rec->dropped   += ctx->drops;
+	rec->xdp_pass  += ctx->xdp_pass;
+	rec->xdp_drop  += ctx->xdp_drop;
+	rec->xdp_redirect  += ctx->xdp_redirect;
 
 	/* Count times kthread yielded CPU via schedule call */
 	if (ctx->sched)
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index 1a054737c35a..4b1264ca7ab7 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -70,6 +70,11 @@ static const struct option long_options[] = {
 	{"stress-mode", no_argument,		NULL, 'x' },
 	{"no-separators", no_argument,		NULL, 'z' },
 	{"force",	no_argument,		NULL, 'F' },
+	{"mprog-disable", no_argument,		NULL, 'n' },
+	{"mprog-name",	required_argument,	NULL, 'e' },
+	{"mprog-filename", required_argument,	NULL, 'f' },
+	{"redirect-device", required_argument,	NULL, 'r' },
+	{"redirect-map", required_argument,	NULL, 'm' },
 	{0, 0, NULL,  0 }
 };
 
@@ -156,6 +161,9 @@ struct datarec {
 	__u64 processed;
 	__u64 dropped;
 	__u64 issue;
+	__u64 xdp_pass;
+	__u64 xdp_drop;
+	__u64 xdp_redirect;
 };
 struct record {
 	__u64 timestamp;
@@ -175,6 +183,9 @@ static bool map_collect_percpu(int fd, __u32 key, struct record *rec)
 	/* For percpu maps, userspace gets a value per possible CPU */
 	unsigned int nr_cpus = bpf_num_possible_cpus();
 	struct datarec values[nr_cpus];
+	__u64 sum_xdp_redirect = 0;
+	__u64 sum_xdp_pass = 0;
+	__u64 sum_xdp_drop = 0;
 	__u64 sum_processed = 0;
 	__u64 sum_dropped = 0;
 	__u64 sum_issue = 0;
@@ -196,10 +207,19 @@ static bool map_collect_percpu(int fd, __u32 key, struct record *rec)
 		sum_dropped        += values[i].dropped;
 		rec->cpu[i].issue = values[i].issue;
 		sum_issue        += values[i].issue;
+		rec->cpu[i].xdp_pass = values[i].xdp_pass;
+		sum_xdp_pass += values[i].xdp_pass;
+		rec->cpu[i].xdp_drop = values[i].xdp_drop;
+		sum_xdp_drop += values[i].xdp_drop;
+		rec->cpu[i].xdp_redirect = values[i].xdp_redirect;
+		sum_xdp_redirect += values[i].xdp_redirect;
 	}
 	rec->total.processed = sum_processed;
 	rec->total.dropped   = sum_dropped;
 	rec->total.issue     = sum_issue;
+	rec->total.xdp_pass  = sum_xdp_pass;
+	rec->total.xdp_drop  = sum_xdp_drop;
+	rec->total.xdp_redirect = sum_xdp_redirect;
 	return true;
 }
 
@@ -303,17 +323,33 @@ static __u64 calc_errs_pps(struct datarec *r,
 	return pps;
 }
 
+static void calc_xdp_pps(struct datarec *r, struct datarec *p,
+			 double *xdp_pass, double *xdp_drop,
+			 double *xdp_redirect, double period_)
+{
+	*xdp_pass = 0, *xdp_drop = 0, *xdp_redirect = 0;
+	if (period_ > 0) {
+		*xdp_redirect = (r->xdp_redirect - p->xdp_redirect) / period_;
+		*xdp_pass = (r->xdp_pass - p->xdp_pass) / period_;
+		*xdp_drop = (r->xdp_drop - p->xdp_drop) / period_;
+	}
+}
+
 static void stats_print(struct stats_record *stats_rec,
 			struct stats_record *stats_prev,
-			char *prog_name)
+			char *prog_name, char *mprog_name, int mprog_fd)
 {
 	unsigned int nr_cpus = bpf_num_possible_cpus();
 	double pps = 0, drop = 0, err = 0;
+	bool mprog_enabled = false;
 	struct record *rec, *prev;
 	int to_cpu;
 	double t;
 	int i;
 
+	if (mprog_fd > 0)
+		mprog_enabled = true;
+
 	/* Header */
 	printf("Running XDP/eBPF prog_name:%s\n", prog_name);
 	printf("%-15s %-7s %-14s %-11s %-9s\n",
@@ -458,6 +494,33 @@ static void stats_print(struct stats_record *stats_rec,
 		printf(fm2_err, "xdp_exception", "total", pps, drop);
 	}
 
+	/* CPUMAP attached XDP program that runs on remote/destination CPU */
+	if (mprog_enabled) {
+		char *fmt_k = "%-15s %-7d %'-14.0f %'-11.0f %'-10.0f\n";
+		char *fm2_k = "%-15s %-7s %'-14.0f %'-11.0f %'-10.0f\n";
+		double xdp_pass, xdp_drop, xdp_redirect;
+
+		printf("\n2nd remote XDP/eBPF prog_name: %s\n", mprog_name);
+		printf("%-15s %-7s %-14s %-11s %-9s\n",
+		       "XDP-cpumap", "CPU:to", "xdp-pass", "xdp-drop", "xdp-redir");
+
+		rec  = &stats_rec->kthread;
+		prev = &stats_prev->kthread;
+		t = calc_period(rec, prev);
+		for (i = 0; i < nr_cpus; i++) {
+			struct datarec *r = &rec->cpu[i];
+			struct datarec *p = &prev->cpu[i];
+
+			calc_xdp_pps(r, p, &xdp_pass, &xdp_drop,
+				     &xdp_redirect, t);
+			if (xdp_pass > 0 || xdp_drop > 0 || xdp_redirect > 0)
+				printf(fmt_k, "xdp-in-kthread", i, xdp_pass, xdp_drop, xdp_redirect);
+		}
+		calc_xdp_pps(&rec->total, &prev->total, &xdp_pass, &xdp_drop,
+			     &xdp_redirect, t);
+		printf(fm2_k, "xdp-in-kthread", "total", xdp_pass, xdp_drop, xdp_redirect);
+	}
+
 	printf("\n");
 	fflush(stdout);
 }
@@ -494,7 +557,7 @@ static inline void swap(struct stats_record **a, struct stats_record **b)
 	*b = tmp;
 }
 
-static int create_cpu_entry(__u32 cpu, __u32 queue_size,
+static int create_cpu_entry(__u32 cpu, struct bpf_cpumap_val *value,
 			    __u32 avail_idx, bool new)
 {
 	__u32 curr_cpus_count = 0;
@@ -504,7 +567,7 @@ static int create_cpu_entry(__u32 cpu, __u32 queue_size,
 	/* Add a CPU entry to cpumap, as this allocate a cpu entry in
 	 * the kernel for the cpu.
 	 */
-	ret = bpf_map_update_elem(cpu_map_fd, &cpu, &queue_size, 0);
+	ret = bpf_map_update_elem(cpu_map_fd, &cpu, value, 0);
 	if (ret) {
 		fprintf(stderr, "Create CPU entry failed (err:%d)\n", ret);
 		exit(EXIT_FAIL_BPF);
@@ -535,9 +598,9 @@ static int create_cpu_entry(__u32 cpu, __u32 queue_size,
 		}
 	}
 	/* map_fd[7] = cpus_iterator */
-	printf("%s CPU:%u as idx:%u queue_size:%d (total cpus_count:%u)\n",
+	printf("%s CPU:%u as idx:%u qsize:%d prog_fd: %d (cpus_count:%u)\n",
 	       new ? "Add-new":"Replace", cpu, avail_idx,
-	       queue_size, curr_cpus_count);
+	       value->qsize, value->bpf_prog.fd, curr_cpus_count);
 
 	return 0;
 }
@@ -561,21 +624,26 @@ static void mark_cpus_unavailable(void)
 }
 
 /* Stress cpumap management code by concurrently changing underlying cpumap */
-static void stress_cpumap(void)
+static void stress_cpumap(struct bpf_cpumap_val *value)
 {
 	/* Changing qsize will cause kernel to free and alloc a new
 	 * bpf_cpu_map_entry, with an associated/complicated tear-down
 	 * procedure.
 	 */
-	create_cpu_entry(1,  1024, 0, false);
-	create_cpu_entry(1,     8, 0, false);
-	create_cpu_entry(1, 16000, 0, false);
+	value->qsize = 1024;
+	create_cpu_entry(1, value, 0, false);
+	value->qsize = 8;
+	create_cpu_entry(1, value, 0, false);
+	value->qsize = 16000;
+	create_cpu_entry(1, value, 0, false);
 }
 
 static void stats_poll(int interval, bool use_separators, char *prog_name,
+		       char *mprog_name, struct bpf_cpumap_val *value,
 		       bool stress_mode)
 {
 	struct stats_record *record, *prev;
+	int mprog_fd;
 
 	record = alloc_stats_record();
 	prev   = alloc_stats_record();
@@ -587,11 +655,12 @@ static void stats_poll(int interval, bool use_separators, char *prog_name,
 
 	while (1) {
 		swap(&prev, &record);
+		mprog_fd = value->bpf_prog.fd;
 		stats_collect(record);
-		stats_print(record, prev, prog_name);
+		stats_print(record, prev, prog_name, mprog_name, mprog_fd);
 		sleep(interval);
 		if (stress_mode)
-			stress_cpumap();
+			stress_cpumap(value);
 	}
 
 	free_stats_record(record);
@@ -664,15 +733,66 @@ static int init_map_fds(struct bpf_object *obj)
 	return 0;
 }
 
+static int load_cpumap_prog(char *file_name, char *prog_name,
+			    char *redir_interface, char *redir_map)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type		= BPF_PROG_TYPE_XDP,
+		.expected_attach_type	= BPF_XDP_CPUMAP,
+		.file = file_name,
+	};
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	int fd;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &fd))
+		return -1;
+
+	if (fd < 0) {
+		fprintf(stderr, "ERR: bpf_prog_load_xattr: %s\n",
+			strerror(errno));
+		return fd;
+	}
+
+	if (redir_interface && redir_map) {
+		int err, map_fd, ifindex_out, key = 0;
+
+		map_fd = bpf_object__find_map_fd_by_name(obj, redir_map);
+		if (map_fd < 0)
+			return map_fd;
+
+		ifindex_out = if_nametoindex(redir_interface);
+		if (!ifindex_out)
+			return -1;
+
+		err = bpf_map_update_elem(map_fd, &key, &ifindex_out, 0);
+		if (err < 0)
+			return err;
+	}
+
+	prog = bpf_object__find_program_by_title(obj, prog_name);
+	if (!prog) {
+		fprintf(stderr, "bpf_object__find_program_by_title failed\n");
+		return EXIT_FAIL;
+	}
+
+	return bpf_program__fd(prog);
+}
+
 int main(int argc, char **argv)
 {
 	struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
 	char *prog_name = "xdp_cpu_map5_lb_hash_ip_pairs";
+	char *mprog_filename = "xdp_redirect_kern.o";
+	char *redir_interface = NULL, *redir_map = NULL;
+	char *mprog_name = "xdp_redirect_dummy";
+	bool mprog_disable = false;
 	struct bpf_prog_load_attr prog_load_attr = {
 		.prog_type	= BPF_PROG_TYPE_UNSPEC,
 	};
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
+	struct bpf_cpumap_val value;
 	bool use_separators = true;
 	bool stress_mode = false;
 	struct bpf_program *prog;
@@ -728,7 +848,7 @@ int main(int argc, char **argv)
 	memset(cpu, 0, n_cpus * sizeof(int));
 
 	/* Parse commands line args */
-	while ((opt = getopt_long(argc, argv, "hSd:s:p:q:c:xzF",
+	while ((opt = getopt_long(argc, argv, "hSd:s:p:q:c:xzFf:e:r:m:",
 				  long_options, &longindex)) != -1) {
 		switch (opt) {
 		case 'd':
@@ -762,6 +882,21 @@ int main(int argc, char **argv)
 			/* Selecting eBPF prog to load */
 			prog_name = optarg;
 			break;
+		case 'n':
+			mprog_disable = true;
+			break;
+		case 'f':
+			mprog_filename = optarg;
+			break;
+		case 'e':
+			mprog_name = optarg;
+			break;
+		case 'r':
+			redir_interface = optarg;
+			break;
+		case 'm':
+			redir_map = optarg;
+			break;
 		case 'c':
 			/* Add multiple CPUs */
 			add_cpu = strtoul(optarg, NULL, 0);
@@ -807,8 +942,18 @@ int main(int argc, char **argv)
 		goto out;
 	}
 
+	value.bpf_prog.fd = 0;
+	if (!mprog_disable)
+		value.bpf_prog.fd = load_cpumap_prog(mprog_filename, mprog_name,
+						     redir_interface, redir_map);
+	if (value.bpf_prog.fd < 0) {
+		err = value.bpf_prog.fd;
+		goto out;
+	}
+	value.qsize = qsize;
+
 	for (i = 0; i < added_cpus; i++)
-		create_cpu_entry(cpu[i], qsize, i, true);
+		create_cpu_entry(cpu[i], &value, i, true);
 
 	/* Remove XDP program when program is interrupted or killed */
 	signal(SIGINT, int_exit);
@@ -841,7 +986,8 @@ int main(int argc, char **argv)
 	}
 	prog_id = info.id;
 
-	stats_poll(interval, use_separators, prog_name, stress_mode);
+	stats_poll(interval, use_separators, prog_name, mprog_name,
+		   &value, stress_mode);
 out:
 	free(cpu);
 	return err;
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 6/8] libbpf: add SEC name for xdp programs attached to CPUMAP
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

As for DEVMAP, support SEC("xdp_cpumap*") as a short cut for loading
the program with type BPF_PROG_TYPE_XDP and expected attach type
BPF_XDP_CPUMAP.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 477c679ed945..0cd13cad8375 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6655,6 +6655,8 @@ static const struct bpf_sec_def section_defs[] = {
 		.attach_fn = attach_iter),
 	BPF_EAPROG_SEC("xdp_devmap",		BPF_PROG_TYPE_XDP,
 						BPF_XDP_DEVMAP),
+	BPF_EAPROG_SEC("xdp_cpumap",		BPF_PROG_TYPE_XDP,
+						BPF_XDP_CPUMAP),
 	BPF_PROG_SEC("xdp",			BPF_PROG_TYPE_XDP),
 	BPF_PROG_SEC("perf_event",		BPF_PROG_TYPE_PERF_EVENT),
 	BPF_PROG_SEC("lwt_in",			BPF_PROG_TYPE_LWT_IN),
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 4/8] bpf: cpumap: add the possibility to attach an eBPF program to cpumap
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

Introduce the capability to attach an eBPF program to cpumap entries.
The idea behind this feature is to add the possibility to define on
which CPU run the eBPF program if the underlying hw does not support
RSS. Current supported verdicts are XDP_DROP and XDP_PASS.

This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu
sample available in the kernel tree to identify possible performance
regressions. Results show there are no observable differences in
packet-per-second:

$./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1
rx: 354.8 Kpps
rx: 356.0 Kpps
rx: 356.8 Kpps
rx: 356.3 Kpps
rx: 356.6 Kpps
rx: 356.6 Kpps
rx: 356.7 Kpps
rx: 355.8 Kpps
rx: 356.8 Kpps
rx: 356.8 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/linux/bpf.h            |   6 ++
 include/net/xdp.h              |   5 ++
 include/trace/events/xdp.h     |  14 ++--
 include/uapi/linux/bpf.h       |   5 ++
 kernel/bpf/cpumap.c            | 123 +++++++++++++++++++++++++++++----
 net/core/dev.c                 |   8 +++
 tools/include/uapi/linux/bpf.h |   5 ++
 7 files changed, 149 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 07052d44bca1..3643af9d08a2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1256,6 +1256,7 @@ struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_flush(void);
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool cpu_map_prog_allowed(struct bpf_map *map);
 
 /* Return map's numa specified by userspace */
 static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
@@ -1416,6 +1417,11 @@ static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu,
 	return 0;
 }
 
+static inline bool cpu_map_prog_allowed(struct bpf_map *map)
+{
+	return false;
+}
+
 static inline struct bpf_prog *bpf_prog_get_type_path(const char *name,
 				enum bpf_prog_type type)
 {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index ab1c503808a4..441716a0c0a4 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -98,6 +98,11 @@ struct xdp_frame {
 	struct net_device *dev_rx; /* used by cpumap */
 };
 
+struct xdp_cpumap_stats {
+	unsigned int pass;
+	unsigned int drop;
+};
+
 /* Clear kernel pointers in xdp_frame */
 static inline void xdp_scrub_frame(struct xdp_frame *frame)
 {
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index b73d3e141323..e2c99f5bee39 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -177,9 +177,9 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect_map_err,
 TRACE_EVENT(xdp_cpumap_kthread,
 
 	TP_PROTO(int map_id, unsigned int processed,  unsigned int drops,
-		 int sched),
+		 int sched, struct xdp_cpumap_stats *xdp_stats),
 
-	TP_ARGS(map_id, processed, drops, sched),
+	TP_ARGS(map_id, processed, drops, sched, xdp_stats),
 
 	TP_STRUCT__entry(
 		__field(int, map_id)
@@ -188,6 +188,8 @@ TRACE_EVENT(xdp_cpumap_kthread,
 		__field(unsigned int, drops)
 		__field(unsigned int, processed)
 		__field(int, sched)
+		__field(unsigned int, xdp_pass)
+		__field(unsigned int, xdp_drop)
 	),
 
 	TP_fast_assign(
@@ -197,16 +199,20 @@ TRACE_EVENT(xdp_cpumap_kthread,
 		__entry->drops		= drops;
 		__entry->processed	= processed;
 		__entry->sched	= sched;
+		__entry->xdp_pass	= xdp_stats->pass;
+		__entry->xdp_drop	= xdp_stats->drop;
 	),
 
 	TP_printk("kthread"
 		  " cpu=%d map_id=%d action=%s"
 		  " processed=%u drops=%u"
-		  " sched=%d",
+		  " sched=%d"
+		  " xdp_pass=%u xdp_drop=%u",
 		  __entry->cpu, __entry->map_id,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->processed, __entry->drops,
-		  __entry->sched)
+		  __entry->sched,
+		  __entry->xdp_pass, __entry->xdp_drop)
 );
 
 TRACE_EVENT(xdp_cpumap_enqueue,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a45d61bc886e..dec1d5e422b2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET4_GETSOCKNAME,
 	BPF_CGROUP_INET6_GETSOCKNAME,
 	BPF_XDP_DEVMAP,
+	BPF_XDP_CPUMAP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3781,6 +3782,10 @@ struct bpf_devmap_val {
  */
 struct bpf_cpumap_val {
 	__u32 qsize;	/* queue size */
+	union {
+		int   fd;	/* prog fd on map write */
+		__u32 id;	/* prog id on map read */
+	} bpf_prog;
 };
 
 enum sk_action {
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8951f187f6cf..dedf33d8c8d0 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -67,6 +67,7 @@ struct bpf_cpu_map_entry {
 	struct rcu_head rcu;
 
 	struct bpf_cpumap_val value;
+	struct bpf_prog *prog;
 };
 
 struct bpf_cpu_map {
@@ -81,6 +82,7 @@ static int bq_flush_to_queue(struct xdp_bulk_queue *bq);
 
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
+	u32 value_size = attr->value_size;
 	struct bpf_cpu_map *cmap;
 	int err = -ENOMEM;
 	u64 cost;
@@ -91,7 +93,9 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
+	    (value_size != offsetofend(struct bpf_cpumap_val, qsize) &&
+	     value_size != offsetofend(struct bpf_cpumap_val, bpf_prog.fd)) ||
+	    attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
 	cmap = kzalloc(sizeof(*cmap), GFP_USER);
@@ -221,6 +225,63 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 	}
 }
 
+static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
+				    void **xdp_frames, int n,
+				    struct xdp_cpumap_stats *stats)
+{
+	struct xdp_rxq_info rxq;
+	struct bpf_prog *prog;
+	struct xdp_buff xdp;
+	int i, nframes = 0;
+
+	if (!rcpu->prog)
+		return n;
+
+	xdp_set_return_frame_no_direct();
+	xdp.rxq = &rxq;
+
+	rcu_read_lock();
+
+	prog = READ_ONCE(rcpu->prog);
+	for (i = 0; i < n; i++) {
+		struct xdp_frame *xdpf = xdp_frames[i];
+		u32 act;
+		int err;
+
+		rxq.dev = xdpf->dev_rx;
+		rxq.mem = xdpf->mem;
+		/* TODO: report queue_index to xdp_rxq_info */
+
+		xdp_convert_frame_to_buff(xdpf, &xdp);
+
+		act = bpf_prog_run_xdp(prog, &xdp);
+		switch (act) {
+		case XDP_PASS:
+			err = xdp_update_frame_from_buff(&xdp, xdpf);
+			if (err < 0) {
+				xdp_return_frame(xdpf);
+				stats->drop++;
+			} else {
+				xdp_frames[nframes++] = xdpf;
+				stats->pass++;
+			}
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fallthrough */
+		case XDP_DROP:
+			xdp_return_frame(xdpf);
+			stats->drop++;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+	xdp_clear_return_frame_no_direct();
+
+	return nframes;
+}
+
 #define CPUMAP_BATCH 8
 
 static int cpu_map_kthread_run(void *data)
@@ -235,11 +296,12 @@ static int cpu_map_kthread_run(void *data)
 	 * kthread_stop signal until queue is empty.
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
+		struct xdp_cpumap_stats stats = {}; /* zero stats */
+		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
 		unsigned int drops = 0, sched = 0;
-		void *frames[CPUMAP_BATCH];
+		void *xdp_frames[CPUMAP_BATCH];
 		void *skbs[CPUMAP_BATCH];
-		gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
-		int i, n, m;
+		int i, n, m, nframes;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -260,10 +322,11 @@ static int cpu_map_kthread_run(void *data)
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
+		n = ptr_ring_consume_batched(rcpu->queue, xdp_frames,
+					     CPUMAP_BATCH);
 
 		for (i = 0; i < n; i++) {
-			void *f = frames[i];
+			void *f = xdp_frames[i];
 			struct page *page = virt_to_page(f);
 
 			/* Bring struct page memory area to curr CPU. Read by
@@ -273,16 +336,20 @@ static int cpu_map_kthread_run(void *data)
 			prefetchw(page);
 		}
 
-		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
+		/* Support running another XDP prog on this CPU */
+		nframes = cpu_map_bpf_prog_run_xdp(rcpu, xdp_frames, n, &stats);
+
+		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp,
+					  nframes, skbs);
 		if (unlikely(m == 0)) {
-			for (i = 0; i < n; i++)
+			for (i = 0; i < nframes; i++)
 				skbs[i] = NULL; /* effect: xdp_return_frame */
-			drops = n;
+			drops += nframes;
 		}
 
 		local_bh_disable();
-		for (i = 0; i < n; i++) {
-			struct xdp_frame *xdpf = frames[i];
+		for (i = 0; i < nframes; i++) {
+			struct xdp_frame *xdpf = xdp_frames[i];
 			struct sk_buff *skb = skbs[i];
 			int ret;
 
@@ -298,7 +365,7 @@ static int cpu_map_kthread_run(void *data)
 				drops++;
 		}
 		/* Feedback loop via tracepoint */
-		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
+		trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched, &stats);
 
 		local_bh_enable(); /* resched point, may call do_softirq() */
 	}
@@ -308,13 +375,38 @@ static int cpu_map_kthread_run(void *data)
 	return 0;
 }
 
+bool cpu_map_prog_allowed(struct bpf_map *map)
+{
+	return map->map_type == BPF_MAP_TYPE_CPUMAP &&
+	       map->value_size != offsetofend(struct bpf_cpumap_val, qsize);
+}
+
+static int __cpu_map_load_bpf_program(struct bpf_cpu_map_entry *rcpu, int fd)
+{
+	struct bpf_prog *prog;
+
+	prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP, false);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	if (prog->expected_attach_type != BPF_XDP_CPUMAP) {
+		bpf_prog_put(prog);
+		return -EINVAL;
+	}
+
+	rcpu->value.bpf_prog.id = prog->aux->id;
+	rcpu->prog = prog;
+
+	return 0;
+}
+
 static struct bpf_cpu_map_entry *
 __cpu_map_entry_alloc(struct bpf_cpumap_val *value, u32 cpu, int map_id)
 {
+	int numa, err, i, fd = value->bpf_prog.fd;
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct bpf_cpu_map_entry *rcpu;
 	struct xdp_bulk_queue *bq;
-	int numa, err, i;
 
 	/* Have map->numa_node, but choose node of redirect target CPU */
 	numa = cpu_to_node(cpu);
@@ -356,6 +448,9 @@ __cpu_map_entry_alloc(struct bpf_cpumap_val *value, u32 cpu, int map_id)
 	get_cpu_map_entry(rcpu); /* 1-refcnt for being in cmap->cpu_map[] */
 	get_cpu_map_entry(rcpu); /* 1-refcnt for kthread */
 
+	if (fd > 0 && __cpu_map_load_bpf_program(rcpu, fd))
+		goto free_ptr_ring;
+
 	/* Make sure kthread runs on a single CPU */
 	kthread_bind(rcpu->kthread, cpu);
 	wake_up_process(rcpu->kthread);
@@ -415,6 +510,8 @@ static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
 
 	old_rcpu = xchg(&cmap->cpu_map[key_cpu], rcpu);
 	if (old_rcpu) {
+		if (old_rcpu->prog)
+			bpf_prog_put(old_rcpu->prog);
 		call_rcu(&old_rcpu->rcu, __cpu_map_entry_free);
 		INIT_WORK(&old_rcpu->kthread_stop_wq, cpu_map_kthread_stop);
 		schedule_work(&old_rcpu->kthread_stop_wq);
diff --git a/net/core/dev.c b/net/core/dev.c
index 6bc2388141f6..2867df05cf82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5440,6 +5440,8 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 		for (i = 0; i < new->aux->used_map_cnt; i++) {
 			if (dev_map_can_have_prog(new->aux->used_maps[i]))
 				return -EINVAL;
+			if (cpu_map_prog_allowed(new->aux->used_maps[i]))
+				return -EINVAL;
 		}
 	}
 
@@ -8864,6 +8866,12 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			return -EINVAL;
 		}
 
+		if (prog->expected_attach_type == BPF_XDP_CPUMAP) {
+			NL_SET_ERR_MSG(extack, "BPF_XDP_CPUMAP programs can not be attached to a device");
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+
 		/* prog->aux->id may be 0 for orphaned device-bound progs */
 		if (prog->aux->id && prog->aux->id == prog_id) {
 			bpf_prog_put(prog);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a45d61bc886e..dec1d5e422b2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET4_GETSOCKNAME,
 	BPF_CGROUP_INET6_GETSOCKNAME,
 	BPF_XDP_DEVMAP,
+	BPF_XDP_CPUMAP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3781,6 +3782,10 @@ struct bpf_devmap_val {
  */
 struct bpf_cpumap_val {
 	__u32 qsize;	/* queue size */
+	union {
+		int   fd;	/* prog fd on map write */
+		__u32 id;	/* prog id on map read */
+	} bpf_prog;
 };
 
 enum sk_action {
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 3/8] cpumap: formalize map value as a named struct
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

As it has been already done for devmap, introduce 'struct bpf_cpumap_val'
to formalize the expected values that can be passed in for a CPUMAP.
Update cpumap code to use the struct.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/bpf/cpumap.c            | 25 +++++++++++++------------
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 19684813faae..a45d61bc886e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3774,6 +3774,15 @@ struct bpf_devmap_val {
 	} bpf_prog;
 };
 
+/* CPUMAP map-value layout
+ *
+ * The struct data-layout of map-value is a configuration interface.
+ * New members can only be added to the end of this structure.
+ */
+struct bpf_cpumap_val {
+	__u32 qsize;	/* queue size */
+};
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 27595fc6da56..8951f187f6cf 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -52,7 +52,6 @@ struct xdp_bulk_queue {
 struct bpf_cpu_map_entry {
 	u32 cpu;    /* kthread CPU and map index */
 	int map_id; /* Back reference to map */
-	u32 qsize;  /* Queue size placeholder for map lookup */
 
 	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
 	struct xdp_bulk_queue __percpu *bulkq;
@@ -66,6 +65,8 @@ struct bpf_cpu_map_entry {
 
 	atomic_t refcnt; /* Control when this struct can be free'ed */
 	struct rcu_head rcu;
+
+	struct bpf_cpumap_val value;
 };
 
 struct bpf_cpu_map {
@@ -307,8 +308,8 @@ static int cpu_map_kthread_run(void *data)
 	return 0;
 }
 
-static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu,
-						       int map_id)
+static struct bpf_cpu_map_entry *
+__cpu_map_entry_alloc(struct bpf_cpumap_val *value, u32 cpu, int map_id)
 {
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct bpf_cpu_map_entry *rcpu;
@@ -338,13 +339,13 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu,
 	if (!rcpu->queue)
 		goto free_bulkq;
 
-	err = ptr_ring_init(rcpu->queue, qsize, gfp);
+	err = ptr_ring_init(rcpu->queue, value->qsize, gfp);
 	if (err)
 		goto free_queue;
 
 	rcpu->cpu    = cpu;
 	rcpu->map_id = map_id;
-	rcpu->qsize  = qsize;
+	rcpu->value.qsize  = value->qsize;
 
 	/* Setup kthread */
 	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
@@ -437,12 +438,12 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
 			       u64 map_flags)
 {
 	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
+	struct bpf_cpumap_val cpumap_value = {};
 	struct bpf_cpu_map_entry *rcpu;
-
 	/* Array index key correspond to CPU number */
 	u32 key_cpu = *(u32 *)key;
-	/* Value is the queue size */
-	u32 qsize = *(u32 *)value;
+
+	memcpy(&cpumap_value, value, map->value_size);
 
 	if (unlikely(map_flags > BPF_EXIST))
 		return -EINVAL;
@@ -450,18 +451,18 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -E2BIG;
 	if (unlikely(map_flags == BPF_NOEXIST))
 		return -EEXIST;
-	if (unlikely(qsize > 16384)) /* sanity limit on qsize */
+	if (unlikely(cpumap_value.qsize > 16384)) /* sanity limit on qsize */
 		return -EOVERFLOW;
 
 	/* Make sure CPU is a valid possible cpu */
 	if (key_cpu >= nr_cpumask_bits || !cpu_possible(key_cpu))
 		return -ENODEV;
 
-	if (qsize == 0) {
+	if (cpumap_value.qsize == 0) {
 		rcpu = NULL; /* Same as deleting */
 	} else {
 		/* Updating qsize cause re-allocation of bpf_cpu_map_entry */
-		rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
+		rcpu = __cpu_map_entry_alloc(&cpumap_value, key_cpu, map->id);
 		if (!rcpu)
 			return -ENOMEM;
 		rcpu->cmap = cmap;
@@ -523,7 +524,7 @@ static void *cpu_map_lookup_elem(struct bpf_map *map, void *key)
 	struct bpf_cpu_map_entry *rcpu =
 		__cpu_map_lookup_elem(map, *(u32 *)key);
 
-	return rcpu ? &rcpu->qsize : NULL;
+	return rcpu ? &rcpu->value : NULL;
 }
 
 static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 19684813faae..a45d61bc886e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3774,6 +3774,15 @@ struct bpf_devmap_val {
 	} bpf_prog;
 };
 
+/* CPUMAP map-value layout
+ *
+ * The struct data-layout of map-value is a configuration interface.
+ * New members can only be added to the end of this structure.
+ */
+struct bpf_cpumap_val {
+	__u32 qsize;	/* queue size */
+};
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 2/8] samples/bpf: xdp_redirect_cpu_user: do not update bpf maps in option loop
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

Do not update xdp_redirect_cpu maps running while option loop but
defer it after all available options have been parsed. This is a
preliminary patch to pass the program name we want to attach to the
map entries as a user option

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 samples/bpf/xdp_redirect_cpu_user.c | 36 +++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index f3468168982e..1a054737c35a 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -684,6 +684,7 @@ int main(int argc, char **argv)
 	int add_cpu = -1;
 	int opt, err;
 	int prog_fd;
+	int *cpu, i;
 	__u32 qsize;
 
 	n_cpus = get_nprocs_conf();
@@ -719,6 +720,13 @@ int main(int argc, char **argv)
 	}
 	mark_cpus_unavailable();
 
+	cpu = malloc(n_cpus * sizeof(int));
+	if (!cpu) {
+		fprintf(stderr, "failed to allocate cpu array\n");
+		return EXIT_FAIL;
+	}
+	memset(cpu, 0, n_cpus * sizeof(int));
+
 	/* Parse commands line args */
 	while ((opt = getopt_long(argc, argv, "hSd:s:p:q:c:xzF",
 				  long_options, &longindex)) != -1) {
@@ -763,8 +771,7 @@ int main(int argc, char **argv)
 					errno, strerror(errno));
 				goto error;
 			}
-			create_cpu_entry(add_cpu, qsize, added_cpus, true);
-			added_cpus++;
+			cpu[added_cpus++] = add_cpu;
 			break;
 		case 'q':
 			qsize = atoi(optarg);
@@ -775,6 +782,7 @@ int main(int argc, char **argv)
 		case 'h':
 		error:
 		default:
+			free(cpu);
 			usage(argv, obj);
 			return EXIT_FAIL_OPTION;
 		}
@@ -787,16 +795,21 @@ int main(int argc, char **argv)
 	if (ifindex == -1) {
 		fprintf(stderr, "ERR: required option --dev missing\n");
 		usage(argv, obj);
-		return EXIT_FAIL_OPTION;
+		err = EXIT_FAIL_OPTION;
+		goto out;
 	}
 	/* Required option */
 	if (add_cpu == -1) {
 		fprintf(stderr, "ERR: required option --cpu missing\n");
 		fprintf(stderr, " Specify multiple --cpu option to add more\n");
 		usage(argv, obj);
-		return EXIT_FAIL_OPTION;
+		err = EXIT_FAIL_OPTION;
+		goto out;
 	}
 
+	for (i = 0; i < added_cpus; i++)
+		create_cpu_entry(cpu[i], qsize, i, true);
+
 	/* Remove XDP program when program is interrupted or killed */
 	signal(SIGINT, int_exit);
 	signal(SIGTERM, int_exit);
@@ -804,27 +817,32 @@ int main(int argc, char **argv)
 	prog = bpf_object__find_program_by_title(obj, prog_name);
 	if (!prog) {
 		fprintf(stderr, "bpf_object__find_program_by_title failed\n");
-		return EXIT_FAIL;
+		err = EXIT_FAIL;
+		goto out;
 	}
 
 	prog_fd = bpf_program__fd(prog);
 	if (prog_fd < 0) {
 		fprintf(stderr, "bpf_program__fd failed\n");
-		return EXIT_FAIL;
+		err = EXIT_FAIL;
+		goto out;
 	}
 
 	if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
 		fprintf(stderr, "link set xdp fd failed\n");
-		return EXIT_FAIL_XDP;
+		err = EXIT_FAIL_XDP;
+		goto out;
 	}
 
 	err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len);
 	if (err) {
 		printf("can't get prog info - %s\n", strerror(errno));
-		return err;
+		goto out;
 	}
 	prog_id = info.id;
 
 	stats_poll(interval, use_separators, prog_name, stress_mode);
-	return EXIT_OK;
+out:
+	free(cpu);
+	return err;
 }
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 1/8] net: Refactor xdp_convert_buff_to_frame
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern,
	David Ahern
In-Reply-To: <cover.1592606391.git.lorenzo@kernel.org>

From: David Ahern <dahern@digitalocean.com>

Move the guts of xdp_convert_buff_to_frame to a new helper,
xdp_update_frame_from_buff so it can be reused removing code duplication

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David Ahern <dahern@digitalocean.com>
---
 include/net/xdp.h | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 609f819ed08b..ab1c503808a4 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -121,39 +121,48 @@ void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
 	xdp->frame_sz = frame->frame_sz;
 }
 
-/* Convert xdp_buff to xdp_frame */
 static inline
-struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
+int xdp_update_frame_from_buff(struct xdp_buff *xdp,
+			       struct xdp_frame *xdp_frame)
 {
-	struct xdp_frame *xdp_frame;
-	int metasize;
-	int headroom;
-
-	if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
-		return xdp_convert_zc_to_xdp_frame(xdp);
+	int metasize, headroom;
 
 	/* Assure headroom is available for storing info */
 	headroom = xdp->data - xdp->data_hard_start;
 	metasize = xdp->data - xdp->data_meta;
 	metasize = metasize > 0 ? metasize : 0;
 	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
-		return NULL;
+		return -ENOMEM;
 
 	/* Catch if driver didn't reserve tailroom for skb_shared_info */
 	if (unlikely(xdp->data_end > xdp_data_hard_end(xdp))) {
 		XDP_WARN("Driver BUG: missing reserved tailroom");
-		return NULL;
+		return -ENOMEM;
 	}
 
-	/* Store info in top of packet */
-	xdp_frame = xdp->data_hard_start;
-
 	xdp_frame->data = xdp->data;
 	xdp_frame->len  = xdp->data_end - xdp->data;
 	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
 	xdp_frame->metasize = metasize;
 	xdp_frame->frame_sz = xdp->frame_sz;
 
+	return 0;
+}
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdp_frame;
+
+	if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
+		return xdp_convert_zc_to_xdp_frame(xdp);
+
+	/* Store info in top of packet */
+	xdp_frame = xdp->data_hard_start;
+	if (unlikely(xdp_update_frame_from_buff(xdp, xdp_frame) < 0))
+		return NULL;
+
 	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
 	xdp_frame->mem = xdp->rxq->mem;
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 bpf-next 0/8] introduce support for XDP programs in CPUMAP
From: Lorenzo Bianconi @ 2020-06-19 22:57 UTC (permalink / raw)
  To: bpf, netdev; +Cc: davem, ast, brouer, daniel, toke, lorenzo.bianconi, dsahern

Similar to what David Ahern proposed in [1] for DEVMAPs, introduce the
capability to attach and run a XDP program to CPUMAP entries.
The idea behind this feature is to add the possibility to define on which CPU
run the eBPF program if the underlying hw does not support RSS.
I respin patch 1/6 from a previous series sent by David [2].
The functionality has been tested on Marvell Espressobin, i40e and mlx5.
Detailed tests results can be found here:
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap04-map-xdp-prog.org

Changes since v1:
- added performance test results
- added kselftest support
- fixed memory accounting with page_pool
- extended xdp_redirect_cpu_user.c to load an external program to perform
  redirect
- reported ifindex to attached eBPF program
- moved bpf_cpumap_val definition to include/uapi/linux/bpf.h

[1] https://patchwork.ozlabs.org/project/netdev/cover/20200529220716.75383-1-dsahern@kernel.org/
[2] https://patchwork.ozlabs.org/project/netdev/patch/20200513014607.40418-2-dsahern@kernel.org/


David Ahern (1):
  net: Refactor xdp_convert_buff_to_frame

Lorenzo Bianconi (7):
  samples/bpf: xdp_redirect_cpu_user: do not update bpf maps in option
    loop
  cpumap: formalize map value as a named struct
  bpf: cpumap: add the possibility to attach an eBPF program to cpumap
  bpf: cpumap: implement XDP_REDIRECT for eBPF programs attached to map
    entries
  libbpf: add SEC name for xdp programs attached to CPUMAP
  samples/bpf: xdp_redirect_cpu: load a eBPF program on cpumap
  selftest: add tests for XDP programs in CPUMAP entries

 include/linux/bpf.h                           |   6 +
 include/net/xdp.h                             |  41 ++--
 include/trace/events/xdp.h                    |  16 +-
 include/uapi/linux/bpf.h                      |  14 ++
 kernel/bpf/cpumap.c                           | 161 +++++++++++---
 net/core/dev.c                                |   8 +
 samples/bpf/xdp_redirect_cpu_kern.c           |  25 ++-
 samples/bpf/xdp_redirect_cpu_user.c           | 208 ++++++++++++++++--
 tools/include/uapi/linux/bpf.h                |  14 ++
 tools/lib/bpf/libbpf.c                        |   2 +
 .../bpf/prog_tests/xdp_cpumap_attach.c        |  70 ++++++
 .../bpf/progs/test_xdp_with_cpumap_helpers.c  |  38 ++++
 12 files changed, 531 insertions(+), 72 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_cpumap_attach.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_with_cpumap_helpers.c

-- 
2.26.2


^ permalink raw reply

* [net-next,v1,1/5] l3mdev: add infrastructure for table to VRF mapping
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer
In-Reply-To: <20200619225447.1445-1-andrea.mayer@uniroma2.it>

Add infrastructure to l3mdev (the core code for Layer 3 master devices) in
order to find out the corresponding VRF device for a given table id.
Therefore, the l3mdev implementations:
 - can register a callback that returns the device index of the l3mdev
   associated with a given table id;
 - can offer the lookup function (table to VRF device).

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 include/net/l3mdev.h | 39 +++++++++++++++++++
 net/l3mdev/l3mdev.c  | 93 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index e942372b077b..031c661aa14d 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -10,6 +10,16 @@
 #include <net/dst.h>
 #include <net/fib_rules.h>
 
+enum l3mdev_type {
+	L3MDEV_TYPE_UNSPEC,
+	L3MDEV_TYPE_VRF,
+	__L3MDEV_TYPE_MAX
+};
+
+#define L3MDEV_TYPE_MAX (__L3MDEV_TYPE_MAX - 1)
+
+typedef int (*lookup_by_table_id_t)(struct net *net, u32 table_d);
+
 /**
  * struct l3mdev_ops - l3mdev operations
  *
@@ -37,6 +47,15 @@ struct l3mdev_ops {
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
 
+int l3mdev_table_lookup_register(enum l3mdev_type l3type,
+				 lookup_by_table_id_t fn);
+
+void l3mdev_table_lookup_unregister(enum l3mdev_type l3type,
+				    lookup_by_table_id_t fn);
+
+int l3mdev_ifindex_lookup_by_table_id(enum l3mdev_type l3type, struct net *net,
+				      u32 table_id);
+
 int l3mdev_fib_rule_match(struct net *net, struct flowi *fl,
 			  struct fib_lookup_arg *arg);
 
@@ -280,6 +299,26 @@ struct sk_buff *l3mdev_ip6_out(struct sock *sk, struct sk_buff *skb)
 	return skb;
 }
 
+static inline
+int l3mdev_table_lookup_register(enum l3mdev_type l3type,
+				 lookup_by_table_id_t fn)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline
+void l3mdev_table_lookup_unregister(enum l3mdev_type l3type,
+				    lookup_by_table_id_t fn)
+{
+}
+
+static inline
+int l3mdev_ifindex_lookup_by_table_id(enum l3mdev_type l3type, struct net *net,
+				      u32 table_id)
+{
+	return -ENODEV;
+}
+
 static inline
 int l3mdev_fib_rule_match(struct net *net, struct flowi *fl,
 			  struct fib_lookup_arg *arg)
diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
index f35899d45a9a..e71ca5aec684 100644
--- a/net/l3mdev/l3mdev.c
+++ b/net/l3mdev/l3mdev.c
@@ -9,6 +9,99 @@
 #include <net/fib_rules.h>
 #include <net/l3mdev.h>
 
+static DEFINE_SPINLOCK(l3mdev_lock);
+
+struct l3mdev_handler {
+	lookup_by_table_id_t dev_lookup;
+};
+
+static struct l3mdev_handler l3mdev_handlers[L3MDEV_TYPE_MAX + 1];
+
+static int l3mdev_check_type(enum l3mdev_type l3type)
+{
+	if (l3type <= L3MDEV_TYPE_UNSPEC || l3type > L3MDEV_TYPE_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
+int l3mdev_table_lookup_register(enum l3mdev_type l3type,
+				 lookup_by_table_id_t fn)
+{
+	struct l3mdev_handler *hdlr;
+	int res;
+
+	res = l3mdev_check_type(l3type);
+	if (res)
+		return res;
+
+	hdlr = &l3mdev_handlers[l3type];
+
+	spin_lock(&l3mdev_lock);
+
+	if (hdlr->dev_lookup) {
+		res = -EBUSY;
+		goto unlock;
+	}
+
+	hdlr->dev_lookup = fn;
+	res = 0;
+
+unlock:
+	spin_unlock(&l3mdev_lock);
+
+	return res;
+}
+EXPORT_SYMBOL_GPL(l3mdev_table_lookup_register);
+
+void l3mdev_table_lookup_unregister(enum l3mdev_type l3type,
+				    lookup_by_table_id_t fn)
+{
+	struct l3mdev_handler *hdlr;
+
+	if (l3mdev_check_type(l3type))
+		return;
+
+	hdlr = &l3mdev_handlers[l3type];
+
+	spin_lock(&l3mdev_lock);
+
+	if (hdlr->dev_lookup == fn)
+		hdlr->dev_lookup = NULL;
+
+	spin_unlock(&l3mdev_lock);
+}
+EXPORT_SYMBOL_GPL(l3mdev_table_lookup_unregister);
+
+int l3mdev_ifindex_lookup_by_table_id(enum l3mdev_type l3type,
+				      struct net *net, u32 table_id)
+{
+	lookup_by_table_id_t lookup;
+	struct l3mdev_handler *hdlr;
+	int ifindex = -EINVAL;
+	int res;
+
+	res = l3mdev_check_type(l3type);
+	if (res)
+		return res;
+
+	hdlr = &l3mdev_handlers[l3type];
+
+	spin_lock(&l3mdev_lock);
+
+	lookup = hdlr->dev_lookup;
+	if (!lookup)
+		goto unlock;
+
+	ifindex = lookup(net, table_id);
+
+unlock:
+	spin_unlock(&l3mdev_lock);
+
+	return ifindex;
+}
+EXPORT_SYMBOL_GPL(l3mdev_ifindex_lookup_by_table_id);
+
 /**
  *	l3mdev_master_ifindex - get index of L3 master device
  *	@dev: targeted interface
-- 
2.20.1


^ permalink raw reply related

* [net-next,v1,5/5] selftests: add selftest for the VRF strict mode
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer
In-Reply-To: <20200619225447.1445-1-andrea.mayer@uniroma2.it>

The new strict mode functionality is tested in different configurations and
on different network namespaces.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 .../selftests/net/vrf_strict_mode_test.sh     | 390 ++++++++++++++++++
 1 file changed, 390 insertions(+)
 create mode 100755 tools/testing/selftests/net/vrf_strict_mode_test.sh

diff --git a/tools/testing/selftests/net/vrf_strict_mode_test.sh b/tools/testing/selftests/net/vrf_strict_mode_test.sh
new file mode 100755
index 000000000000..5274f4a1fba1
--- /dev/null
+++ b/tools/testing/selftests/net/vrf_strict_mode_test.sh
@@ -0,0 +1,390 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test is designed for testing the new VRF strict_mode functionality.
+
+ret=0
+
+# identifies the "init" network namespace which is often called root network
+# namespace.
+INIT_NETNS_NAME="init"
+
+PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
+
+log_test()
+{
+	local rc=$1
+	local expected=$2
+	local msg="$3"
+
+	if [ ${rc} -eq ${expected} ]; then
+		nsuccess=$((nsuccess+1))
+		printf "\n    TEST: %-60s  [ OK ]\n" "${msg}"
+	else
+		ret=1
+		nfail=$((nfail+1))
+		printf "\n    TEST: %-60s  [FAIL]\n" "${msg}"
+		if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+			echo
+			echo "hit enter to continue, 'q' to quit"
+			read a
+			[ "$a" = "q" ] && exit 1
+		fi
+	fi
+}
+
+print_log_test_results()
+{
+	if [ "$TESTS" != "none" ]; then
+		printf "\nTests passed: %3d\n" ${nsuccess}
+		printf "Tests failed: %3d\n"   ${nfail}
+	fi
+}
+
+log_section()
+{
+	echo
+	echo "################################################################################"
+	echo "TEST SECTION: $*"
+	echo "################################################################################"
+}
+
+ip_expand_args()
+{
+	local nsname=$1
+	local nsarg=""
+
+	if [ "${nsname}" != "${INIT_NETNS_NAME}" ]; then
+		nsarg="-netns ${nsname}"
+	fi
+
+	echo "${nsarg}"
+}
+
+vrf_count()
+{
+	local nsname=$1
+	local nsarg="$(ip_expand_args ${nsname})"
+
+	ip ${nsarg} -o link show type vrf | wc -l
+}
+
+count_vrf_by_table_id()
+{
+	local nsname=$1
+	local tableid=$2
+	local nsarg="$(ip_expand_args ${nsname})"
+
+	ip ${nsarg} -d -o link show type vrf | grep "table ${tableid}" | wc -l
+}
+
+add_vrf()
+{
+	local nsname=$1
+	local vrfname=$2
+	local vrftable=$3
+	local nsarg="$(ip_expand_args ${nsname})"
+
+	ip ${nsarg} link add ${vrfname} type vrf table ${vrftable} &>/dev/null
+}
+
+add_vrf_and_check()
+{
+	local nsname=$1
+	local vrfname=$2
+	local vrftable=$3
+	local cnt
+	local rc
+
+	add_vrf ${nsname} ${vrfname} ${vrftable}; rc=$?
+
+	cnt=$(count_vrf_by_table_id ${nsname} ${vrftable})
+
+	log_test ${rc} 0 "${nsname}: add vrf ${vrfname}, ${cnt} vrfs for table ${vrftable}"
+}
+
+add_vrf_and_check_fail()
+{
+	local nsname=$1
+	local vrfname=$2
+	local vrftable=$3
+	local cnt
+	local rc
+
+	add_vrf ${nsname} ${vrfname} ${vrftable}; rc=$?
+
+	cnt=$(count_vrf_by_table_id ${nsname} ${vrftable})
+
+	log_test ${rc} 2 "${nsname}: CANNOT add vrf ${vrfname}, ${cnt} vrfs for table ${vrftable}"
+}
+
+del_vrf_and_check()
+{
+	local nsname=$1
+	local vrfname=$2
+	local nsarg="$(ip_expand_args ${nsname})"
+
+	ip ${nsarg} link del ${vrfname}
+	log_test $? 0 "${nsname}: remove vrf ${vrfname}"
+}
+
+config_vrf_and_check()
+{
+	local nsname=$1
+	local addr=$2
+	local vrfname=$3
+	local nsarg="$(ip_expand_args ${nsname})"
+
+	ip ${nsarg} link set dev ${vrfname} up && \
+		ip ${nsarg} addr add ${addr} dev ${vrfname}
+	log_test $? 0 "${nsname}: vrf ${vrfname} up, addr ${addr}"
+}
+
+read_strict_mode()
+{
+	local nsname=$1
+	local rval
+	local rc=0
+	local nsexec=""
+
+	if [ "${nsname}" != "${INIT_NETNS_NAME}" ]; then
+		# a custom network namespace is provided
+		nsexec="ip netns exec ${nsname}"
+	fi
+
+	rval="$(${nsexec} bash -c "cat /proc/sys/net/vrf/strict_mode" | \
+		grep -E "^[0-1]$")" &> /dev/null
+	if [ $? -ne 0 ]; then
+		# set errors
+		rval=255
+		rc=1
+	fi
+
+	# on success, rval can be only 0 or 1; on error, rval is equal to 255
+	echo ${rval}
+	return ${rc}
+}
+
+read_strict_mode_compare_and_check()
+{
+	local nsname=$1
+	local expected=$2
+	local res
+
+	res="$(read_strict_mode ${nsname})"
+	log_test ${res} ${expected} "${nsname}: check strict_mode=${res}"
+}
+
+set_strict_mode()
+{
+	local nsname=$1
+	local val=$2
+	local nsexec=""
+
+	if [ "${nsname}" != "${INIT_NETNS_NAME}" ]; then
+		# a custom network namespace is provided
+		nsexec="ip netns exec ${nsname}"
+	fi
+
+	${nsexec} bash -c "echo ${val} >/proc/sys/net/vrf/strict_mode" &>/dev/null
+}
+
+enable_strict_mode()
+{
+	local nsname=$1
+
+	set_strict_mode ${nsname} 1
+}
+
+disable_strict_mode()
+{
+	local nsname=$1
+
+	set_strict_mode ${nsname} 0
+}
+
+disable_strict_mode_and_check()
+{
+	local nsname=$1
+
+	disable_strict_mode ${nsname}
+	log_test $? 0 "${nsname}: disable strict_mode (=0)"
+}
+
+enable_strict_mode_and_check()
+{
+	local nsname=$1
+
+	enable_strict_mode ${nsname}
+	log_test $? 0 "${nsname}: enable strict_mode (=1)"
+}
+
+enable_strict_mode_and_check_fail()
+{
+	local nsname=$1
+
+	enable_strict_mode ${nsname}
+	log_test $? 1 "${nsname}: CANNOT enable strict_mode"
+}
+
+strict_mode_check_default()
+{
+	local nsname=$1
+	local strictmode
+	local vrfcnt
+
+	vrfcnt=$(vrf_count ${nsname})
+	strictmode=$(read_strict_mode ${nsname})
+	log_test ${strictmode} 0 "${nsname}: strict_mode=0 by default, ${vrfcnt} vrfs"
+}
+
+setup()
+{
+	modprobe vrf
+
+	ip netns add testns
+	ip netns exec testns ip link set lo up
+}
+
+cleanup()
+{
+	ip netns del testns 2>/dev/null
+
+	ip link del vrf100 2>/dev/null
+	ip link del vrf101 2>/dev/null
+	ip link del vrf102 2>/dev/null
+
+	echo 0 >/proc/sys/net/vrf/strict_mode 2>/dev/null
+}
+
+vrf_strict_mode_tests_init()
+{
+	vrf_strict_mode_check_support init
+
+	strict_mode_check_default init
+
+	add_vrf_and_check init vrf100 100
+	config_vrf_and_check init 172.16.100.1/24 vrf100
+
+	enable_strict_mode_and_check init
+
+	add_vrf_and_check_fail init vrf101 100
+
+	disable_strict_mode_and_check init
+
+	add_vrf_and_check init vrf101 100
+	config_vrf_and_check init 172.16.101.1/24 vrf101
+
+	enable_strict_mode_and_check_fail init
+
+	del_vrf_and_check init vrf101
+
+	enable_strict_mode_and_check init
+
+	add_vrf_and_check init vrf102 102
+	config_vrf_and_check init 172.16.102.1/24 vrf102
+
+	# the strict_modle is enabled in the init
+}
+
+vrf_strict_mode_tests_testns()
+{
+	vrf_strict_mode_check_support testns
+
+	strict_mode_check_default testns
+
+	enable_strict_mode_and_check testns
+
+	add_vrf_and_check testns vrf100 100
+	config_vrf_and_check testns 10.0.100.1/24 vrf100
+
+	add_vrf_and_check_fail testns vrf101 100
+
+	add_vrf_and_check_fail testns vrf102 100
+
+	add_vrf_and_check testns vrf200 200
+
+	disable_strict_mode_and_check testns
+
+	add_vrf_and_check testns vrf101 100
+
+	add_vrf_and_check testns vrf102 100
+
+	#the strict_mode is disabled in the testns
+}
+
+vrf_strict_mode_tests_mix()
+{
+	read_strict_mode_compare_and_check init 1
+
+	read_strict_mode_compare_and_check testns 0
+
+	del_vrf_and_check testns vrf101
+
+	del_vrf_and_check testns vrf102
+
+	disable_strict_mode_and_check init
+
+	enable_strict_mode_and_check testns
+
+	enable_strict_mode_and_check init
+	enable_strict_mode_and_check init
+
+	disable_strict_mode_and_check testns
+	disable_strict_mode_and_check testns
+
+	read_strict_mode_compare_and_check init 1
+
+	read_strict_mode_compare_and_check testns 0
+}
+
+vrf_strict_mode_tests()
+{
+	log_section "VRF strict_mode test on init network namespace"
+	vrf_strict_mode_tests_init
+
+	log_section "VRF strict_mode test on testns network namespace"
+	vrf_strict_mode_tests_testns
+
+	log_section "VRF strict_mode test mixing init and testns network namespaces"
+	vrf_strict_mode_tests_mix
+}
+
+vrf_strict_mode_check_support()
+{
+	local nsname=$1
+	local output
+	local rc
+
+	output="$(lsmod | grep '^vrf' | awk '{print $1}')"
+	if [ -z "${output}" ]; then
+		modinfo vrf || return $?
+	fi
+
+	# we do not care about the value of the strict_mode; we only check if
+	# the strict_mode parameter is available or not.
+	read_strict_mode ${nsname} &>/dev/null; rc=$?
+	log_test ${rc} 0 "${nsname}: net.vrf.strict_mode is available"
+
+	return ${rc}
+}
+
+if [ "$(id -u)" -ne 0 ];then
+	echo "SKIP: Need root privileges"
+	exit 0
+fi
+
+if [ ! -x "$(command -v ip)" ]; then
+	echo "SKIP: Could not run test without ip tool"
+	exit 0
+fi
+
+cleanup &> /dev/null
+
+setup
+vrf_strict_mode_tests
+cleanup
+
+print_log_test_results
+
+exit $ret
-- 
2.20.1


^ permalink raw reply related

* [net-next,v1,4/5] vrf: add l3mdev registration for table to VRF device lookup
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer
In-Reply-To: <20200619225447.1445-1-andrea.mayer@uniroma2.it>

During the initialization phase of the VRF module, the callback for table
to VRF device lookup is registered in l3mdev.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 drivers/net/vrf.c | 59 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index c53e57354d2c..46599606ff10 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -183,6 +183,19 @@ static struct vrf_map *netns_vrf_map_by_dev(struct net_device *dev)
 	return netns_vrf_map(dev_net(dev));
 }
 
+static int vrf_map_elem_get_vrf_ifindex(struct vrf_map_elem *me)
+{
+	struct list_head *me_head = &me->vrf_list;
+	struct net_vrf *vrf;
+
+	if (list_empty(me_head))
+		return -ENODEV;
+
+	vrf = list_first_entry(me_head, struct net_vrf, me_list);
+
+	return vrf->ifindex;
+}
+
 static struct vrf_map_elem *vrf_map_elem_alloc(gfp_t flags)
 {
 	struct vrf_map_elem *me;
@@ -384,6 +397,34 @@ static void vrf_map_unregister_dev(struct net_device *dev)
 	vrf_map_unlock(vmap);
 }
 
+/* return the vrf device index associated with the table_id */
+static int vrf_ifindex_lookup_by_table_id(struct net *net, u32 table_id)
+{
+	struct vrf_map *vmap = netns_vrf_map(net);
+	struct vrf_map_elem *me;
+	int ifindex;
+
+	vrf_map_lock(vmap);
+
+	if (!vmap->strict_mode) {
+		ifindex = -EPERM;
+		goto unlock;
+	}
+
+	me = vrf_map_lookup_elem(vmap, table_id);
+	if (!me) {
+		ifindex = -ENODEV;
+		goto unlock;
+	}
+
+	ifindex = vrf_map_elem_get_vrf_ifindex(me);
+
+unlock:
+	vrf_map_unlock(vmap);
+
+	return ifindex;
+}
+
 /* by default VRF devices do not have a qdisc and are expected
  * to be created with only a single queue.
  */
@@ -1847,14 +1888,24 @@ static int __init vrf_init_module(void)
 	if (rc < 0)
 		goto error;
 
+	rc = l3mdev_table_lookup_register(L3MDEV_TYPE_VRF,
+					  vrf_ifindex_lookup_by_table_id);
+	if (rc < 0)
+		goto unreg_pernet;
+
 	rc = rtnl_link_register(&vrf_link_ops);
-	if (rc < 0) {
-		unregister_pernet_subsys(&vrf_net_ops);
-		goto error;
-	}
+	if (rc < 0)
+		goto table_lookup_unreg;
 
 	return 0;
 
+table_lookup_unreg:
+	l3mdev_table_lookup_unregister(L3MDEV_TYPE_VRF,
+				       vrf_ifindex_lookup_by_table_id);
+
+unreg_pernet:
+	unregister_pernet_subsys(&vrf_net_ops);
+
 error:
 	unregister_netdevice_notifier(&vrf_notifier_block);
 	return rc;
-- 
2.20.1


^ permalink raw reply related

* [net-next,v1,3/5] vrf: add sysctl parameter for strict mode
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer
In-Reply-To: <20200619225447.1445-1-andrea.mayer@uniroma2.it>

Add net.vrf.strict_mode sysctl parameter.

When net.vrf.strict_mode=0 (default) it is possible to associate multiple
VRF devices to the same table. Conversely, when net.vrf.strict_mode=1 a
table can be associated to a single VRF device.

When switching from net.vrf.strict_mode=0 to net.vrf.strict_mode=1, a check
is performed to verify that all tables have at most one VRF associated,
otherwise the switch is not allowed.

The net.vrf.strict_mode parameter is per network namespace.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 drivers/net/vrf.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 098fdabaa4c5..c53e57354d2c 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -103,6 +103,7 @@ struct netns_vrf {
 	bool add_fib_rules;
 
 	struct vrf_map vmap;
+	struct ctl_table_header	*ctl_hdr;
 };
 
 struct net_vrf {
@@ -246,6 +247,52 @@ static void vrf_map_unlock(struct vrf_map *vmap) __releases(&vmap->vmap_lock)
 	spin_unlock(&vmap->vmap_lock);
 }
 
+static bool vrf_strict_mode(struct vrf_map *vmap)
+{
+	bool strict_mode;
+
+	vrf_map_lock(vmap);
+	strict_mode = vmap->strict_mode;
+	vrf_map_unlock(vmap);
+
+	return strict_mode;
+}
+
+static int vrf_strict_mode_change(struct vrf_map *vmap, bool new_mode)
+{
+	bool *cur_mode;
+	int res = 0;
+
+	vrf_map_lock(vmap);
+
+	cur_mode = &vmap->strict_mode;
+	if (*cur_mode == new_mode)
+		goto unlock;
+
+	if (*cur_mode) {
+		/* disable strict mode */
+		*cur_mode = false;
+	} else {
+		if (vmap->shared_tables) {
+			/* we cannot allow strict_mode because there are some
+			 * vrfs that share one or more tables.
+			 */
+			res = -EBUSY;
+			goto unlock;
+		}
+
+		/* no tables are shared among vrfs, so we can go back
+		 * to 1:1 association between a vrf with its table.
+		 */
+		*cur_mode = true;
+	}
+
+unlock:
+	vrf_map_unlock(vmap);
+
+	return res;
+}
+
 /* called with rtnl lock held */
 static int
 vrf_map_register_dev(struct net_device *dev, struct netlink_ext_ack *extack)
@@ -1702,19 +1749,90 @@ static int vrf_map_init(struct vrf_map *vmap)
 	return 0;
 }
 
+static int vrf_shared_table_handler(struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = (struct net *)table->extra1;
+	struct vrf_map *vmap = netns_vrf_map(net);
+	int proc_strict_mode = 0;
+	struct ctl_table tmp = {
+		.procname	= table->procname,
+		.data		= &proc_strict_mode,
+		.maxlen		= sizeof(int),
+		.mode		= table->mode,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	};
+	int ret;
+
+	if (!write)
+		proc_strict_mode = vrf_strict_mode(vmap);
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0)
+		ret = vrf_strict_mode_change(vmap, (bool)proc_strict_mode);
+
+	return ret;
+}
+
+static const struct ctl_table vrf_table[] = {
+	{
+		.procname	= "strict_mode",
+		.data		= NULL,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= vrf_shared_table_handler,
+		/* set by the vrf_netns_init */
+		.extra1		= NULL,
+	},
+	{ },
+};
+
 /* Initialize per network namespace state */
 static int __net_init vrf_netns_init(struct net *net)
 {
 	struct netns_vrf *nn_vrf = net_generic(net, vrf_net_id);
+	struct ctl_table *table;
+	int res;
 
 	nn_vrf->add_fib_rules = true;
 	vrf_map_init(&nn_vrf->vmap);
 
+	table = kmemdup(vrf_table, sizeof(vrf_table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	/* init the extra1 parameter with the reference to current netns */
+	table[0].extra1 = net;
+
+	nn_vrf->ctl_hdr = register_net_sysctl(net, "net/vrf", table);
+	if (!nn_vrf->ctl_hdr) {
+		res = -ENOMEM;
+		goto free_table;
+	}
+
 	return 0;
+
+free_table:
+	kfree(table);
+
+	return res;
+}
+
+static void __net_exit vrf_netns_exit(struct net *net)
+{
+	struct netns_vrf *nn_vrf = net_generic(net, vrf_net_id);
+	struct ctl_table *table;
+
+	table = nn_vrf->ctl_hdr->ctl_table_arg;
+	unregister_net_sysctl_table(nn_vrf->ctl_hdr);
+	kfree(table);
 }
 
 static struct pernet_operations vrf_net_ops __net_initdata = {
 	.init = vrf_netns_init,
+	.exit = vrf_netns_exit,
 	.id   = &vrf_net_id,
 	.size = sizeof(struct netns_vrf),
 };
-- 
2.20.1


^ permalink raw reply related

* [net-next,v1,0/5] Strict mode for VRF
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer

This patch set adds the new "strict mode" functionality to the Virtual
Routing and Forwarding infrastructure (VRF). Hereafter we discuss the
requirements and the main features of the "strict mode" for VRF.

On VRF creation, it is necessary to specify the associated routing table used
during the lookup operations. Currently, there is no mechanism that avoids
creating multiple VRFs sharing the same routing table. In other words, it is not
possible to force a one-to-one relationship between a specific VRF and the table
associated with it.

The "strict mode" imposes that each VRF can be associated to a routing table
only if such routing table is not already in use by any other VRF.
In particular, the strict mode ensures that:

 1) given a specific routing table, the VRF (if exists) is uniquely identified;
 2) given a specific VRF, the related table is not shared with any other VRF.

Constraints (1) and (2) force a one-to-one relationship between each VRF and the
corresponding routing table.

The strict mode feature is designed to be network-namespace aware and it can be
directly enabled/disabled acting on the "strict_mode" parameter.
Read and write operations are carried out through the classic sysctl command on
net.vrf.strict_mode path, i.e: sysctl -w net.vrf.strict_mode=1.

Only two distinct values {0,1} are accepted by the strict_mode parameter:

 - with strict_mode=0, multiple VRFs can be associated with the same table.
   This is the (legacy) default kernel behavior, the same that we experience
   when the strict mode patch set is not applied;

 - with strict_mode=1, the one-to-one relationship between the VRFs and the
   associated tables is guaranteed. In this configuration, the creation of a VRF
   which refers to a routing table already associated with another VRF fails and
   the error is returned to the user.

The kernel keeps track of the associations between a VRF and the routing table
during the VRF setup, in the "management" plane. Therefore, the strict mode does
not impact the performance or the intrinsic functionality of the data plane in
any way.

When the strict mode is active it is always possible to disable the strict mode,
while the reverse operation is not always allowed.
Setting the strict_mode parameter to 0 is equivalent to removing the one-to-one
constraint between any single VRF and its associated routing table.

Conversely, if the strict mode is disabled and there are multiple VRFs that
refer to the same routing table, then it is prohibited to set the strict_mode
parameter to 1. In this configuration, any attempt to perform the operation will
lead to an error and it will be reported to the user.
To enable strict mode once again (by setting the strict_mode parameter to 1),
you must first remove all the VRFs that share common tables.

There are several use cases which can take advantage from the introduction of
the strict mode feature. In particular, the strict mode allows us to:

  i) guarantee the proper functioning of some applications which deal with
     routing protocols;

 ii) perform some tunneling decap operations which require to use specific
     routing tables for segregating and forwarding the traffic.

Considering (i), the creation of different VRFs that point to the same table
leads to the situation where two different routing entities believe they have
exclusive access to the same table. This leads to the situation where different
routing daemons can conflict for gaining routes control due to overlapping
tables. By enabling strict mode it is possible to prevent this situation which
often occurs due to incorrect configurations done by the users. 
The ability to enable/disable the strict mode functionality does not depend on
the tool used for configuring the networking. In essence, the strict mode patch
solves, at the kernel level, what some other patches [1] had tried to solve at
the userspace level (using only iproute2) with all the related problems.

Considering (ii), the introduction of the strict mode functionality allows us
implementing the SRv6 End.DT4 behavior. Such behavior terminates a SR tunnel and
it forwards the IPv4 traffic according to the routes present in the routing
table supplied during the configuration. The SRv6 End.DT4 can be realized
exploiting the routing capabilities made available by the VRF infrastructure.
This behavior could leverage a specific VRF for forcing the traffic to be
forwarded in accordance with the routes available in the VRF table.
Anyway, in order to make the End.DT4 properly work, it must be guaranteed that
the table used for the route lookup operations is bound to one and only one VRF.
In this way, it is possible to use the table for uniquely retrieving the
associated VRF and for routing packets.

I would like to thank David Ahern for his constant and valuable support during
the design and development phases of this patch set.

Comments, suggestions and improvements are very welcome!

Thanks,
Andrea Mayer

v1
 l3mdev: add infrastructure for table to VRF mapping
  - define l3mdev_lock as static, thanks to Jakub Kicinski;
  - move lookup_by_table_id_t from l3mdev.c to l3mdev.h and update the
    l3mdev_dev_table_lookup_{un}register functions accordingly, thanks to
    David Ahern.

 vrf: track associations between VRF devices and tables
  - change shared_tables type from 'int' to 'u32', thanks to Stephen Hemminger
    and David Ahern;
  - update comments for share_tables.

 vrf: add sysctl parameter for strict mode
  - change type 'void __user *buffer' to 'void *buffer' in argument 3 of
    vrf_shared_table_handler function, thanks to Jakub Kicinski.

[1] https://lore.kernel.org/netdev/20200307205916.15646-1-sharpd@cumulusnetworks.com/

Andrea Mayer (5):
  l3mdev: add infrastructure for table to VRF mapping
  vrf: track associations between VRF devices and tables
  vrf: add sysctl parameter for strict mode
  vrf: add l3mdev registration for table to VRF device lookup
  selftests: add selftest for the VRF strict mode

 drivers/net/vrf.c                             | 450 +++++++++++++++++-
 include/net/l3mdev.h                          |  39 ++
 net/l3mdev/l3mdev.c                           |  93 ++++
 .../selftests/net/vrf_strict_mode_test.sh     | 390 +++++++++++++++
 4 files changed, 963 insertions(+), 9 deletions(-)
 create mode 100755 tools/testing/selftests/net/vrf_strict_mode_test.sh

-- 
2.20.1

^ permalink raw reply

* [net-next,v1,2/5] vrf: track associations between VRF devices and tables
From: Andrea Mayer @ 2020-06-19 22:54 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Shrijeet Mukherjee, Jakub Kicinski,
	Shuah Khan, netdev, linux-kernel, linux-kselftest
  Cc: Donald Sharp, Roopa Prabhu, Dinesh Dutt, Stephen Hemminger,
	Stefano Salsano, Paolo Lungaroni, Ahmed Abdelsalam, Andrea Mayer
In-Reply-To: <20200619225447.1445-1-andrea.mayer@uniroma2.it>

Add the data structures and the processing logic to keep track of the
associations between VRF devices and routing tables.
When a VRF is instantiated, it needs to refer to a given routing table.
For each table, we explicitly keep track of the existing VRFs that refer to
the table.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
---
 drivers/net/vrf.c | 273 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 268 insertions(+), 5 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 43928a1c2f2a..098fdabaa4c5 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -21,6 +21,7 @@
 #include <net/rtnetlink.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/hashtable.h>
+#include <linux/spinlock_types.h>
 
 #include <linux/inetdevice.h>
 #include <net/arp.h>
@@ -35,12 +36,75 @@
 #include <net/netns/generic.h>
 
 #define DRV_NAME	"vrf"
-#define DRV_VERSION	"1.0"
+#define DRV_VERSION	"1.1"
 
 #define FIB_RULE_PREF  1000       /* default preference for FIB rules */
 
+#define HT_MAP_BITS	4
+#define HASH_INITVAL	((u32)0xcafef00d)
+
+struct  vrf_map {
+	DECLARE_HASHTABLE(ht, HT_MAP_BITS);
+	spinlock_t vmap_lock;
+
+	/* shared_tables:
+	 * count how many distinct tables do not comply with the strict mode
+	 * requirement.
+	 * shared_tables value must be 0 in order to enable the strict mode.
+	 *
+	 * example of the evolution of shared_tables:
+	 *                                                        | time
+	 * add  vrf0 --> table 100        shared_tables = 0       | t0
+	 * add  vrf1 --> table 101        shared_tables = 0       | t1
+	 * add  vrf2 --> table 100        shared_tables = 1       | t2
+	 * add  vrf3 --> table 100        shared_tables = 1       | t3
+	 * add  vrf4 --> table 101        shared_tables = 2       v t4
+	 *
+	 * shared_tables is a "step function" (or "staircase function")
+	 * and it is increased by one when the second vrf is associated to a
+	 * table.
+	 *
+	 * at t2, vrf0 and vrf2 are bound to table 100: shared_tables = 1.
+	 *
+	 * at t3, another dev (vrf3) is bound to the same table 100 but the
+	 * value of shared_tables is still 1.
+	 * This means that no matter how many new vrfs will register on the
+	 * table 100, the shared_tables will not increase (considering only
+	 * table 100).
+	 *
+	 * at t4, vrf4 is bound to table 101, and shared_tables = 2.
+	 *
+	 * Looking at the value of shared_tables we can immediately know if
+	 * the strict_mode can or cannot be enforced. Indeed, strict_mode
+	 * can be enforced iff shared_tables = 0.
+	 *
+	 * Conversely, shared_tables is decreased when a vrf is de-associated
+	 * from a table with exactly two associated vrfs.
+	 */
+	u32 shared_tables;
+
+	bool strict_mode;
+};
+
+struct vrf_map_elem {
+	struct hlist_node hnode;
+	struct list_head vrf_list;  /* VRFs registered to this table */
+
+	u32 table_id;
+	int users;
+	int ifindex;
+};
+
 static unsigned int vrf_net_id;
 
+/* per netns vrf data */
+struct netns_vrf {
+	/* protected by rtnl lock */
+	bool add_fib_rules;
+
+	struct vrf_map vmap;
+};
+
 struct net_vrf {
 	struct rtable __rcu	*rth;
 	struct rt6_info	__rcu	*rt6;
@@ -48,6 +112,9 @@ struct net_vrf {
 	struct fib6_table	*fib6_table;
 #endif
 	u32                     tb_id;
+
+	struct list_head	me_list;   /* entry in vrf_map_elem */
+	int			ifindex;
 };
 
 struct pcpu_dstats {
@@ -103,6 +170,173 @@ static void vrf_get_stats64(struct net_device *dev,
 	}
 }
 
+static struct vrf_map *netns_vrf_map(struct net *net)
+{
+	struct netns_vrf *nn_vrf = net_generic(net, vrf_net_id);
+
+	return &nn_vrf->vmap;
+}
+
+static struct vrf_map *netns_vrf_map_by_dev(struct net_device *dev)
+{
+	return netns_vrf_map(dev_net(dev));
+}
+
+static struct vrf_map_elem *vrf_map_elem_alloc(gfp_t flags)
+{
+	struct vrf_map_elem *me;
+
+	me = kmalloc(sizeof(*me), flags);
+	if (!me)
+		return NULL;
+
+	return me;
+}
+
+static void vrf_map_elem_free(struct vrf_map_elem *me)
+{
+	kfree(me);
+}
+
+static void vrf_map_elem_init(struct vrf_map_elem *me, int table_id,
+			      int ifindex, int users)
+{
+	me->table_id = table_id;
+	me->ifindex = ifindex;
+	me->users = users;
+	INIT_LIST_HEAD(&me->vrf_list);
+}
+
+static struct vrf_map_elem *vrf_map_lookup_elem(struct vrf_map *vmap,
+						u32 table_id)
+{
+	struct vrf_map_elem *me;
+	u32 key;
+
+	key = jhash_1word(table_id, HASH_INITVAL);
+	hash_for_each_possible(vmap->ht, me, hnode, key) {
+		if (me->table_id == table_id)
+			return me;
+	}
+
+	return NULL;
+}
+
+static void vrf_map_add_elem(struct vrf_map *vmap, struct vrf_map_elem *me)
+{
+	u32 table_id = me->table_id;
+	u32 key;
+
+	key = jhash_1word(table_id, HASH_INITVAL);
+	hash_add(vmap->ht, &me->hnode, key);
+}
+
+static void vrf_map_del_elem(struct vrf_map_elem *me)
+{
+	hash_del(&me->hnode);
+}
+
+static void vrf_map_lock(struct vrf_map *vmap) __acquires(&vmap->vmap_lock)
+{
+	spin_lock(&vmap->vmap_lock);
+}
+
+static void vrf_map_unlock(struct vrf_map *vmap) __releases(&vmap->vmap_lock)
+{
+	spin_unlock(&vmap->vmap_lock);
+}
+
+/* called with rtnl lock held */
+static int
+vrf_map_register_dev(struct net_device *dev, struct netlink_ext_ack *extack)
+{
+	struct vrf_map *vmap = netns_vrf_map_by_dev(dev);
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct vrf_map_elem *new_me, *me;
+	u32 table_id = vrf->tb_id;
+	bool free_new_me = false;
+	int users;
+	int res;
+
+	/* we pre-allocate elements used in the spin-locked section (so that we
+	 * keep the spinlock as short as possibile).
+	 */
+	new_me = vrf_map_elem_alloc(GFP_KERNEL);
+	if (!new_me)
+		return -ENOMEM;
+
+	vrf_map_elem_init(new_me, table_id, dev->ifindex, 0);
+
+	vrf_map_lock(vmap);
+
+	me = vrf_map_lookup_elem(vmap, table_id);
+	if (!me) {
+		me = new_me;
+		vrf_map_add_elem(vmap, me);
+		goto link_vrf;
+	}
+
+	/* we already have an entry in the vrf_map, so it means there is (at
+	 * least) a vrf registered on the specific table.
+	 */
+	free_new_me = true;
+	if (vmap->strict_mode) {
+		/* vrfs cannot share the same table */
+		NL_SET_ERR_MSG(extack, "Table is used by another VRF");
+		res = -EBUSY;
+		goto unlock;
+	}
+
+link_vrf:
+	users = ++me->users;
+	if (users == 2)
+		++vmap->shared_tables;
+
+	list_add(&vrf->me_list, &me->vrf_list);
+
+	res = 0;
+
+unlock:
+	vrf_map_unlock(vmap);
+
+	/* clean-up, if needed */
+	if (free_new_me)
+		vrf_map_elem_free(new_me);
+
+	return res;
+}
+
+/* called with rtnl lock held */
+static void vrf_map_unregister_dev(struct net_device *dev)
+{
+	struct vrf_map *vmap = netns_vrf_map_by_dev(dev);
+	struct net_vrf *vrf = netdev_priv(dev);
+	u32 table_id = vrf->tb_id;
+	struct vrf_map_elem *me;
+	int users;
+
+	vrf_map_lock(vmap);
+
+	me = vrf_map_lookup_elem(vmap, table_id);
+	if (!me)
+		goto unlock;
+
+	list_del(&vrf->me_list);
+
+	users = --me->users;
+	if (users == 1) {
+		--vmap->shared_tables;
+	} else if (users == 0) {
+		vrf_map_del_elem(me);
+
+		/* no one will refer to this element anymore */
+		vrf_map_elem_free(me);
+	}
+
+unlock:
+	vrf_map_unlock(vmap);
+}
+
 /* by default VRF devices do not have a qdisc and are expected
  * to be created with only a single queue.
  */
@@ -1319,6 +1553,8 @@ static void vrf_dellink(struct net_device *dev, struct list_head *head)
 	netdev_for_each_lower_dev(dev, port_dev, iter)
 		vrf_del_slave(dev, port_dev);
 
+	vrf_map_unregister_dev(dev);
+
 	unregister_netdevice_queue(dev, head);
 }
 
@@ -1327,6 +1563,7 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev,
 		       struct netlink_ext_ack *extack)
 {
 	struct net_vrf *vrf = netdev_priv(dev);
+	struct netns_vrf *nn_vrf;
 	bool *add_fib_rules;
 	struct net *net;
 	int err;
@@ -1349,11 +1586,26 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev,
 	if (err)
 		goto out;
 
+	/* mapping between table_id and vrf;
+	 * note: such binding could not be done in the dev init function
+	 * because dev->ifindex id is not available yet.
+	 */
+	vrf->ifindex = dev->ifindex;
+
+	err = vrf_map_register_dev(dev, extack);
+	if (err) {
+		unregister_netdevice(dev);
+		goto out;
+	}
+
 	net = dev_net(dev);
-	add_fib_rules = net_generic(net, vrf_net_id);
+	nn_vrf = net_generic(net, vrf_net_id);
+
+	add_fib_rules = &nn_vrf->add_fib_rules;
 	if (*add_fib_rules) {
 		err = vrf_add_fib_rules(dev);
 		if (err) {
+			vrf_map_unregister_dev(dev);
 			unregister_netdevice(dev);
 			goto out;
 		}
@@ -1440,12 +1692,23 @@ static struct notifier_block vrf_notifier_block __read_mostly = {
 	.notifier_call = vrf_device_event,
 };
 
+static int vrf_map_init(struct vrf_map *vmap)
+{
+	spin_lock_init(&vmap->vmap_lock);
+	hash_init(vmap->ht);
+
+	vmap->strict_mode = false;
+
+	return 0;
+}
+
 /* Initialize per network namespace state */
 static int __net_init vrf_netns_init(struct net *net)
 {
-	bool *add_fib_rules = net_generic(net, vrf_net_id);
+	struct netns_vrf *nn_vrf = net_generic(net, vrf_net_id);
 
-	*add_fib_rules = true;
+	nn_vrf->add_fib_rules = true;
+	vrf_map_init(&nn_vrf->vmap);
 
 	return 0;
 }
@@ -1453,7 +1716,7 @@ static int __net_init vrf_netns_init(struct net *net)
 static struct pernet_operations vrf_net_ops __net_initdata = {
 	.init = vrf_netns_init,
 	.id   = &vrf_net_id,
-	.size = sizeof(bool),
+	.size = sizeof(struct netns_vrf),
 };
 
 static int __init vrf_init_module(void)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net] rxrpc: Fix notification call on completion of discarded calls
From: David Howells @ 2020-06-19 22:38 UTC (permalink / raw)
  To: netdev; +Cc: syzbot+d3eccef36ddbd02713e9, dhowells, linux-afs, linux-kernel

When preallocated service calls are being discarded, they're passed to
->discard_new_call() to have the caller clean up any attached higher-layer
preallocated pieces before being marked completed.  However, the act of
marking them completed now invokes the call's notification function - which
causes a problem because that function might assume that the previously
freed pieces of memory are still there.

Fix this by setting a dummy notification function on the socket after
calling ->discard_new_call().

This results in the following kasan message when the kafs module is
removed.

==================================================================
BUG: KASAN: use-after-free in afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707
Write of size 1 at addr ffff8880946c39e4 by task kworker/u4:1/21

CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.8.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
 afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707
 rxrpc_notify_socket+0x1db/0x5d0 net/rxrpc/recvmsg.c:40
 __rxrpc_set_call_completion.part.0+0x172/0x410 net/rxrpc/recvmsg.c:76
 __rxrpc_call_completed net/rxrpc/recvmsg.c:112 [inline]
 rxrpc_call_completed+0xca/0xf0 net/rxrpc/recvmsg.c:111
 rxrpc_discard_prealloc+0x781/0xab0 net/rxrpc/call_accept.c:233
 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245
 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110
 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155
 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603
 process_one_work+0x965/0x1690 kernel/workqueue.c:2269
 worker_thread+0x96/0xe10 kernel/workqueue.c:2415
 kthread+0x3b5/0x4a0 kernel/kthread.c:291
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

Allocated by task 6820:
 save_stack+0x1b/0x40 mm/kasan/common.c:48
 set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc mm/kasan/common.c:494 [inline]
 __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:467
 kmem_cache_alloc_trace+0x153/0x7d0 mm/slab.c:3551
 kmalloc include/linux/slab.h:555 [inline]
 kzalloc include/linux/slab.h:669 [inline]
 afs_alloc_call+0x55/0x630 fs/afs/rxrpc.c:141
 afs_charge_preallocation+0xe9/0x2d0 fs/afs/rxrpc.c:757
 afs_open_socket+0x292/0x360 fs/afs/rxrpc.c:92
 afs_net_init+0xa6c/0xe30 fs/afs/main.c:125
 ops_init+0xaf/0x420 net/core/net_namespace.c:151
 setup_net+0x2de/0x860 net/core/net_namespace.c:341
 copy_net_ns+0x293/0x590 net/core/net_namespace.c:482
 create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:231
 ksys_unshare+0x43d/0x8e0 kernel/fork.c:2983
 __do_sys_unshare kernel/fork.c:3051 [inline]
 __se_sys_unshare kernel/fork.c:3049 [inline]
 __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3049
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 21:
 save_stack+0x1b/0x40 mm/kasan/common.c:48
 set_track mm/kasan/common.c:56 [inline]
 kasan_set_free_info mm/kasan/common.c:316 [inline]
 __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:455
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x109/0x2b0 mm/slab.c:3757
 afs_put_call+0x585/0xa40 fs/afs/rxrpc.c:190
 rxrpc_discard_prealloc+0x764/0xab0 net/rxrpc/call_accept.c:230
 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245
 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110
 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155
 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603
 process_one_work+0x965/0x1690 kernel/workqueue.c:2269
 worker_thread+0x96/0xe10 kernel/workqueue.c:2415
 kthread+0x3b5/0x4a0 kernel/kthread.c:291
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

The buggy address belongs to the object at ffff8880946c3800
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 484 bytes inside of
 1024-byte region [ffff8880946c3800, ffff8880946c3c00)
The buggy address belongs to the page:
page:ffffea000251b0c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0002546508 ffffea00024fa248 ffff8880aa000c40
raw: 0000000000000000 ffff8880946c3000 0000000100000002 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880946c3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880946c3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880946c3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
 ffff8880946c3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880946c3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Reported-by: syzbot+d3eccef36ddbd02713e9@syzkaller.appspotmail.com
Fixes: 5ac0d62226a0 ("rxrpc: Fix missing notification")
Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/call_accept.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index b7611cc159e5..032ed76c0166 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -22,6 +22,11 @@
 #include <net/ip.h>
 #include "ar-internal.h"
 
+static void rxrpc_dummy_notify(struct sock *sk, struct rxrpc_call *call,
+			       unsigned long user_call_ID)
+{
+}
+
 /*
  * Preallocate a single service call, connection and peer and, if possible,
  * give them a user ID and attach the user's side of the ID to them.
@@ -228,6 +233,8 @@ void rxrpc_discard_prealloc(struct rxrpc_sock *rx)
 		if (rx->discard_new_call) {
 			_debug("discard %lx", call->user_call_ID);
 			rx->discard_new_call(call, call->user_call_ID);
+			if (call->notify_rx)
+				call->notify_rx = rxrpc_dummy_notify;
 			rxrpc_put_call(call, rxrpc_call_put_kernel);
 		}
 		rxrpc_call_completed(call);



^ permalink raw reply related

* Re: Question on DSA switches, IGMP forwarding and switchdev
From: Andrew Lunn @ 2020-06-19 22:36 UTC (permalink / raw)
  To: Jason Cobham
  Cc: Daniel Mack, netdev@vger.kernel.org, Ido Schimmel, Jiri Pirko,
	Ivan Vecera, Florian Fainelli
In-Reply-To: <72f92622c69143b0880125dfe9f9a955@questertangent.com>

> I've run into the same issue. To resolve it,  In my case, in the same file, I've had to send all IGMP control traffic to the CPU:
> 	skb->offload_fwd_mark = 1;
> 	switch (ih->type) {
> 		case IGMP_HOST_MEMBERSHIP_REPORT:
> 		case IGMPV2_HOST_MEMBERSHIP_REPORT:
> 		case IGMPV3_HOST_MEMBERSHIP_REPORT:
> 		case IGMP_HOST_MEMBERSHIP_QUERY:
> 		case IGMP_HOST_LEAVE_MESSAGE:
> 			skb->offload_fwd_mark = 0;
> 		break;
> 	}
> 
> I'd be interested if there is a better way.

It might depend on the switch generation, but i think some switches
indicate the sort of packet in the DSA header. For 6390, Octet 3 of
the header, bits 3-5 contains a code.

0=BDPU
1=Frame2Reg
2=IGMP/MLD
3=Policy
4=ARP Mirror
5=Policy Mirror

We can look at these bits and not set skb->offload_fwd_mark depending
on its value.

The 6352 family has the same bits. 6161 has a few less bits, but does
have IGMP/MLD. I don't know about the 6085. Do you have the datasheet?

     Andrew

^ permalink raw reply

* RE: Question on DSA switches, IGMP forwarding and switchdev
From: Jason Cobham @ 2020-06-19 22:05 UTC (permalink / raw)
  To: 'Andrew Lunn', Daniel Mack
  Cc: netdev@vger.kernel.org, Ido Schimmel, Jiri Pirko, Ivan Vecera,
	Florian Fainelli
In-Reply-To: <20200619215817.GN279339@lunn.ch>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Andrew Lunn
> Sent: Friday, June 19, 2020 2:58 PM
> To: Daniel Mack
> Cc: netdev@vger.kernel.org; Ido Schimmel; Jiri Pirko; Ivan Vecera; Florian
> Fainelli
> Subject: Re: Question on DSA switches, IGMP forwarding and switchdev
> 
> On Fri, Jun 19, 2020 at 11:31:04PM +0200, Daniel Mack wrote:
> > Hi,
> >
> > I'm working on a custom board featuring a Marvell mv88e6085 Ethernet
> > switch controlled by the Linux DSA driver, and I'm facing an issue with
> > IGMP packet flows.
> >
> > Consider two Ethernet stations, each connected to the switch on a
> > dedicated port. A Linux bridge combines the two ports. In my setup, I
> > need these two stations to send and receive multicast traffic, with IGMP
> > snooping enabled.
> >
> > When an IGMP query enters the switch, it is redirected to the CPU port
> > as all 'external' ports are configured for IGMP/MLP snooping by the
> > driver. The issue that I'm seeing is that the Linux bridge does not
> > forward the IGMP frames to any other port, no matter whether the bridge
> > is in snooping mode or not. This needs to happen however, otherwise the
> > stations will not see IGMP queries, and unsolicited membership reports
> > are not being transferred either.
> 
> Hi Daniel
> 
> I think all the testing i've done in this area i've had the bridge
> acting as the IGMP queirer. Hence it has replied to the query, rather
> than forward it out other ports.
> 
> So this could be a bug.
> 
> > I've traced these frames through the bridge code and figured forwarding
> > fails in should_deliver() in net/bridge/br_forward.c because
> > nbp_switchdev_allowed_egress() denies it due to the fact that the frame
> > has already been forwarded by the same parent device.
> 
> To get this far, has the bridge determined it is not the elected
> querier?  I guess it must of done. Otherwise it would not be
> forwarding it.
> 
> > So my question now is how to fix that. Would the DSA driver need to mark
> > the ports as independent somehow?
> 
> The problem here is:
> 
> https://elixir.bootlin.com/linux/v5.8-rc1/source/net/dsa/tag_edsa.c#L159
> 
> Setting offload_fwd_mark means the switch has forwarded the frame as
> needed to other ports of the switch. If the frame is an IGMP query
> frame, and the bridge is not the elected quierer, i guess we need to
> set this false? Or we need an FDB in the switch to forward it. What
> group address is being used?
> 
>     Andrew

I've run into the same issue. To resolve it,  In my case, in the same file, I've had to send all IGMP control traffic to the CPU:
	skb->offload_fwd_mark = 1;
	switch (ih->type) {
		case IGMP_HOST_MEMBERSHIP_REPORT:
		case IGMPV2_HOST_MEMBERSHIP_REPORT:
		case IGMPV3_HOST_MEMBERSHIP_REPORT:
		case IGMP_HOST_MEMBERSHIP_QUERY:
		case IGMP_HOST_LEAVE_MESSAGE:
			skb->offload_fwd_mark = 0;
		break;
	}

I'd be interested if there is a better way.

Jason

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox