Netdev List


* Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing
From: David Miller @ 2018-04-24  3:21 UTC (permalink / raw)
  To: roopa; +Cc: netdev, dsa
In-Reply-To: <1524539321-9103-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Mon, 23 Apr 2018 20:08:41 -0700

> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> Fixes: b16fb418b1bf ("net: fib_rules: add extack support")
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Applied.

It would be nice to get rid of these if() conditionals dangling
around ifdef blocks.  They are quite error prone.


* Re: [PATCH 1/1] Revert "rds: ib: add error handle"
From: santosh.shilimkar @ 2018-04-24  3:27 UTC (permalink / raw)
  To: Zhu Yanjun, linux-rdma, rds-devel; +Cc: davem, netdev
In-Reply-To: <1524533941-4072-1-git-send-email-yanjun.zhu@oracle.com>

On 4/23/18 6:39 PM, Zhu Yanjun wrote:
> This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
> 
> After long time discussion and investigations, it seems that there
> is no mem leak. So this patch is reverted.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
Well, your fix was not for any leaks, just proper labels for
graceful exits. I don't know which long discussion you are
referring to, but there is no need to revert this change
unless you see an issue with it.

Regards,
Santosh


* Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing
From: David Ahern @ 2018-04-24  3:43 UTC (permalink / raw)
  To: David Miller, roopa; +Cc: netdev
In-Reply-To: <20180423.232158.1633374650832506893.davem@davemloft.net>

On 4/23/18 9:21 PM, David Miller wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Mon, 23 Apr 2018 20:08:41 -0700
> 
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> Fixes: b16fb418b1bf ("net: fib_rules: add extack support")
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> Applied.
> 
> It would be nice to get rid of these if() conditionals dangling
> around ifdef blocks.  They are quite error prone.
> 

I'll send a patch. I'd prefer a different message when NET_L3_MASTER_DEV
is not enabled.
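
One way to read the two points in this sub-thread (no if() split across
an #ifdef, and a distinct message when the option is off) is to move the
conditional code into a helper that has a stub variant. The following is
a userspace sketch only, not the patch that was sent:
DEMO_L3_MASTER_DEV, struct demo_rule, demo_parse_l3mdev() and
demo_new_rule() are hypothetical names standing in for the kernel option
and the fib_rules code, and both error strings are made up for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a Kconfig option; build with 0 to get the stub. */
#ifndef DEMO_L3_MASTER_DEV
#define DEMO_L3_MASTER_DEV 1
#endif

struct demo_rule {
	unsigned char l3mdev;
};

#if DEMO_L3_MASTER_DEV
/* Option built in: accept only the value 1. */
static int demo_parse_l3mdev(const unsigned char *attr, struct demo_rule *rule,
			     const char **errmsg)
{
	rule->l3mdev = *attr;
	if (rule->l3mdev != 1) {
		*errmsg = "Invalid l3mdev attribute";
		return -EINVAL;
	}
	return 0;
}
#else
/* Option compiled out: reject the attribute with its own message. */
static int demo_parse_l3mdev(const unsigned char *attr, struct demo_rule *rule,
			     const char **errmsg)
{
	(void)attr;
	(void)rule;
	*errmsg = "l3mdev support is not enabled in this build";
	return -EOPNOTSUPP;
}
#endif

/* The caller keeps one unconditional if(); no #ifdef splits a statement. */
static int demo_new_rule(const unsigned char *l3mdev_attr,
			 struct demo_rule *rule, const char **errmsg)
{
	assert(rule != NULL && errmsg != NULL);
	*errmsg = NULL;
	rule->l3mdev = 0;
	if (l3mdev_attr)
		return demo_parse_l3mdev(l3mdev_attr, rule, errmsg);
	return 0;
}
```

With this shape the preprocessor selects whole functions, so the braces
and conditions in the caller look the same in every configuration.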


* Re: [PATCH 2/2] alx: add disable_wol paramenter
From: AceLan Kao @ 2018-04-24  3:45 UTC (permalink / raw)
  To: David Miller
  Cc: Andrew Lunn, James Cliburn, Chris Snook, rakesh, netdev,
	Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <CAFv23Q=QYastpgLxHiwr_PSdoeWfEjiNp6x2F9kM5EYhRee2hA@mail.gmail.com>

Hi,

May I know the final decision of this patch?
Thanks.

Best regards,
AceLan Kao.

2018-04-10 10:40 GMT+08:00 AceLan Kao <acelan.kao@canonical.com>:
> The problem is I don't have a machine with that wakeup issue, and I
> need WoL feature.
> Instead of spreading "alx with WoL" dkms package everywhere, I would
> like to see it's supported in the driver and is disabled by default.
>
> Moreover, the wakeup issue may come from old Atheros chips, or result
> from buggy BIOS.
> With WoL removed from the driver, no one will report issues
> about it, and we will never get a chance to find a fix.
>
> Adding this feature back is not papering over the issue; it
> gives people a chance to examine the feature.
>
> 2018-04-09 22:50 GMT+08:00 David Miller <davem@davemloft.net>:
>> From: Andrew Lunn <andrew@lunn.ch>
>> Date: Mon, 9 Apr 2018 14:39:10 +0200
>>
>>> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
>>>> The WoL feature was reported broken and will lead to
>>>> the system resume immediately after suspending.
>>>> This symptom is not happening on every system, so adding
>>>> disable_wol option and disable WoL by default to prevent the issue from
>>>> happening again.
>>>
>>>>  const char alx_drv_name[] = "alx";
>>>>
>>>> +/* disable WoL by default */
>>>> +bool disable_wol = 1;
>>>> +module_param(disable_wol, bool, 0);
>>>> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
>>>> +
>>>
>>> Hi AceLan
>>>
>>> This seems like you are papering over the cracks. And module
>>> parameters are not liked.
>>>
>>> Please try to find the real problem.
>>
>> Agreed.


* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Matthew Wilcox @ 2018-04-24  3:46 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, netdev, linux-kernel, mst, jasowang, virtualization,
	dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232003100.2299@file01.intranet.prod.int.rdu2.redhat.com>

On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.

Well now, this brings up another item for the collective TODO list --
implement redzone checks for vmalloc.  Unless this is something already
taken care of by kasan or similar.
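
For readers unfamiliar with the term, a redzone check pads each
allocation with guard bytes of a known pattern and later verifies that
the pattern is still intact. Below is a minimal userspace sketch of the
idea, assuming malloc() as the backing allocator; the demo_* names, the
guard size and the pattern byte are illustrative choices and say nothing
about how a vmalloc implementation would do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_REDZONE_SIZE 16
#define DEMO_REDZONE_BYTE 0xbb

/* Allocate size bytes with a guard region before and after the payload. */
static void *demo_redzone_alloc(size_t size)
{
	unsigned char *raw = malloc(size + 2 * DEMO_REDZONE_SIZE);

	if (!raw)
		return NULL;
	memset(raw, DEMO_REDZONE_BYTE, DEMO_REDZONE_SIZE);
	memset(raw + DEMO_REDZONE_SIZE + size, DEMO_REDZONE_BYTE,
	       DEMO_REDZONE_SIZE);
	return raw + DEMO_REDZONE_SIZE;
}

/* Return 1 if both guard regions still hold the pattern, 0 otherwise. */
static int demo_redzone_intact(const void *ptr, size_t size)
{
	const unsigned char *raw;
	size_t i;

	assert(ptr != NULL);
	raw = (const unsigned char *)ptr - DEMO_REDZONE_SIZE;
	for (i = 0; i < DEMO_REDZONE_SIZE; i++) {
		if (raw[i] != DEMO_REDZONE_BYTE ||
		    raw[DEMO_REDZONE_SIZE + size + i] != DEMO_REDZONE_BYTE)
			return 0;
	}
	return 1;
}

/* Release a buffer obtained from demo_redzone_alloc(). */
static void demo_redzone_free(void *ptr)
{
	if (ptr)
		free((unsigned char *)ptr - DEMO_REDZONE_SIZE);
}
```

A one-byte overflow past the payload lands in the trailing guard region,
so a later demo_redzone_intact() call reports it instead of the write
silently corrupting a neighbouring object.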


* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Eric Dumazet @ 2018-04-24  4:30 UTC (permalink / raw)
  To: Andy Lutomirski, Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, linux-kernel,
	Soheil Hassas Yeganeh, linux-mm, Linux API
In-Reply-To: <CALCETrWOLU+P_jVpuOUQT2e_5ZShAP3OM0yJZMbC=pv5La9Cvg@mail.gmail.com>



On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> 
>>> I would suggest that you rework the interface a bit.  First a user would call mmap() on a TCP socket, which would create an empty VMA.  (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>>   void *address;
>>>   size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what is the proper API for that.
>> Where the TCP VMA(s) would be stored ?
>> In TCP socket, or MM layer ?
> 
> MM layer.  I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
> 
> len = get_user(...);
> 
> down_read(&current->mm->mmap_sem);
> 
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
>   return -EFAULT;
> 
> /* This is buggy.  You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
>   return -EINVAL;
> 
> if (len > vma->vm_end - start)
>   return -EFAULT;  /* too big a request. */
> 
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right.  Instead you'd just stop and tell the caller that you
> didn't map the full requested size.  You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
> 
> You also need to provide some way for user programs to signal that
> they're done with the page in question.  MADV_DONTNEED might be
> sufficient.
> 
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable.  And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
> 
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers.  The latter is awful, especially in
> multithreaded programs.  In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes.  (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on.  I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
> 
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used by a multithreaded program.
> 
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?
> 
> Exactly.  If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
> 
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)
> 
> There's nothing to account.  It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
> 

Thanks Andy, I am working on all this, and initial patch looks sane enough.

 include/uapi/linux/tcp.h |    7 +
 net/ipv4/tcp.c           |  175 +++++++++++++++++++++++------------------------
 2 files changed, 93 insertions(+), 89 deletions(-)


I will test all this before sending for review asap.

( I have not done the compat code yet, this can be done later I guess)
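
The range check and the "map what you can" behaviour discussed in the
quoted proposal can be modelled in userspace roughly as follows. This is
a sketch under stated assumptions, not the patch mentioned above:
struct demo_vma, demo_check_range(), demo_mappable_len() and
DEMO_PAGE_SIZE are hypothetical, and the VMA is passed in directly
instead of being looked up and locked.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Userspace stand-in for a VMA: the half-open range [start, end). */
struct demo_vma {
	uintptr_t start;
	uintptr_t end;
};

/* Check that [addr, addr + len) lies entirely inside one VMA. */
static int demo_check_range(const struct demo_vma *vma, uintptr_t addr,
			    size_t len)
{
	assert(!vma || vma->start <= vma->end);
	if (!vma || addr < vma->start || addr >= vma->end)
		return -EFAULT;
	if (len > vma->end - addr)
		return -EFAULT;	/* too big a request */
	return 0;
}

/* Map as many whole pages as are usable instead of failing the whole call. */
static size_t demo_mappable_len(size_t requested, size_t aligned_available)
{
	size_t n = requested < aligned_available ? requested : aligned_available;

	return n - (n % DEMO_PAGE_SIZE);
}
```

In the 10MB example from the thread, a request for 10MB of which only
the first 9MB is usable would report 9MB back to the caller rather than
an error.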


* Re: [PATCH net-next] net: init sk_cookie for inet socket
From: Yafang Shao @ 2018-04-24  4:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Alexei Starovoitov, netdev, LKML
In-Reply-To: <788ce3f1-6534-5c2e-1870-5ebd8ea4ae7f@gmail.com>

On Tue, Apr 24, 2018 at 12:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 04/23/2018 08:58 AM, David Miller wrote:
>> From: Yafang Shao <laoar.shao@gmail.com>
>> Date: Sun, 22 Apr 2018 21:50:04 +0800
>>
>>> With sk_cookie we can identify a socket, which is very helpful for
>>> tracing and statistics, e.g. TCP tracepoints and eBPF.
>>> So we'd better init it by default for inet sockets.
>>> When using it, we just need to call atomic64_read(&sk->sk_cookie).
>>>
>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>
>> Applied, thank you.
>>
>
> This is adding yet another atomic_inc on a global cache line.
>

That's a trade-off.

> Most applications do not need the cookie being ever set.
>
> The existing mechanism was fine. Set it on demand.

There are some drawbacks in the existing mechanism.
- We have to set net->cookie_gen and then sk->sk_cookie when we
want to get the sk_cookie, which is a little expensive as well.
  After that change, sock_gen_cookie() could be replaced by
atomic64_read(&sk->sk_cookie) in most places.

- If the application wants to get the sk_cookie, it must set it first.
   What if the application doesn't have permission to write?
   Furthermore, maybe it is a security concern?


Thanks
Yafang
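
The trade-off argued in this message can be made concrete with a
userspace model using C11 atomics. This is only a sketch: demo_sock,
demo_cookie_gen, demo_gen_cookie() and demo_init_cookie_eagerly() are
hypothetical names, and a single global counter stands in for the
per-namespace generator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Shared generator: every increment touches the same cache line. */
static _Atomic uint64_t demo_cookie_gen;

struct demo_sock {
	_Atomic uint64_t cookie;	/* 0 means "not assigned yet" */
};

/* On-demand variant: the shared counter is touched only on first use. */
static uint64_t demo_gen_cookie(struct demo_sock *sk)
{
	assert(sk != NULL);
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		res = atomic_fetch_add(&demo_cookie_gen, 1) + 1;
		/* If another thread won the race, the next loop reads its value. */
		atomic_compare_exchange_strong(&sk->cookie, &expected, res);
	}
}

/* Eager variant: one shared-counter increment per socket creation. */
static void demo_init_cookie_eagerly(struct demo_sock *sk)
{
	assert(sk != NULL);
	atomic_store(&sk->cookie, atomic_fetch_add(&demo_cookie_gen, 1) + 1);
}
```

The eager variant makes later reads a plain atomic load, at the price of
an increment on the shared counter for every socket, including sockets
whose cookie is never read.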


* Re: [PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters
From: Roopa Prabhu @ 2018-04-24  4:41 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, David Miller, Vidya Sagar Ravipati, Dustin Byford
In-Reply-To: <20180423225138.8238-1-f.fainelli@gmail.com>

On Mon, Apr 23, 2018 at 3:51 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> While adding support for ethtool::get_fecparam and set_fecparam, kernel
> doc for these functions was missed, add those.
>
> Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Thanks Florian.


* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24  5:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, Jiri Pirko, Sridhar Samudrala, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang
In-Reply-To: <20180424043042-mutt-send-email-mst@kernel.org>

On Tue, 24 Apr 2018 04:42:22 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> > On Mon, 23 Apr 2018 12:44:39 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >   
> > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:  
> > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:    
> > > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>    
> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:    
> > > >> > > > >
> > > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > > >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> > > >> > > > >mode, and we really can't have a new model; or do backport.
> > > >> > > > >
> > > >> > > > >Plus, DPDK is now dependent on existing model.    
> > > >> > > >
> > > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.    
> > > >> > >
> > > >> > > The network device model is a userspace API, and DPDK is a userspace application.    
> > > >> >
> > > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > > >> > AFAIK it's normally banging device registers directly.
> > > >> >    
> > > >> > > You can't go breaking userspace even if you don't like the application.    
> > > >> >
> > > >> > Could you please explain how is the proposed patchset breaking
> > > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > > >> > API at all.
> > > >> >    
> > > >>
> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > > >> to look for Linux netvsc device and the paired VF device and setup the
> > > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> > > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > > >> VF device.
> > > >>
> > > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > > >> or start hiding devices from userspace.    
> > > >
> > > > Okay so how does the existing patch break that? IIUC does not go to
> > > > a 3 device model since netvsc calls failover_register directly.
> > > >    
> > > >> Also, I am working on associating netvsc and VF device based on serial number
> > > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > > >> sense for Linux and Windows to use the same mechanism if possible.    
> > > >
> > > > Maybe we should support same for virtio ...
> > > > Which serial do you mean? From vpd?
> > > >
> > > > I guess you will want to keep supporting MAC for old hypervisors?  
> > 
> > The serial number has always been in the hypervisor since original support of SR-IOV
> > in WS2008.  So no backward compatibility special cases would be needed.  
> 
> Is that a serial from real hardware or a hypervisor thing?
> 
> 

It is a hypervisor thing in the PCI hyperv code and the hyperv Netvsc interface.
It might also be in the PCI spec, but the value in Hyper-V is being generated by the host.


* Re: [Patch nf] ipvs: initialize tbl->entries after allocation
From: Julian Anastasov @ 2018-04-24  5:16 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, lvs-devel, netfilter-devel, Simon Horman,
	Pablo Neira Ayuso
In-Reply-To: <20180423205341.13142-1-xiyou.wangcong@gmail.com>


	Hello,

On Mon, 23 Apr 2018, Cong Wang wrote:

> tbl->entries is not initialized after kmalloc(), therefore
> causes an uninit-value warning in ip_vs_lblc_check_expire()
> as reported by syzbot.
> 
> Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

	Thanks!

Acked-by: Julian Anastasov <ja@ssi.bg>

> ---
>  net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
> index 92adc04557ed..bc2bc5eebcb8 100644
> --- a/net/netfilter/ipvs/ip_vs_lblcr.c
> +++ b/net/netfilter/ipvs/ip_vs_lblcr.c
> @@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
>  	tbl->counter = 1;
>  	tbl->dead = false;
>  	tbl->svc = svc;
> +	atomic_set(&tbl->entries, 0);
>  
>  	/*
>  	 *    Hook periodic timer for garbage collection
> -- 
> 2.13.0

Regards
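
The class of bug fixed by the patch quoted above can be shown with a
small userspace analogue: memory from a non-zeroing allocator has
indeterminate contents, so every field that is read later needs an
explicit initial value. The sketch assumes malloc() and C11 atomics in
place of kmalloc() and atomic_t; struct demo_table and the demo_*
functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_table {
	atomic_int entries;
	int counter;
	bool dead;
};

/* Allocate a table and give every field a defined starting value. */
static struct demo_table *demo_table_alloc(void)
{
	struct demo_table *tbl = malloc(sizeof(*tbl)); /* contents indeterminate */

	if (!tbl)
		return NULL;
	tbl->counter = 1;
	tbl->dead = false;
	atomic_init(&tbl->entries, 0);	/* analogue of the added atomic_set() */
	return tbl;
}

/* Count a new entry and return the updated total. */
static int demo_table_add_entry(struct demo_table *tbl)
{
	assert(tbl != NULL);
	return atomic_fetch_add(&tbl->entries, 1) + 1;
}
```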


* Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()
From: Julian Anastasov @ 2018-04-24  5:17 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, lvs-devel, netfilter-devel, Simon Horman,
	Pablo Neira Ayuso
In-Reply-To: <20180423210445.18336-1-xiyou.wangcong@gmail.com>


	Hello,

On Mon, 23 Apr 2018, Cong Wang wrote:

> Similarly, tbl->entries is not initialized after kmalloc(),
> therefore causes an uninit-value warning in ip_vs_lblc_check_expire(),
> as reported by syzbot.
> 
> Reported-by: <syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

	Thanks!

Acked-by: Julian Anastasov <ja@ssi.bg>

> ---
>  net/netfilter/ipvs/ip_vs_lblc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
> index 3057e453bf31..83918119ceb8 100644
> --- a/net/netfilter/ipvs/ip_vs_lblc.c
> +++ b/net/netfilter/ipvs/ip_vs_lblc.c
> @@ -371,6 +371,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
>  	tbl->counter = 1;
>  	tbl->dead = false;
>  	tbl->svc = svc;
> +	atomic_set(&tbl->entries, 0);
>  
>  	/*
>  	 *    Hook periodic timer for garbage collection
> -- 
> 2.13.0

Regards


* Re: [PATCH net] sfc: ARFS filter IDs
From: kbuild test robot @ 2018-04-24  5:29 UTC (permalink / raw)
  To: Edward Cree; +Cc: kbuild-all, linux-net-drivers, David Miller, netdev
In-Reply-To: <2ee1ef47-886d-278d-4a8d-234d74e26ad7@solarflare.com>

[-- Attachment #1: Type: text/plain, Size: 2956 bytes --]

Hi Edward,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Edward-Cree/sfc-ARFS-filter-IDs/20180424-080737
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/sfc/farch.c: In function 'efx_farch_filter_rfs_expire_one':
>> drivers/net/ethernet/sfc/farch.c:2938:7: warning: 'rule' may be used uninitialized in this function [-Wmaybe-uninitialized]
       if (rule)
          ^

coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model on line 3041
   drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model on line 3062

vim +/rule +2938 drivers/net/ethernet/sfc/farch.c

  2902	
  2903	bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
  2904					     unsigned int index)
  2905	{
  2906		struct efx_farch_filter_state *state = efx->filter_state;
  2907		struct efx_farch_filter_table *table;
  2908		bool ret = false, force = false;
  2909		u16 arfs_id;
  2910	
  2911		down_write(&state->lock);
  2912		spin_lock_bh(&efx->rps_hash_lock);
  2913		table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP];
  2914		if (test_bit(index, table->used_bitmap) &&
  2915		    table->spec[index].priority == EFX_FILTER_PRI_HINT) {
  2916			struct efx_filter_spec spec;
  2917			struct efx_arfs_rule *rule;
  2918	
  2919			efx_farch_filter_to_gen_spec(&spec, &table->spec[index]);
  2920			if (!efx->rps_hash_table) {
  2921				/* In the absence of the table, we always returned 0 to
  2922				 * ARFS, so use the same to query it.
  2923				 */
  2924				arfs_id = 0;
  2925			} else {
  2926				rule = efx_rps_hash_find(efx, &spec);
  2927				if (!rule) {
  2928					/* ARFS table doesn't know of this filter, remove it */
  2929					force = true;
  2930				} else {
  2931					arfs_id = rule->arfs_id;
  2932					if (!efx_rps_check_rule(rule, index, &force))
  2933						goto out_unlock;
  2934				}
  2935			}
  2936			if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id,
  2937							 flow_id, arfs_id)) {
> 2938				if (rule)
  2939					rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
  2940				efx_rps_hash_del(efx, &spec);
  2941				efx_farch_filter_table_clear_entry(efx, table, index);
  2942				ret = true;
  2943			}
  2944		}
  2945	out_unlock:
  2946		spin_unlock_bh(&efx->rps_hash_lock);
  2947		up_write(&state->lock);
  2948		return ret;
  2949	}
  2950	
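
In the listing above, 'rule' is assigned only on the path where
efx->rps_hash_table is set, yet line 2938 tests it on every path. A
reduced userspace model of that control flow, with one possible remedy
(initialising the pointer to NULL), is sketched below. The names are
hypothetical, the may_expire argument stands in for the
rps_may_expire_flow() result, and nothing here claims to be the fix that
was eventually applied.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FILTER_ID_REMOVING (-1)

struct demo_rule {
	int filter_id;
};

/* Returns 1 if the filter was expired, 0 otherwise. */
static int demo_expire_one(int have_table, struct demo_rule *found,
			   int may_expire)
{
	/* Without "= NULL", the !have_table path reads an indeterminate pointer. */
	struct demo_rule *rule = NULL;
	int force = 0;

	/* Without a table there is nothing that could have been found. */
	assert(have_table || found == NULL);

	if (have_table) {
		rule = found;		/* stands in for the hash lookup */
		if (!rule)
			force = 1;	/* table does not know this filter */
	}
	if (force || may_expire) {
		if (rule)
			rule->filter_id = DEMO_FILTER_ID_REMOVING;
		return 1;
	}
	return 0;
}
```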

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 62891 bytes --]


* VRF: Ingress IPv6 Linklocal/Multicast destined pkt from slave VRF device does not map to Master device socket
From: Sukumar Gopalakrishnan @ 2018-04-24  5:57 UTC (permalink / raw)
  To: netdev

VRF: Ingress IPv6 Linklocal/Multicast pkt from slave VRF device does
not map to Master device socket.

KERNEL VERSION:
================
4.14.28

BUG REPORT:
============
https://bugzilla.kernel.org/show_bug.cgi?id=199409

CONFIGURATION  AND PROBLEM ROOT CAUSE:
========================================

1) Created VRF device(Vrf_258) and enslaved network device(v1_F4246) to this
VRF.

/exos/bin # ip link show v1_F4246
54: v1_F4246: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
master vrf_258 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff

/exos/bin # ip link show vrf_258
14: vrf_258: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc noqueue state
UP mode DEFAULT group default qlen 1000
    link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff

2) Opened PIM protocol raw socket for AF_INET6 family

pim_socket = socket(AF_INET6, SOCK_RAW , IPPROTO_PIM )

3) The PIM user daemon runs one process per VRF, so the RX socket is
bound with SO_BINDTODEVICE to the vrf_258 netdevice.
PIM control packets ingressing on any slave device that belongs to
this master VRF device should be delivered to this socket.

4) Ingress PIM hello control packets with SrcIP =
fe80::204:96ff:fe98:c918 (IPv6 link-local) and DestIP = ff02::0d
(multicast)
are not mapped to the socket bound to vrf_258 and get dropped in the
socket lookup function.

5) inet6_iif() returns v1_F4246's ifindex (54) and inet6_sdif()
returns zero.

__raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb),
inet6_sdif(skb));

sk->sk_bound_dev_if holds vrf_258's ifindex (14), but neither dif (54)
nor sdif (0) matches it, hence no socket is found.

struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
                unsigned short num, const struct in6_addr *loc_addr,
                const struct in6_addr *rmt_addr, int dif, int sdif) {
<snip>
..
if (sk->sk_bound_dev_if &&
                            sk->sk_bound_dev_if != dif &&
                            sk->sk_bound_dev_if != sdif)
..

<snip>

}

6) This problem is seen in the raw, UDP and TCP socket lookup functions
for IPv6 packets destined to link-local or multicast addresses.

7) This issue does not occur for any type of IPv4 address or for IPv6
global unicast addresses.

TEMP FIX:
=========

Get the master device ifindex from skb->dev and pass it to the socket
lookup function for IPv6 link-local/multicast addresses.

ipv6_raw_deliver()
{
int mdif;
..
..
        mdif = (((nexthdr == IPPROTO_PIM || nexthdr == 89 /* IPPROTO_OSPF */ ||
                nexthdr == IPPROTO_ICMPV6 || nexthdr == 112 /*IPPROTO_VRRP*/) &&
                (ipv6_addr_type(daddr) &
                (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL))) ?
                l3mdev_master_ifindex_rcu(skb->dev) : inet6_iif(skb));

        sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, mdif,
inet6_sdif(skb));

...
..
}
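
The mismatch described in step 5 can be reproduced with the bound-device
test from the lookup snippet in isolation. This is a userspace sketch
using the ifindex values from this report; demo_bound_dev_matches() is a
hypothetical helper that only mirrors that one condition and ignores the
address and protocol checks.

```c
#include <assert.h>

/* Returns 1 if a socket bound to bound_dev_if may receive the packet. */
static int demo_bound_dev_matches(int bound_dev_if, int dif, int sdif)
{
	assert(dif >= 0 && sdif >= 0);
	if (bound_dev_if && bound_dev_if != dif && bound_dev_if != sdif)
		return 0;	/* bound to some other device: skip this socket */
	return 1;
}
```

With the socket bound to ifindex 14, a packet presented as dif = 54 and
sdif = 0 is rejected, while passing the master ifindex 14 as dif, as the
temporary fix does, lets the same socket match.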

Regards,
Sukumar


* Re: [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit
From: Pravin Shelar @ 2018-04-24  6:30 UTC (permalink / raw)
  To: Yi-Hung Wei; +Cc: Linux Kernel Network Developers
In-Reply-To: <1524011429-14500-3-git-send-email-yihung.wei@gmail.com>

On Tue, Apr 17, 2018 at 5:30 PM, Yi-Hung Wei <yihung.wei@gmail.com> wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high-level APIs are provided to userspace:
>   - OVS_CT_LIMIT_CMD_SET:
>     * set default connection limit for all zones
>     * set the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_DEL:
>     * remove the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_GET:
>     * get the default connection limit for all zones
>     * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
> ---
>  net/openvswitch/Kconfig     |   3 +-
>  net/openvswitch/conntrack.c | 498 +++++++++++++++++++++++++++++++++++++++++++-
>  net/openvswitch/conntrack.h |   9 +-
>  net/openvswitch/datapath.c  |   7 +-
>  net/openvswitch/datapath.h  |   1 +
>  5 files changed, 512 insertions(+), 6 deletions(-)
>
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index 2650205cdaf9..89da9512ec1e 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -9,7 +9,8 @@ config OPENVSWITCH
>                    (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
>                                      (!NF_NAT || NF_NAT) && \
>                                      (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
> -                                    (!NF_NAT_IPV6 || NF_NAT_IPV6)))
> +                                    (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
> +                                    (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
>         select LIBCRC32C
>         select MPLS
>         select NET_MPLS_GSO
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index c5904f629091..d09b572f72b4 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -17,7 +17,9 @@
>  #include <linux/udp.h>
>  #include <linux/sctp.h>
>  #include <net/ip.h>
> +#include <net/genetlink.h>
>  #include <net/netfilter/nf_conntrack_core.h>
> +#include <net/netfilter/nf_conntrack_count.h>
>  #include <net/netfilter/nf_conntrack_helper.h>
>  #include <net/netfilter/nf_conntrack_labels.h>
>  #include <net/netfilter/nf_conntrack_seqadj.h>
> @@ -76,6 +78,38 @@ struct ovs_conntrack_info {
>  #endif
>  };
>
> +#if    IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
> +#define OVS_CT_LIMIT_UNLIMITED 0
> +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
> +#define CT_LIMIT_HASH_BUCKETS 512
> +
Can you use a static key when the limit is not set?
This would avoid overhead in the datapath when these limits are not used.

> +struct ovs_ct_limit {
> +       /* Elements in ovs_ct_limit_info->limits hash table */
> +       struct hlist_node hlist_node;
> +       struct rcu_head rcu;
> +       u16 zone;
> +       u32 limit;
> +};
> +
...

> +#endif
> +
>  /* Lookup connection and confirm if unconfirmed. */
>  static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>                          const struct ovs_conntrack_info *info,
> @@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>         if (!ct)
>                 return 0;
>
> +#if    IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
> +       err = ovs_ct_check_limit(net, info,
> +                                &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
> +       if (err)
> +               return err;
> +#endif
> +

This could be checked at flow install time, so that only permitted
flows would have the 'ct commit' action; that avoids the per-packet cost
of checking the limit.
The error code returned from ovs_ct_commit() is lost in the datapath, so
it would be hard to debug packet loss when the limit is reached. So
another advantage of checking the limit at flow install is better
traceability: the datapath would return the error to userspace, which
can log the error code.
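
The limit semantics from the patch description (a limit of zero means
unlimited, exceeding the limit yields ENOMEM) combined with the "skip
the check when nothing is configured" idea can be sketched in userspace
as below. All names are hypothetical, and a plain counter stands in for
the static key: a real static key patches the branch at runtime, which
this model does not attempt to show.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_CT_LIMIT_UNLIMITED 0

/* Number of zones with a limit configured; stands in for a static key. */
static unsigned int demo_limits_configured;

/* Fast path first: with no limits configured, nothing else is examined. */
static int demo_check_limit(unsigned int zone_limit, unsigned int connections)
{
	if (!demo_limits_configured)
		return 0;
	if (zone_limit == DEMO_CT_LIMIT_UNLIMITED)
		return 0;
	return connections > zone_limit ? -ENOMEM : 0;
}

/* Update one zone's limit and keep the configured-zones count in sync. */
static void demo_set_limit(unsigned int *zone_limit, unsigned int limit)
{
	assert(zone_limit != NULL);
	if (*zone_limit == DEMO_CT_LIMIT_UNLIMITED &&
	    limit != DEMO_CT_LIMIT_UNLIMITED)
		demo_limits_configured++;
	else if (*zone_limit != DEMO_CT_LIMIT_UNLIMITED &&
		 limit == DEMO_CT_LIMIT_UNLIMITED)
		demo_limits_configured--;
	*zone_limit = limit;
}
```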


* [PATCHv2 net] team: fix netconsole setup over team
From: Xin Long @ 2018-04-24  6:33 UTC (permalink / raw)
  To: network dev; +Cc: davem, Jiri Pirko, Stephen Hemminger, Cong Wang

The same fix in Commit dbe173079ab5 ("bridge: fix netconsole
setup over bridge") is also needed for team driver.

While at it, remove the unnecessary parameter *team from
team_port_enable_netpoll().

v1->v2:
  - fix it in a better way, as does bridge.

Fixes: 0fb52a27a04a ("team: cleanup netpoll clode")
Reported-by: João Avelino Bellomo Filho <jbellomo@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/team/team.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index acbe849..ddb6bf8 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1072,14 +1072,11 @@ static void team_port_leave(struct team *team, struct team_port *port)
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-static int team_port_enable_netpoll(struct team *team, struct team_port *port)
+static int __team_port_enable_netpoll(struct team_port *port)
 {
 	struct netpoll *np;
 	int err;
 
-	if (!team->dev->npinfo)
-		return 0;
-
 	np = kzalloc(sizeof(*np), GFP_KERNEL);
 	if (!np)
 		return -ENOMEM;
@@ -1093,6 +1090,14 @@ static int team_port_enable_netpoll(struct team *team, struct team_port *port)
 	return err;
 }
 
+static int team_port_enable_netpoll(struct team_port *port)
+{
+	if (!port->team->dev->npinfo)
+		return 0;
+
+	return __team_port_enable_netpoll(port);
+}
+
 static void team_port_disable_netpoll(struct team_port *port)
 {
 	struct netpoll *np = port->np;
@@ -1107,7 +1112,7 @@ static void team_port_disable_netpoll(struct team_port *port)
 	kfree(np);
 }
 #else
-static int team_port_enable_netpoll(struct team *team, struct team_port *port)
+static int team_port_enable_netpoll(struct team_port *port)
 {
 	return 0;
 }
@@ -1221,7 +1226,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_vids_add;
 	}
 
-	err = team_port_enable_netpoll(team, port);
+	err = team_port_enable_netpoll(port);
 	if (err) {
 		netdev_err(dev, "Failed to enable netpoll on device %s\n",
 			   portname);
@@ -1918,7 +1923,7 @@ static int team_netpoll_setup(struct net_device *dev,
 
 	mutex_lock(&team->lock);
 	list_for_each_entry(port, &team->port_list, list) {
-		err = team_port_enable_netpoll(team, port);
+		err = __team_port_enable_netpoll(port);
 		if (err) {
 			__team_netpoll_cleanup(team);
 			break;
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Pravin Shelar @ 2018-04-24  6:34 UTC (permalink / raw)
  To: Yi-Hung Wei
  Cc: David Miller, Linux Kernel Network Developers, Florian Westphal
In-Reply-To: <CAG1aQh+_KRCOSscVXAGc-2+09pB_WcMm4q0p9=Ewqr-YCT=FBA@mail.gmail.com>

On Mon, Apr 23, 2018 at 2:19 PM, Yi-Hung Wei <yihung.wei@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 1:10 PM, Pravin Shelar <pshelar@ovn.org> wrote:
>> On Mon, Apr 23, 2018 at 6:39 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Yi-Hung Wei <yihung.wei@gmail.com>
>>> Date: Tue, 17 Apr 2018 17:30:27 -0700
>>>
>>>> Currently, nf_conntrack_max is used to limit the maximum number of
>>>> conntrack entries in the conntrack table for every network namespace.
>>>> For the VMs and containers that reside in the same namespace,
>>>> they share the same conntrack table, and the total # of conntrack entries
>>>> for all the VMs and containers are limited by nf_conntrack_max.  In this
>>>> case, if one of the VM/container abuses the usage the conntrack entries,
>>>> it blocks the others from committing valid conntrack entries into the
>>>> conntrack table.  Even if we can possibly put the VM in different network
>>>> namespace, the current nf_conntrack_max configuration is kind of rigid
>>>> that we cannot limit different VM/container to have different # conntrack
>>>> entries.
>>>>
>>
>> Hi
>> This looks like a general problem related to nf zone usage limits. Did
>> you consider changing nf-conntrack to have a per-zone limit, so that
>> all users of nf-filter can use it? I prefer this to adding a wrapper
>> in the OVS nf-filter layer.
>>
>> Thanks,
>> Pravin.
>>
>
> Hi Pravin,
>
> Thanks for your comment.  Originally, I was thinking to add this
> feature in nf_conntrack and had some discussion with Florian.  It
> turns out that iptables and nft have their own way to keep track of
> the connection limits, and it sounds reasonable to share the backend
> that counts the number of connections, but each module can enforce the
> connection limit in their own way.  Therefore, Florian helped to pull
> out the common backend to nf_conncount in the following commit. The
> nf_conncount then can be used by xtables, nft, and ovs.
>
> commit 625c556118f3c2fd28bb8ef6da18c53bd4037be4
> Author: Florian Westphal <fw@strlen.de>
> Date:   Sat Dec 9 21:01:08 2017 +0100
>
>     netfilter: connlimit: split xt_connlimit into front and backend
>
> This allows to reuse xt_connlimit infrastructure from nf_tables.
> The upcoming nf_tables frontend can just pass in an nftables register
> as input key, this allows limiting by any nft-supported key, including
> concatenations.  For xt_connlimit, pass in the zone and the ip/ipv6 address.
> ....
>
>
> Basically, to achieve a conntrack zone limit in OVS, we need the
> following 3 parts.
> 1. Count the number of connections (this is provided by netfilter's
> nf_conncount backend)
> 2. Keep track of the connection limits of zones, and check if it
> exceeds the limit.
> 3. An API for userspace to set/delete/get the conntrack zone limit.
>
> This patch series implements items 2 and 3, and it reuses
> nf_conncount from netfilter for the first part.
>
OK. Thanks for the info.
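
As a rough illustration of item 2 in the quoted list (tracking per-zone
limits and checking them), here is a minimal userspace sketch in C. It is
not the code from this series: `struct zone_limits`, `zone_limit_get()`,
`zone_may_commit()`, the fixed-size table and the fallback default limit
are all assumptions made for the example, and the current connection count
is assumed to come from a counting backend such as nf_conncount.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ZONES 4u /* hypothetical: tiny table, for illustration only */

/* Hypothetical per-zone limit table; not the structure used by the patches. */
struct zone_limits {
	uint32_t default_limit;               /* assumed fallback for zones without an entry */
	bool     has_limit[SKETCH_MAX_ZONES]; /* true if the zone has an explicit limit */
	uint32_t limit[SKETCH_MAX_ZONES];     /* explicit limit, valid if has_limit[] is set */
};

/* Return the limit that applies to a zone. */
static uint32_t zone_limit_get(const struct zone_limits *zl, uint16_t zone)
{
	if (zone < SKETCH_MAX_ZONES && zl->has_limit[zone])
		return zl->limit[zone];
	return zl->default_limit;
}

/* Return true if one more connection may be committed in this zone,
 * given the number of connections currently counted for it.
 */
static bool zone_may_commit(const struct zone_limits *zl, uint16_t zone,
			    uint32_t current_count)
{
	return current_count < zone_limit_get(zl, zone);
}
```

In this model the caller would reject the commit when zone_may_commit()
returns false; where that check sits (per packet or at flow install) is
exactly the trade-off discussed in this thread.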

^ permalink raw reply

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-24  6:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180424022124-mutt-send-email-mst@kernel.org>

2018-04-24 1:22 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This RFC introduces a new address family called AF_XDP that is
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>> changes that Jesper Dangaard Brouer is working on. Some of his work
>> has already been accepted. We will publish our zero-copy support for
>> RX and TX on top of his patch sets at a later point in time.
>>
>> An AF_XDP socket (XSK) is created with the normal socket()
>> syscall. Associated with each XSK are two queues: the RX queue and the
>> TX queue. A socket can receive packets on the RX queue and it can send
>> packets on the TX queue. These queues are registered and sized with
>> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
>> mandatory to have at least one of these queues for each socket. In
>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>> packet buffers. An RX or TX descriptor points to a data buffer in a
>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>> packet does not have to be copied between RX and TX. Moreover, if a
>> packet needs to be kept for a while due to a possible retransmit, the
>> descriptor that points to that packet can be changed to point to
>> another and reused right away. This again avoids copying data.
>>
>> This new dedicated packet buffer area is called a UMEM. It consists of a
>> number of equally sized frames and each frame has a unique frame id. A
>> descriptor in one of the queues references a frame by referencing its
>> frame id. The user space allocates memory for this UMEM using whatever
>> means it feels is most appropriate (malloc, mmap, huge pages,
>> etc). This memory area is then registered with the kernel using the new
>> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
>> and the COMPLETION queue. The fill queue is used by the application to
>> send down frame ids for the kernel to fill in with RX packet
>> data. References to these frames will then appear in the RX queue of
>> the XSK once they have been received. The completion queue, on the
>> other hand, contains frame ids that the kernel has transmitted
>> completely and can now be used again by user space, for either TX or
>> RX. Thus, the frame ids appearing in the completion queue are ids that
>> were previously transmitted using the TX queue. In summary, the RX and
>> FILL queues are used for the RX path and the TX and COMPLETION queues
>> are used for the TX path.
>>
>> The socket is then finally bound with a bind() call to a device and a
>> specific queue id on that device, and it is not until bind is
>> completed that traffic starts to flow. Note that in this RFC, all
>> packet data is copied out to user-space.
>>
>> A new feature in this RFC is that the UMEM can be shared between
>> processes, if desired. If a process wants to do this, it simply skips
>> the registration of the UMEM and its corresponding two queues, sets a
>> flag in the bind call and submits the XSK of the process it would like
>> to share UMEM with as well as its own newly created XSK socket. The
>> new process will then receive frame id references in its own RX queue
>> that point to this shared UMEM. Note that since the queue structures
>> are single-consumer / single-producer (for performance reasons), the
>> new process has to create its own socket with associated RX and TX
>> queues, since it cannot share this with the other process. This is
>> also the reason that there is only one set of FILL and COMPLETION
>> queues per UMEM. It is the responsibility of a single process to
>> handle the UMEM. If multiple-producer / multiple-consumer queues are
>> implemented in the future, this requirement could be relaxed.
>>
>> How are packets then distributed between these two XSKs? We have
>> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
>> full). The user-space application can place an XSK at an arbitrary
>> place in this map. The XDP program can then redirect a packet to a
>> specific index in this map and at this point XDP validates that the
>> XSK in that map was indeed bound to that device and queue number. If
>> not, the packet is dropped. If the map is empty at that index, the
>> packet is also dropped. This also means that it is currently mandatory
>> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
>> to get any traffic to user space through the XSK.
>>
>> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
>> driver does not have support for XDP, or XDP_SKB is explicitly chosen
>> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
>> together with the generic XDP support and copies out the data to user
>> space. This is a fallback mode that works for any network device. On the other
>> hand, if the driver has support for XDP, it will be used by the AF_XDP
>> code to provide better performance, but there is still a copy of the
>> data into user space.
>>
>> There is a xdpsock benchmarking/test application included that
>> demonstrates how to use AF_XDP sockets with both private and shared
>> UMEMs. Say that you would like your UDP traffic from port 4242 to end
>> up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
>> for this:
>>
>>       ethtool -N p3p2 rx-flow-hash udp4 fn
>>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>           action 16
>>
>> Running the rxdrop benchmark in XDP_DRV mode can then be done
>> using:
>>
>>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>>
>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>> can be displayed with "-h", as usual.
>>
>> We have run some benchmarks on a dual socket system with two Broadwell
>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>> cores which gives a total of 28, but only two cores are used in these
>> experiments: one for TX/RX and one for the user space application. The
>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>> Intel I40E 40Gbit/s using the i40e driver.
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> and 1500 byte packets, generated by commercial packet generator HW that is
>> generating packets at full 40 Gbit/s line rate.
>>
>> AF_XDP performance 64 byte packets. Results from RFC V2 in parentheses.
>> Benchmark   XDP_SKB   XDP_DRV
>> rxdrop       2.9(3.0)   9.4(9.3)
>> txpush       2.5(2.2)   NA*
>> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
>>
>> AF_XDP performance 1500 byte packets:
>> Benchmark   XDP_SKB   XDP_DRV
>> rxdrop       2.1(2.2)   3.3(3.1)
>> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>>
>> * NA since we have no support for TX using the XDP_DRV infrastructure
>>   in this RFC. This is for a future patch set since it involves
>>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>>   Dangaard Brouer.
>>
>> XDP performance on our system as a base line:
>>
>> 64 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      32,921,521  0
>>
>> 1500 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      3,289,491   0
>>
>> Changes from RFC V2:
>>
>> * Optimizations and simplifications to the ring structures inspired by
>>   ptr_ring.h
>> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>>   consistent with AF_PACKET
>> * Support for only having an RX queue or a TX queue defined
>> * Some bug fixes and code cleanup
>>
>> The structure of the patch set is as follows:
>>
>> Patches 1-2: Basic socket and umem plumbing
>> Patches 3-10: RX support together with the new XSKMAP
>> Patches 11-14: TX support
>> Patch 15: Sample application
>>
>> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
>> Clean up btf.h in uapi")
>>
>> Questions:
>>
>> * How to deal with cache alignment for uapi when different
>>   architectures can have different cache line sizes? We have just
>>   aligned it to 64 bytes for now, which works for many popular
>>   architectures, but not all. Please advise.
>>
>> To do:
>>
>> * Optimize performance
>>
>> * Kernel selftest
>>
>> Post-series plan:
>>
>> * Loadable kernel module support for AF_XDP would be nice. Unclear how to
>>   achieve this though since our XDP code depends on net/core.
>>
>> * Support for AF_XDP sockets without an XDP program loaded. In this
>>   case all the traffic on a queue should go up to the user space socket.
>>
>> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>>   XDP_PASS" for a tcpdump-like functionality.
>>
>> * And of course getting to zero-copy support in small increments.
>>
>> Thanks: Björn and Magnus
>>
>> Björn Töpel (8):
>>   net: initial AF_XDP skeleton
>>   xsk: add user memory registration support sockopt
>>   xsk: add Rx queue setup and mmap support
>>   xdp: introduce xdp_return_buff API
>>   xsk: add Rx receive functions and poll support
>>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>>   xsk: wire up XDP_DRV side of AF_XDP
>>   xsk: wire up XDP_SKB side of AF_XDP
>>
>> Magnus Karlsson (7):
>>   xsk: add umem fill queue support and mmap
>>   xsk: add support for bind for Rx
>>   xsk: add umem completion queue support and mmap
>>   xsk: add Tx queue setup and mmap support
>>   xsk: support for Tx
>>   xsk: statistics support
>>   samples/bpf: sample application for AF_XDP sockets
>>
>>  MAINTAINERS                         |   8 +
>>  include/linux/bpf.h                 |  26 +
>>  include/linux/bpf_types.h           |   3 +
>>  include/linux/filter.h              |   2 +-
>>  include/linux/socket.h              |   5 +-
>>  include/net/xdp.h                   |   1 +
>>  include/net/xdp_sock.h              |  46 ++
>>  include/uapi/linux/bpf.h            |   1 +
>>  include/uapi/linux/if_xdp.h         |  87 ++++
>>  kernel/bpf/Makefile                 |   3 +
>>  kernel/bpf/verifier.c               |   8 +-
>>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>>  net/Kconfig                         |   1 +
>>  net/Makefile                        |   1 +
>>  net/core/dev.c                      |  34 +-
>>  net/core/filter.c                   |  40 +-
>>  net/core/sock.c                     |  12 +-
>>  net/core/xdp.c                      |  15 +-
>>  net/xdp/Kconfig                     |   7 +
>>  net/xdp/Makefile                    |   2 +
>>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>>  net/xdp/xdp_umem.h                  |  65 +++
>>  net/xdp/xdp_umem_props.h            |  23 +
>>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.c                 |  73 +++
>>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>>  samples/bpf/Makefile                |   4 +
>>  samples/bpf/xdpsock.h               |  11 +
>>  samples/bpf/xdpsock_kern.c          |  56 +++
>>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>>  security/selinux/hooks.c            |   4 +-
>>  security/selinux/include/classmap.h |   4 +-
>>  32 files changed, 2945 insertions(+), 35 deletions(-)
>>  create mode 100644 include/net/xdp_sock.h
>>  create mode 100644 include/uapi/linux/if_xdp.h
>>  create mode 100644 kernel/bpf/xskmap.c
>>  create mode 100644 net/xdp/Kconfig
>>  create mode 100644 net/xdp/Makefile
>>  create mode 100644 net/xdp/xdp_umem.c
>>  create mode 100644 net/xdp/xdp_umem.h
>>  create mode 100644 net/xdp/xdp_umem_props.h
>>  create mode 100644 net/xdp/xsk.c
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>  create mode 100644 samples/bpf/xdpsock.h
>>  create mode 100644 samples/bpf/xdpsock_kern.c
>>  create mode 100644 samples/bpf/xdpsock_user.c
>
> Is there a chance of Documentation/networking/af_xdp.txt ?
>

Yes. :-) We'll add that to the next spin!

>
>>
>> --
>> 2.14.1
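
The FILL and COMPLETION queue description quoted above can be pictured
with a small single-producer/single-consumer ring of frame ids. This is
only a userspace sketch in C11 under stated assumptions:
`struct frame_ring`, `ring_produce()`, `ring_consume()` and the ring size
are invented for illustration and are not the layout or API from the
patch set.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* Hypothetical ring of frame ids shared by exactly one producer
 * and exactly one consumer.
 */
struct frame_ring {
	_Atomic uint32_t producer; /* written only by the producer */
	_Atomic uint32_t consumer; /* written only by the consumer */
	uint32_t frame_id[RING_SIZE];
};

/* Producer side: hand a frame id to the other end.
 * Returns false if the ring is full.
 */
static bool ring_produce(struct frame_ring *r, uint32_t id)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return false;

	r->frame_id[prod & RING_MASK] = id;
	/* Publish the entry only after the slot has been written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: take the oldest frame id.
 * Returns false if the ring is empty.
 */
static bool ring_consume(struct frame_ring *r, uint32_t *id)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return false;

	*id = r->frame_id[cons & RING_MASK];
	/* Release the slot only after the entry has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}
```

In this model an application would call ring_produce() on a FILL-style
ring to give frames to the kernel, and ring_consume() on a
COMPLETION-style ring to get transmitted frames back. Because each index
has a single writer, two processes cannot share one such ring, which is
the reason the cover letter gives for one FILL/COMPLETION pair per UMEM.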

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-24  7:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423232619-mutt-send-email-mst@kernel.org>

2018-04-23 22:26 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 10:15:18PM +0200, Björn Töpel wrote:
>> 2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>> > On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
>> >> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>> >>
>> >> [...]
>> >>
>> >> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> >> >> +{
>> >> >> +     unsigned int i;
>> >> >> +
>> >> >> +     if (umem->pgs) {
>> >> >> +             for (i = 0; i < umem->npgs; i++)
>> >> >
>> >> > Since you pin them with FOLL_WRITE, I assume these pages
>> >> > are written to.
>> >> > Don't you need set_page_dirty_lock here?
>> >> >
>> >>
>> >> Hmm, I actually *removed* it from the RFC V2, but after doing some
>> >> homework, I think you're right. Thanks for pointing this out!
>> >>
>> >> Thinking more about this; This function is called from sk_destruct,
>> >> and in the Tx case the sk_destruct can be called from interrupt
>> >> context, where set_page_dirty_lock cannot be called.
>> >>
>> >> Are there any preferred ways of solving this? Scheduling the whole
>> >> xsk_destruct call to a workqueue is one way (I think). Any
>> >> cleaner/better way?
>> >>
>> >> [...]
>> >
>> > Defer unpinning pages until the next tx call?
>> >
>>
>> If the sock is released, there won't be another tx call.
>
> unpin them on socket release too?
>

AF_XDP pins all memory up front, and unpins it when the socket is
released (final sock_put), which in this case is in the skb
destructor. So there's no later point from a sock lifetime
perspective.

I'll take a stab at doing the umem cleanup in a workqueue.

>> Or am I
>> missing something obvious?
>>
>> >
>> >> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> >> >> +{
>> >> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> >> >> +     u64 addr = mr->addr, size = mr->len;
>> >> >> +     unsigned int nframes;
>> >> >> +     int size_chk, err;
>> >> >> +
>> >> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> >> >> +             /* Strictly speaking we could support this, if:
>> >> >> +              * - huge pages, or*
>> >> >
>> >> > what does "or*" here mean?
>> >> >
>> >>
>> >> Oops, I'll change to just 'or' in the next revision.
>> >>
>> >>
>> >> Thanks!
>> >> Björn

^ permalink raw reply

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Jesper Dangaard Brouer @ 2018-04-24  7:27 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Michael S. Tsirkin, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z, brouer
In-Reply-To: <CAJ+HfNjJjVLPY_Si4-f91_o2HOQGCBmPuNN3cyAahpixTcRRXw@mail.gmail.com>

On Tue, 24 Apr 2018 08:55:33 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> > Is there a chance of Documentation/networking/af_xdp.txt ?
> >  
> 
> Yes. :-) We'll add that to the next spin!

Could we please create it using RST format (ReStructuredText) from the
start?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-24  7:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-+VKKspFwPCXrX_U9_rVgAXrFkarFXu8mfLsL2=QuLdPg@mail.gmail.com>

2018-04-24 1:04 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> In this commit the base structure of the AF_XDP address family is set
>> up. Further, we introduce the ability to register a window of user memory
>> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
>> window is viewed by an AF_XDP socket as a set of equally large
>> frames. After a user memory registration all frames are "owned" by the
>> user application, and not the kernel.
>>
>> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>
>> +static void xdp_umem_release(struct xdp_umem *umem)
>> +{
>> +       struct task_struct *task;
>> +       struct mm_struct *mm;
>> +       unsigned long diff;
>> +
>> +       if (umem->pgs) {
>> +               xdp_umem_unpin_pages(umem);
>> +
>> +               task = get_pid_task(umem->pid, PIDTYPE_PID);
>> +               put_pid(umem->pid);
>> +               if (!task)
>> +                       goto out;
>> +               mm = get_task_mm(task);
>> +               put_task_struct(task);
>> +               if (!mm)
>> +                       goto out;
>> +
>> +               diff = umem->size >> PAGE_SHIFT;
>
> Need to round up or size must always be a multiple of PAGE_SIZE.
>

Yes, you're right! I'll add constraints to the umem setup. See further
down in the reply.

>> +
>> +               down_write(&mm->mmap_sem);
>> +               mm->pinned_vm -= diff;
>> +               up_write(&mm->mmap_sem);
>
> When using user->locked_vm for resource limit checks, no need
> to also update mm->pinned_vm?
>

Hmm, dug around in the code, and it looks like you're correct -- i.e.
if user->locked_vm is used, we shouldn't update the mm->pinned_vm.
I'll need to check a bit more, so that I'm certain, but if so, I'll
remove it in the next revision.

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +       u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +       u64 addr = mr->addr, size = mr->len;
>> +       unsigned int nframes;
>> +       int size_chk, err;
>> +
>> +       if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +               /* Strictly speaking we could support this, if:
>> +                * - huge pages, or*
>> +                * - using an IOMMU, or
>> +                * - making sure the memory area is consecutive
>> +                * but for now, we simply say "computer says no".
>> +                */
>> +               return -EINVAL;
>> +       }
>
> Ideally, AF_XDP subsumes all packet socket use cases. It does not
> have packet v3's small packet optimizations of variable sized frames
> and block signaling.
>
> I don't suggest adding that now. But for the non-zerocopy case, it may
> make sense to ensure that nothing is blocking a later addition of these
> features. Especially for header-only (snaplen) workloads. So far, I don't
> see any issues.
>

Ok. Block signaling is sort of ring batching, so I think we're good
for that case. As for variable sized frames *within* a umem, that's
trickier. Supporting different sizes would need multiple umems (and
multiple queues) -- if that makes sense?

>> +       if (!is_power_of_2(frame_size))
>> +               return -EINVAL;
>> +
>> +       if (!PAGE_ALIGNED(addr)) {
>> +               /* Memory area has to be page size aligned. For
>> +                * simplicity, this might change.
>> +                */
>> +               return -EINVAL;
>> +       }
>> +
>> +       if ((addr + size) < addr)
>> +               return -EINVAL;
>> +
>> +       nframes = size / frame_size;
>> +       if (nframes == 0 || nframes > UINT_MAX)
>> +               return -EINVAL;
>
> You may also want a check here that nframes * frame_size is at least
> PAGE_SIZE and probably a multiple of that.
>

Yup! I'll add those checks. This will make the "diff shift" in the
release code safe as well. Thanks!
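
The extra constraint agreed on here (a non-empty area whose size is a
multiple of the page size) can be sketched together with the checks
already in the quoted hunk. This is a standalone userspace model, not the
next revision of the patch: `umem_args_ok()`, `SKETCH_PAGE_SIZE` and
`SKETCH_MIN_FRAME_SIZE` (a stand-in for XDP_UMEM_MIN_FRAME_SIZE, whose
value is assumed) are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE      4096u /* assumption: 4 KiB pages */
#define SKETCH_MIN_FRAME_SIZE 2048u /* assumption: stand-in for XDP_UMEM_MIN_FRAME_SIZE */

static bool is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Validate a (addr, len, frame_size) registration request. */
static bool umem_args_ok(uint64_t addr, uint64_t len, uint32_t frame_size)
{
	if (frame_size < SKETCH_MIN_FRAME_SIZE || frame_size > SKETCH_PAGE_SIZE)
		return false;
	if (!is_pow2(frame_size))
		return false;
	if (addr % SKETCH_PAGE_SIZE != 0) /* area must start on a page boundary */
		return false;
	if (addr + len < addr)            /* reject address wrap-around */
		return false;
	/* Additional check from the review: at least one page, and a whole
	 * number of pages, so that "len >> PAGE_SHIFT" loses nothing.
	 */
	if (len == 0 || len % SKETCH_PAGE_SIZE != 0)
		return false;
	/* The frame count is computed in 64 bits so the range check is meaningful. */
	if (len / frame_size > UINT32_MAX)
		return false;
	return true;
}
```

Since frame_size is a power of two no larger than the page size, a
page-multiple length is automatically a whole number of frames as well.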

>> +       frame_headroom = ALIGN(frame_headroom, 64);
>> +
>> +       size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
>> +       if (size_chk < 0)
>> +               return -EINVAL;
>> +
>> +       umem->pid = get_task_pid(current, PIDTYPE_PID);
>> +       umem->size = (size_t)size;
>> +       umem->address = (unsigned long)addr;
>> +       umem->props.frame_size = frame_size;
>> +       umem->props.nframes = nframes;
>> +       umem->frame_headroom = frame_headroom;
>> +       umem->npgs = size / PAGE_SIZE;
>> +       umem->pgs = NULL;
>> +       umem->user = NULL;
>> +
>> +       umem->frame_size_log2 = ilog2(frame_size);
>> +       umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
>> +       umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
>> +       atomic_set(&umem->users, 1);
>> +
>> +       err = xdp_umem_account_pages(umem);
>> +       if (err)
>> +               goto out;
>> +
>> +       err = xdp_umem_pin_pages(umem);
>> +       if (err)
>
> need to call xdp_umem_unaccount_pages on error

Indeed! I'll fix that!

>> +               goto out;
>> +       return 0;
>> +
>> +out:
>> +       put_pid(umem->pid);
>> +       return err;
>> +}

^ permalink raw reply

* [PATCH net-next] ipv6: addrconf: don't evaluate keep_addr_on_down twice
From: Ivan Vecera @ 2018-04-24  7:31 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern

The addrconf_ifdown() evaluates keep_addr_on_down state twice. There
is no need to do it.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Ivan Vecera <cera@cera.cz>
---
 net/ipv6/addrconf.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 78cef00c9596..f40e25fd15ee 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3612,8 +3612,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 	struct net *net = dev_net(dev);
 	struct inet6_dev *idev;
 	struct inet6_ifaddr *ifa, *tmp;
-	int _keep_addr;
-	bool keep_addr;
+	bool keep_addr = false;
 	int state, i;
 
 	ASSERT_RTNL();
@@ -3639,15 +3638,18 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 
 	}
 
-	/* aggregate the system setting and interface setting */
-	_keep_addr = net->ipv6.devconf_all->keep_addr_on_down;
-	if (!_keep_addr)
-		_keep_addr = idev->cnf.keep_addr_on_down;
-
 	/* combine the user config with event to determine if permanent
 	 * addresses are to be removed from address hash table
 	 */
-	keep_addr = !(how || _keep_addr <= 0 || idev->cnf.disable_ipv6);
+	if (!how && !idev->cnf.disable_ipv6) {
+		/* aggregate the system setting and interface setting */
+		int _keep_addr = net->ipv6.devconf_all->keep_addr_on_down;
+
+		if (!_keep_addr)
+			_keep_addr = idev->cnf.keep_addr_on_down;
+
+		keep_addr = (_keep_addr > 0);
+	}
 
 	/* Step 2: clear hash table */
 	for (i = 0; i < IN6_ADDR_HSIZE; i++) {
@@ -3697,11 +3699,6 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 		write_lock_bh(&idev->lock);
 	}
 
-	/* re-combine the user config with event to determine if permanent
-	 * addresses are to be removed from the interface list
-	 */
-	keep_addr = (!how && _keep_addr > 0 && !idev->cnf.disable_ipv6);
-
 	list_for_each_entry_safe(ifa, tmp, &idev->addr_list, if_list) {
 		struct rt6_info *rt = NULL;
 		bool keep;
-- 
2.16.1
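
To see that the single evaluation matches the expression it replaces, the
two forms can be compared in a small standalone C sketch. The function
names and the flattened parameters (standing in for
net->ipv6.devconf_all->keep_addr_on_down, idev->cnf.keep_addr_on_down and
idev->cnf.disable_ipv6) are made up for illustration; this is not kernel
code.

```c
#include <stdbool.h>

/* Form removed by the patch: aggregate first, then combine with the event. */
static bool keep_addr_old(int how, int all_keep, int dev_keep, bool disable_ipv6)
{
	int _keep_addr = all_keep;

	if (!_keep_addr)
		_keep_addr = dev_keep;

	return !(how || _keep_addr <= 0 || disable_ipv6);
}

/* Form added by the patch: only look at the settings when they can matter. */
static bool keep_addr_new(int how, int all_keep, int dev_keep, bool disable_ipv6)
{
	bool keep_addr = false;

	if (!how && !disable_ipv6) {
		int _keep_addr = all_keep;

		if (!_keep_addr)
			_keep_addr = dev_keep;

		keep_addr = (_keep_addr > 0);
	}

	return keep_addr;
}
```

Both functions agree for every combination of inputs, which is why the
second computation later in addrconf_ifdown() can simply be dropped.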

^ permalink raw reply related

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-24  7:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Michael S. Tsirkin, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180424092747.2f01330f@redhat.com>

2018-04-24 9:27 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 24 Apr 2018 08:55:33 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> > Is there a chance of Documentation/networking/af_xdp.txt ?
>> >
>>
>> Yes. :-) We'll add that to the next spin!
>
> Could we please create it using RST format (ReStructuredText) from the
> start?
>

Good point! We'll do a Documentation/networking/af_xdp.rst instead of a text file!

^ permalink raw reply

* Re: ipset losing entries on its own
From: Akshat Kakkar @ 2018-04-24  7:58 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev, netdev-owner
In-Reply-To: <CAA5aLPgPU5u6k+rB+5zNCsqp3UBnx3oZoSv_0drNCRK0tcSSBQ@mail.gmail.com>

Does anybody have any clue about this?

^ permalink raw reply

* Summary of the Linux IPsec workshop 2018
From: Steffen Klassert @ 2018-04-24  8:02 UTC (permalink / raw)
  To: netdev; +Cc: lwn

We have created a webpage that summarizes the Linux IPsec workshop 2018
that was held March 26 - 28 in Dresden, Germany:

https://workshop.linux-ipsec.org/2018/

The page was created from the etherpad we used during the workshop,
so don't expect anything fancy. It still does not cover all session
notes, but it gets updated whenever new information comes in.

^ permalink raw reply

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Magnus Karlsson @ 2018-04-24  8:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Michael S. Tsirkin, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-+m5+5sKvo2Z1YOOX+zFKNYLVFqjq6+b4wpP6dTX=cyEA@mail.gmail.com>

On Tue, Apr 24, 2018 at 1:59 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>>
>>> Here, we add another setsockopt for registered user memory (umem)
>>> called XDP_UMEM_FILL_RING. Using this socket option, the process can
>>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>>> (XDP_UMEM_PGOFF_FILL_RING) into the process.
>>>
>>> The queue is used to explicitly pass ownership of umem frames from the
>>> user process to the kernel. These frames will in a later patch be
>>> filled in with Rx packet data by the kernel.
>>>
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>> ---
>>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>>  net/xdp/Makefile            |  2 +-
>>>  net/xdp/xdp_umem.c          |  5 ++++
>>>  net/xdp/xdp_umem.h          |  2 ++
>>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>>  create mode 100644 net/xdp/xsk_queue.c
>>>  create mode 100644 net/xdp/xsk_queue.h
>>>
>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>> index 41252135a0fe..975661e1baca 100644
>>> --- a/include/uapi/linux/if_xdp.h
>>> +++ b/include/uapi/linux/if_xdp.h
>>> @@ -23,6 +23,7 @@
>>>
>>>  /* XDP socket options */
>>>  #define XDP_UMEM_REG                 3
>>> +#define XDP_UMEM_FILL_RING           4
>>>
>>>  struct xdp_umem_reg {
>>>       __u64 addr; /* Start of packet data area */
>>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>>       __u32 frame_headroom; /* Frame head room */
>>>  };
>>>
>>> +/* Pgoff for mmaping the rings */
>>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>>> +
>>> +struct xdp_ring {
>>> +     __u32 producer __attribute__((aligned(64)));
>>> +     __u32 consumer __attribute__((aligned(64)));
>>> +};
>>
>> Why 64? And do you still need these guys in uapi?
>
> I was just about to ask the same. You mean cacheline_aligned?

Yes, I would like to have these cache aligned. How can I accomplish
this in a uapi?
I put a note about this in the cover letter:

* How to deal with cache alignment for uapi when different
  architectures can have different cache line sizes? We have just
  aligned it to 64 bytes for now, which works for many popular
  architectures, but not all. Please advise.
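
To make the concern concrete, here is a minimal userspace sketch of the
layout that the quoted struct xdp_ring produces. It assumes GCC/Clang
attribute syntax; the names demo_ring and demo_on_separate_lines are
hypothetical and exist only for this illustration. The value 64 is the
one used in the patch, and the helper shows that the two indices stop
being on separate cache lines once the line size is 128 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the ring header quoted above.
 * Each index is aligned to 64 bytes, the value chosen in the patch.
 */
struct demo_ring {
	uint32_t producer __attribute__((aligned(64)));
	uint32_t consumer __attribute__((aligned(64)));
};

/* The hard-coded alignment puts consumer exactly 64 bytes after producer. */
static_assert(offsetof(struct demo_ring, consumer) == 64,
	      "consumer should start on the next 64-byte boundary");

/* Returns 1 if producer and consumer land on different cache lines for
 * the given line size (assumed to be a power of two), 0 otherwise.
 */
static int demo_on_separate_lines(size_t line_size)
{
	return offsetof(struct demo_ring, producer) / line_size !=
	       offsetof(struct demo_ring, consumer) / line_size;
}
```

With a 64-byte line the helper reports separate lines; with a 128-byte
line both indices share one line, which is the false-sharing case the
cover letter note asks about.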

>
>>> +static int xsk_mmap(struct file *file, struct socket *sock,
>>> +                 struct vm_area_struct *vma)
>>> +{
>>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>>> +     unsigned long size = vma->vm_end - vma->vm_start;
>>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>>> +     struct xsk_queue *q;
>>> +     unsigned long pfn;
>>> +     struct page *qpg;
>>> +
>>> +     if (!xs->umem)
>>> +             return -EINVAL;
>>> +
>>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>>> +             q = xs->umem->fq;
>>> +     else
>>> +             return -EINVAL;
>>> +
>>> +     qpg = virt_to_head_page(q->ring);
>
> Is it assured that q is initialized with a call to setsockopt
> XDP_UMEM_FILL_RING before the call to mmap?

Unfortunately not, so this is a bug. It is definitely a case in point
for running syzkaller, as suggested below.
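
As a rough illustration of the kind of guard that is missing, here is a
userspace model of the lookup in xsk_mmap. This is only a sketch under
stated assumptions, not the fix that will go into the series: the
demo_* types, the DEMO_PGOFF_FILL_RING constant and demo_mmap_lookup
are hypothetical stand-ins for the kernel structures quoted above, and
returning -EINVAL for an unconfigured ring is an assumed choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xsk_queue and struct xdp_umem. */
struct demo_queue {
	void *ring;
};

struct demo_umem {
	struct demo_queue *fq;	/* stays NULL until the fill ring is set up */
};

/* Same value as XDP_UMEM_PGOFF_FILL_RING in the quoted patch. */
#define DEMO_PGOFF_FILL_RING 0x100000000ULL

/* Model of the queue lookup with the extra check: refuse the mapping
 * when the fill ring has not been created yet, instead of passing a
 * NULL queue on to the page lookup.
 */
static int demo_mmap_lookup(const struct demo_umem *umem,
			    unsigned long long offset,
			    struct demo_queue **out)
{
	struct demo_queue *q;

	if (!umem)
		return -EINVAL;

	if (offset == DEMO_PGOFF_FILL_RING)
		q = umem->fq;
	else
		return -EINVAL;

	if (!q)		/* setsockopt for the fill ring was never called */
		return -EINVAL;

	*out = q;
	return 0;
}
```

Calling the helper with a umem whose fq is still NULL returns -EINVAL
rather than dereferencing the missing queue.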

> In general, with such an extensive new API, it might be worthwhile to
> run syzkaller locally on a kernel with these patches. It is pretty
> easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
> though it also needs to be taught about any new APIs.

Good idea. Will set this up and have it torture the API.

Thanks: Magnus

^ permalink raw reply
