Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-15  9:08 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Will Deacon, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Catalin Marinas,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Andrey Konovalov, Greg Kroah-Hartman, Enrico Weigelt,
	Suzuki K Poulose, Thomas Gleixner, Masayoshi Mizuma,
	Shaokun Zhang, Alexios Zavras, Allison Randal, Linux ARM,
	linux-kernel, Network Development, bpf
In-Reply-To: <CAKwvOdk4hca8WzWzhcPEvxXnJVLbXGnhBdDZbeL_W_H91Ttjqw@mail.gmail.com>

On Thu, Aug 15, 2019 at 12:20 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> This lone patch of the series is just cosmetic, but patch 14/16 fixes
> a real boot issue:
> https://github.com/ClangBuiltLinux/linux/issues/619
> Miguel, I'd like to get that one landed ASAP; the rest are just for consistency.

Ah, interesting. It would be best to have sent that one independently
to the others, plus adding a commit message mentioning this in
particular. Let's talk about that in the thread.

> Miguel, how do you want to take the rest of these patches? Will picked
> up the arm64 one, I think the SuperH one got picked up.  There was
> feedback to add more info to individual commits' commit messages.

Yes, I told Will I would pick up whatever is not already picked up by
individual maintainers.

> I kept these tree wide changes separate to improve the likelihood that
> they'd backport to stable cleanly, but could always squash if you'd
> prefer to have 1 patch instead of a series.  Just let me know.

Since you already did the splitting work, let's take advantage of it.
I prefer them to be split anyway, since that gives maintainers a
chance to pick them up individually if they prefer to do so.

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-15  9:12 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Will Deacon, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Catalin Marinas,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Andrey Konovalov, Greg Kroah-Hartman, Enrico Weigelt,
	Suzuki K Poulose, Thomas Gleixner, Masayoshi Mizuma,
	Shaokun Zhang, Alexios Zavras, Allison Randal, Linux ARM,
	linux-kernel, Network Development, bpf
In-Reply-To: <CANiq72mGoGpx7EAVUPcGuhVkLit8sB3bR-k1XBDyeM8HBUaDZw@mail.gmail.com>

On Thu, Aug 15, 2019 at 11:08 AM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> On Thu, Aug 15, 2019 at 12:20 AM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > This lone patch of the series is just cosmetic, but patch 14/16 fixes
> > a real boot issue:
> > https://github.com/ClangBuiltLinux/linux/issues/619
> > Miguel, I'd like to get that one landed ASAP; the rest are just for consistency.
>
> Ah, interesting. It would be best to have sent that one independently
> to the others, plus adding a commit message mentioning this in
> particular. Let's talk about that in the thread.

Btw, I guess that is the Oops you were mentioning in the cover letter?

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH net] tunnel: fix dev null pointer dereference when send pkg larger than mtu in collect_md mode
From: Eric Dumazet @ 2019-08-15  9:16 UTC (permalink / raw)
  To: Hangbin Liu, netdev
  Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller
In-Reply-To: <20190815060904.19426-1-liuhangbin@gmail.com>



On 8/15/19 8:09 AM, Hangbin Liu wrote:
> When we send a packet larger than PMTU, we need to reply with
> icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).
> 
> But in collect_md mode, kernel will crash while accessing the dst dev
> as __metadata_dst_init() init dst->dev to NULL by default. Here is what
> the code path looks like, for GRE:
> 
> - ip6gre_tunnel_xmit
>   - ip6gre_xmit_ipv4
>     - __gre6_xmit
>       - ip6_tnl_xmit
>         - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
>     - icmp_send
>       - net = dev_net(rt->dst.dev); <-- here
>   - ip6gre_xmit_ipv6
>     - __gre6_xmit
>       - ip6_tnl_xmit
>         - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
>     - icmpv6_send
>       ...
>       - decode_session4
>         - oif = skb_dst(skb)->dev->ifindex; <-- here
>       - decode_session6
>         - oif = skb_dst(skb)->dev->ifindex; <-- here
> 
> Fix it by updating the dst dev if not set.
> 
> The reproducer is easy:
> 
> ovs-vsctl add-br br0
> ip link set br0 up
> ovs-vsctl add-port br0 gre0 -- \
> 	  set interface gre0 type=gre options:remote_ip=$dst_addr
> ip link set gre0 up
> ip addr add ${local_gre6}/64 dev br0
> ping6 $remote_gre6 -s 1500
> 
> Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv4/ip_tunnel.c  |  3 +++
>  net/ipv6/ip6_tunnel.c | 13 +++++++++----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index 38c02bb62e2c..c6713c7287df 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -597,6 +597,9 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>  		goto tx_error;
>  	}
>  
> +	if (skb_dst(skb) && !skb_dst(skb)->dev)
> +		skb_dst(skb)->dev = rt->dst.dev;
> +


IMO this looks wrong.
This dst seems shared. 
Once set, we will reuse the same dev ?

If intended, why not doing this in __metadata_dst_init() instead of in the fast path ?

>  	if (key->tun_flags & TUNNEL_DONT_FRAGMENT)
>  		df = htons(IP_DF);
>  	if (tnl_update_pmtu(dev, skb, rt, df, inner_iph, tunnel_hlen,

^ permalink raw reply

* Re: [PATCH] tun: fix use-after-free when register netdev failed
From: Jason Wang @ 2019-08-15  9:21 UTC (permalink / raw)
  To: Yang Yingliang, netdev; +Cc: xiyou.wangcong, davem
In-Reply-To: <1565857122-24660-1-git-send-email-yangyingliang@huawei.com>


On 2019/8/15 下午4:18, Yang Yingliang wrote:
> I got a UAF repport in tun driver when doing fuzzy test:
>
> [  466.269490] ==================================================================
> [  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
> [  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
> [  466.271810]
> [  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
> [  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [  466.271838] Call Trace:
> [  466.271858]  dump_stack+0xca/0x13e
> [  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271890]  print_address_description+0x79/0x440
> [  466.271906]  ? vprintk_func+0x5e/0xf0
> [  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271935]  __kasan_report+0x15c/0x1df
> [  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271976]  kasan_report+0xe/0x20
> [  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
> [  466.272013]  do_iter_readv_writev+0x4b7/0x740
> [  466.272032]  ? default_llseek+0x2d0/0x2d0
> [  466.272072]  do_iter_read+0x1c5/0x5e0
> [  466.272110]  vfs_readv+0x108/0x180
> [  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
> [  466.299020]  ? fsnotify+0x888/0xd50
> [  466.299040]  ? __fsnotify_parent+0xd0/0x350
> [  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
> [  466.304548]  ? vfs_write+0x264/0x510
> [  466.304569]  ? ksys_write+0x101/0x210
> [  466.304591]  ? do_preadv+0x116/0x1a0
> [  466.304609]  do_preadv+0x116/0x1a0
> [  466.309829]  do_syscall_64+0xc8/0x600
> [  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.309861] RIP: 0033:0x4560f9
> [  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
> [  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
> [  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
> [  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
> [  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
> [  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
> [  466.323057]
> [  466.323064] Allocated by task 2605:
> [  466.335165]  save_stack+0x19/0x80
> [  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
> [  466.337755]  kmem_cache_alloc+0xe8/0x320
> [  466.339050]  getname_flags+0xca/0x560
> [  466.340229]  user_path_at_empty+0x2c/0x50
> [  466.341508]  vfs_statx+0xe6/0x190
> [  466.342619]  __do_sys_newstat+0x81/0x100
> [  466.343908]  do_syscall_64+0xc8/0x600
> [  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.347034]
> [  466.347517] Freed by task 2605:
> [  466.348471]  save_stack+0x19/0x80
> [  466.349476]  __kasan_slab_free+0x12e/0x180
> [  466.350726]  kmem_cache_free+0xc8/0x430
> [  466.351874]  putname+0xe2/0x120
> [  466.352921]  filename_lookup+0x257/0x3e0
> [  466.354319]  vfs_statx+0xe6/0x190
> [  466.355498]  __do_sys_newstat+0x81/0x100
> [  466.356889]  do_syscall_64+0xc8/0x600
> [  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.359567]
> [  466.360050] The buggy address belongs to the object at ffff888372139100
> [  466.360050]  which belongs to the cache names_cache of size 4096
> [  466.363735] The buggy address is located 336 bytes inside of
> [  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
> [  466.367179] The buggy address belongs to the page:
> [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
> [  466.371582] flags: 0x2fffff80010200(slab|head)
> [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
> [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [  466.377778] page dumped because: kasan: bad access detected
> [  466.379730]
> [  466.380288] Memory state around the buggy address:
> [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.388257]                                                  ^
> [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.394667] ==================================================================
>
> tun_chr_read_iter() accessed the memory which freed by free_netdev()
> called by tun_set_iff():
>
> 	CPUA				CPUB
>      tun_set_iff()
>        alloc_netdev_mqs()
>        tun_attach()
> 				    tun_chr_read_iter()
> 				      tun_get()
>        register_netdevice()
>        tun_detach_all()
>          synchronize_net()
> 				      tun_do_read()
> 				        tun_ring_recv()
> 				          schedule()
>        free_netdev()
> 				      tun_put() <-- UAF
>
> Set a new bit in tun->flag if register_netdevice() successed,
> without this bit, tun_get() returns NULL to avoid using a
> freed tun pointer.


Good catch.

Some comments inline.


>
> Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>   drivers/net/tun.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index db16d7a13e00..cbd60c276c40 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -115,6 +115,7 @@ do {								\
>   /* High bits in flags field are unused. */
>   #define TUN_VNET_LE     0x80000000
>   #define TUN_VNET_BE     0x40000000
> +#define TUN_DEV_REGISTERED	0x20000000
>   
>   #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
>   		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> @@ -719,8 +720,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>   			netif_carrier_off(tun->dev);
>   
>   			if (!(tun->flags & IFF_PERSIST) &&
> -			    tun->dev->reg_state == NETREG_REGISTERED)
> +			    tun->dev->reg_state == NETREG_REGISTERED) {
>   				unregister_netdevice(tun->dev);
> +				tun->flags &= ~TUN_DEV_REGISTERED;
> +			}
>   		}
>   		if (tun)
>   			xdp_rxq_info_unreg(&tfile->xdp_rxq);
> @@ -884,8 +887,10 @@ static struct tun_struct *tun_get(struct tun_file *tfile)
>   
>   	rcu_read_lock();
>   	tun = rcu_dereference(tfile->tun);
> -	if (tun)
> +	if (tun && (tun->flags & TUN_DEV_REGISTERED))
>   		dev_hold(tun->dev);
> +	else
> +		tun = NULL;
>   	rcu_read_unlock();
>   
>   	return tun;
> @@ -2836,6 +2841,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>   		err = register_netdevice(tun->dev);
>   		if (err < 0)
>   			goto err_detach;
> +		tun->flags |= TUN_DEV_REGISTERED;
>   	}
>   
>   	netif_carrier_on(tun->dev);


This looks just a duplicated of netdev->state? However it lacks 
sufficient synchronization like barriers or locks. How about:

- call tun_set_real_num_queues() before register_netdevice() this can 
have the same result as what  eb0fb363f920 did.
- move tun_attach() after register_netdevice() this makes sure we won't 
publish tfile->tun until we are sure at least one refcnt is held by 
register_netdevice()?

Thanks


^ permalink raw reply

* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:22 UTC (permalink / raw)
  To: 冉 jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB320512CCA27487548DDAA57FA6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午4:36, 冉 jiang wrote:
> On 2019/8/15 11:17, Jason Wang wrote:
>> On 2019/8/15 上午11:11, 冉 jiang wrote:
>>> On 2019/8/15 11:01, Jason Wang wrote:
>>>> On 2019/8/14 上午10:06, ? jiang wrote:
>>>>> This change lowers ring buffer reclaim threshold from 1/2*queue to
>>>>> budget
>>>>> for better performance. According to our test with qemu + dpdk, packet
>>>>> dropping happens when the guest is not able to provide free buffer in
>>>>> avail ring timely with default 1/2*queue. The value in the patch has
>>>>> been
>>>>> tested and does show better performance.
>>>> Please add your tests setup and result here.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
>>>>> ---
>>>>>     drivers/net/virtio_net.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 0d4115c9e20b..bc08be7925eb 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue
>>>>> *rq, int budget,
>>>>>             }
>>>>>         }
>>>>>     -    if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>>>>> +    if (rq->vq->num_free > min((unsigned int)budget,
>>>>> virtqueue_get_vring_size(rq->vq)) / 2) {
>>>>>             if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>>>>>                 schedule_delayed_work(&vi->refill, 0);
>>>>>         }
>>> Sure, here are the details:
>>
>> Thanks for the details, but I meant it's better if you could summarize
>> you test result in the commit log in a compact way.
>>
>> Btw, some comments, see below:
>>
>>
>>>
>>> Test setup & result:
>>>
>>> ----------------------------------------------------
>>>
>>> Below is the snippet from our test result. Test1 was done with default
>>> driver with the value of 1/2 * queue, while test2 is with my patch. We
>>> can see average
>>> drop packets do decrease a lot in test2.
>>>
>>> test1Time    avgDropPackets    test2Time    avgDropPackets pps
>>>
>>> 16:21.0    12.295    56:50.4    0 300k
>>> 17:19.1    15.244    56:50.4    0    300k
>>> 18:17.5    18.789    56:50.4    0    300k
>>> 19:15.1    14.208    56:50.4    0    300k
>>> 20:13.2    20.818    56:50.4    0.267    300k
>>> 21:11.2    12.397    56:50.4    0    300k
>>> 22:09.3    12.599    56:50.4    0    300k
>>> 23:07.3    15.531    57:48.4    0    300k
>>> 24:05.5    13.664    58:46.5    0    300k
>>> 25:03.7    13.158    59:44.5    4.73    300k
>>> 26:01.1    2.486    00:42.6    0    300k
>>> 26:59.1    11.241    01:40.6    0    300k
>>> 27:57.2    20.521    02:38.6    0    300k
>>> 28:55.2    30.094    03:36.7    0    300k
>>> 29:53.3    16.828    04:34.7    0.963    300k
>>> 30:51.3    46.916    05:32.8    0    400k
>>> 31:49.3    56.214    05:32.8    0    400k
>>> 32:47.3    58.69    05:32.8    0    400k
>>> 33:45.3    61.486    05:32.8    0    400k
>>> 34:43.3    72.175    05:32.8    0.598    400k
>>> 35:41.3    56.699    05:32.8    0    400k
>>> 36:39.3    61.071    05:32.8    0    400k
>>> 37:37.3    43.355    06:30.8    0    400k
>>> 38:35.4    44.644    06:30.8    0    400k
>>> 39:33.4    72.336    06:30.8    0    400k
>>> 40:31.4    70.676    06:30.8    0    400k
>>> 41:29.4    108.009    06:30.8    0    400k
>>> 42:27.4    65.216    06:30.8    0    400k
>>
>> Why there're difference in test time? Could you summarize them like:
>>
>> Test setup: e.g testpmd or pktgen to generate packets to guest
>>
>> avg packets drop before: XXX
>>
>> avg packets drop after: YYY(-ZZZ%)
>>
>> Thanks
>>
>>
>>>
>>> Data to prove why the patch helps:
>>>
>>> ----------------------------------------------------
>>>
>>> We did have completed several rounds of test with setting the value to
>>> budget (64 as the default value). It does improve a lot with pps is
>>> below 400pps for a single stream. We are confident that it runs out
>>> of free
>>> buffer in avail ring when packet dropping happens with below systemtap:
>>>
>>> Just a snippet:
>>>
>>> probe module("virtio_ring").function("virtqueue_get_buf")
>>> {
>>>         x = (@cast($_vq, "vring_virtqueue")->vring->used->idx)-
>>> (@cast($_vq, "vring_virtqueue")->last_used_idx) ---> we use this one
>>> to verify if the queue is full, which means guest is not able to take
>>> buffer from the queue timely
>>>
>>>         if (x<0 && (x+65535)<4096)
>>>             x = x+65535
>>>
>>>         if((x==1024) && @cast($_vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             netrxcount[x] <<< gettimeofday_s()
>>> }
>>>
>>>
>>> probe module("virtio_ring").function("virtqueue_add_inbuf")
>>> {
>>>         y = (@cast($vq, "vring_virtqueue")->vring->avail->idx)-
>>> (@cast($vq, "vring_virtqueue")->vring->used->idx) ---> we use this one
>>> to verify if we run out of free buffer in avail ring
>>>         if (y<0 && (y+65535)<4096)
>>>             y = y+65535
>>>
>>>         if(@2=="debugon")
>>>         {
>>>             if(y==0 && @cast($vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             {
>>>                 netrxfreecount[y] <<< gettimeofday_s()
>>>
>>>                 printf("no avail ring left seen, printing most recent 5
>>> num free, vq: %lx, current index: %d\n", $vq, recentfreecount)
>>>                 for(i=recentfreecount; i!=((recentfreecount+4) % 5);
>>> i=((i+1) % 5))
>>>                 {
>>>                     printf("index: %d, num free: %d\n", i,
>>> recentfree[$vq,
>>> i])
>>>                 }
>>>
>>>                 printf("index: %d, num free: %d\n", i, recentfree[$vq,
>>> i])
>>>                 //exit()
>>>             }
>>>         }
>>> }
>>>
>>>
>>> probe
>>> module("virtio_net").statement("virtnet_receive@drivers/net/virtio_net.c:732")
>>>
>>>
>>> {
>>>         recentfreecount++
>>>         recentfreecount = recentfreecount % 5
>>>         recentfree[$rq->vq, recentfreecount] = $rq->vq->num_free --->
>>> record the num_free for the last 5 calls to virtnet_receive, so we can
>>> see if lowering the bar helps.
>>> }
>>>
>>>
>>> Here is the result:
>>>
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 561
>>> index: 2, num free: 305
>>> index: 3, num free: 369
>>> index: 4, num free: 433
>>> index: 0, num free: 497
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 543
>>> index: 2, num free: 463
>>> index: 3, num free: 469
>>> index: 4, num free: 476
>>> index: 0, num free: 479
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 2
>>> index: 2, num free: 555
>>> index: 3, num free: 414
>>> index: 4, num free: 420
>>> index: 0, num free: 427
>>> index: 1, num free: 491
>>>
>>> We can see in the last 4 calls to virtnet_receive before we run out
>>> of free buffer and start to relaim, num_free is quite high. So if we
>>> can do the reclaim earlier, it will certainly help.
>>>
>>> Jiang
>>
>> Right, but I think there's no need to put those thing in the commit log.
>>
>> Thanks
>>
>>
> Sure, here is the info:
>
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k)
>
> avg packets drop before: 2842
>
> avg packets drop after: 360(-87.3%)
>
>
> Just let me know if it looks good enough. Thx.
>
> Jiang


Looks good, please post a V2 and include the above result in the commit log.

Thanks

>

^ permalink raw reply

* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:25 UTC (permalink / raw)
  To: 冉 jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB320512CCA27487548DDAA57FA6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午4:36, 冉 jiang wrote:
> On 2019/8/15 11:17, Jason Wang wrote:
>> On 2019/8/15 上午11:11, 冉 jiang wrote:
>>> On 2019/8/15 11:01, Jason Wang wrote:
>>>> On 2019/8/14 上午10:06, ? jiang wrote:
>>>>> This change lowers ring buffer reclaim threshold from 1/2*queue to
>>>>> budget
>>>>> for better performance. According to our test with qemu + dpdk, packet
>>>>> dropping happens when the guest is not able to provide free buffer in
>>>>> avail ring timely with default 1/2*queue. The value in the patch has
>>>>> been
>>>>> tested and does show better performance.
>>>> Please add your tests setup and result here.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
>>>>> ---
>>>>>     drivers/net/virtio_net.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 0d4115c9e20b..bc08be7925eb 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue
>>>>> *rq, int budget,
>>>>>             }
>>>>>         }
>>>>>     -    if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>>>>> +    if (rq->vq->num_free > min((unsigned int)budget,
>>>>> virtqueue_get_vring_size(rq->vq)) / 2) {
>>>>>             if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>>>>>                 schedule_delayed_work(&vi->refill, 0);
>>>>>         }
>>> Sure, here are the details:
>>
>> Thanks for the details, but I meant it's better if you could summarize
>> you test result in the commit log in a compact way.
>>
>> Btw, some comments, see below:
>>
>>
>>>
>>> Test setup & result:
>>>
>>> ----------------------------------------------------
>>>
>>> Below is the snippet from our test result. Test1 was done with default
>>> driver with the value of 1/2 * queue, while test2 is with my patch. We
>>> can see average
>>> drop packets do decrease a lot in test2.
>>>
>>> test1Time    avgDropPackets    test2Time    avgDropPackets pps
>>>
>>> 16:21.0    12.295    56:50.4    0 300k
>>> 17:19.1    15.244    56:50.4    0    300k
>>> 18:17.5    18.789    56:50.4    0    300k
>>> 19:15.1    14.208    56:50.4    0    300k
>>> 20:13.2    20.818    56:50.4    0.267    300k
>>> 21:11.2    12.397    56:50.4    0    300k
>>> 22:09.3    12.599    56:50.4    0    300k
>>> 23:07.3    15.531    57:48.4    0    300k
>>> 24:05.5    13.664    58:46.5    0    300k
>>> 25:03.7    13.158    59:44.5    4.73    300k
>>> 26:01.1    2.486    00:42.6    0    300k
>>> 26:59.1    11.241    01:40.6    0    300k
>>> 27:57.2    20.521    02:38.6    0    300k
>>> 28:55.2    30.094    03:36.7    0    300k
>>> 29:53.3    16.828    04:34.7    0.963    300k
>>> 30:51.3    46.916    05:32.8    0    400k
>>> 31:49.3    56.214    05:32.8    0    400k
>>> 32:47.3    58.69    05:32.8    0    400k
>>> 33:45.3    61.486    05:32.8    0    400k
>>> 34:43.3    72.175    05:32.8    0.598    400k
>>> 35:41.3    56.699    05:32.8    0    400k
>>> 36:39.3    61.071    05:32.8    0    400k
>>> 37:37.3    43.355    06:30.8    0    400k
>>> 38:35.4    44.644    06:30.8    0    400k
>>> 39:33.4    72.336    06:30.8    0    400k
>>> 40:31.4    70.676    06:30.8    0    400k
>>> 41:29.4    108.009    06:30.8    0    400k
>>> 42:27.4    65.216    06:30.8    0    400k
>>
>> Why there're difference in test time? Could you summarize them like:
>>
>> Test setup: e.g testpmd or pktgen to generate packets to guest
>>
>> avg packets drop before: XXX
>>
>> avg packets drop after: YYY(-ZZZ%)
>>
>> Thanks
>>
>>
>>>
>>> Data to prove why the patch helps:
>>>
>>> ----------------------------------------------------
>>>
>>> We did have completed several rounds of test with setting the value to
>>> budget (64 as the default value). It does improve a lot with pps is
>>> below 400pps for a single stream. We are confident that it runs out
>>> of free
>>> buffer in avail ring when packet dropping happens with below systemtap:
>>>
>>> Just a snippet:
>>>
>>> probe module("virtio_ring").function("virtqueue_get_buf")
>>> {
>>>         x = (@cast($_vq, "vring_virtqueue")->vring->used->idx)-
>>> (@cast($_vq, "vring_virtqueue")->last_used_idx) ---> we use this one
>>> to verify if the queue is full, which means guest is not able to take
>>> buffer from the queue timely
>>>
>>>         if (x<0 && (x+65535)<4096)
>>>             x = x+65535
>>>
>>>         if((x==1024) && @cast($_vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             netrxcount[x] <<< gettimeofday_s()
>>> }
>>>
>>>
>>> probe module("virtio_ring").function("virtqueue_add_inbuf")
>>> {
>>>         y = (@cast($vq, "vring_virtqueue")->vring->avail->idx)-
>>> (@cast($vq, "vring_virtqueue")->vring->used->idx) ---> we use this one
>>> to verify if we run out of free buffer in avail ring
>>>         if (y<0 && (y+65535)<4096)
>>>             y = y+65535
>>>
>>>         if(@2=="debugon")
>>>         {
>>>             if(y==0 && @cast($vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             {
>>>                 netrxfreecount[y] <<< gettimeofday_s()
>>>
>>>                 printf("no avail ring left seen, printing most recent 5
>>> num free, vq: %lx, current index: %d\n", $vq, recentfreecount)
>>>                 for(i=recentfreecount; i!=((recentfreecount+4) % 5);
>>> i=((i+1) % 5))
>>>                 {
>>>                     printf("index: %d, num free: %d\n", i,
>>> recentfree[$vq,
>>> i])
>>>                 }
>>>
>>>                 printf("index: %d, num free: %d\n", i, recentfree[$vq,
>>> i])
>>>                 //exit()
>>>             }
>>>         }
>>> }
>>>
>>>
>>> probe
>>> module("virtio_net").statement("virtnet_receive@drivers/net/virtio_net.c:732")
>>>
>>>
>>> {
>>>         recentfreecount++
>>>         recentfreecount = recentfreecount % 5
>>>         recentfree[$rq->vq, recentfreecount] = $rq->vq->num_free --->
>>> record the num_free for the last 5 calls to virtnet_receive, so we can
>>> see if lowering the bar helps.
>>> }
>>>
>>>
>>> Here is the result:
>>>
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 561
>>> index: 2, num free: 305
>>> index: 3, num free: 369
>>> index: 4, num free: 433
>>> index: 0, num free: 497
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 543
>>> index: 2, num free: 463
>>> index: 3, num free: 469
>>> index: 4, num free: 476
>>> index: 0, num free: 479
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 2
>>> index: 2, num free: 555
>>> index: 3, num free: 414
>>> index: 4, num free: 420
>>> index: 0, num free: 427
>>> index: 1, num free: 491
>>>
>>> We can see in the last 4 calls to virtnet_receive before we run out
>>> of free buffer and start to relaim, num_free is quite high. So if we
>>> can do the reclaim earlier, it will certainly help.
>>>
>>> Jiang
>>
>> Right, but I think there's no need to put those thing in the commit log.
>>
>> Thanks
>>
>>
> Sure, here is the info:
>
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k)


Please also note that type of packets e.g TCP or UDP.

Thanks


>
> avg packets drop before: 2842
>
> avg packets drop after: 360(-87.3%)
>
>
> Just let me know if it looks good enough. Thx.
>
> Jiang
>

^ permalink raw reply

* [PATCH bpf-next v5 0/2] net: xdp: XSKMAP improvements
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bjorn.topel, bruce.richardson, songliubraving, bpf

This series (v5 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.

1. Automatic cleanup when an AF_XDP socket goes out of scope/is
   released. Instead of require that the user manually clears the
   "released" state socket from the map, this is done
   automatically. Each socket tracks which maps it resides in, and
   remove itself from those maps at relase. A notable implementation
   change, is that the sockets references the map, instead of the map
   referencing the sockets. Which implies that when the XSKMAP is
   freed, it is by definition cleared of sockets.

2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
   which this patch addresses.


Thanks,
Björn

v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
        Socket release/map update race. (Daniel)
v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel)
        Removed redundant assignment in xsk_map_update_elem().
        Variable name consistency; Use map_entry everywhere.

[1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385

Björn Töpel (2):
  xsk: remove AF_XDP socket from map when the socket is released
  xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP

 include/net/xdp_sock.h |  18 ++++++
 kernel/bpf/xskmap.c    | 133 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  50 ++++++++++++++++
 3 files changed, 179 insertions(+), 22 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH bpf-next v5 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bruce.richardson, songliubraving, bpf
In-Reply-To: <20190815093014.31174-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  18 ++++++
 kernel/bpf/xskmap.c    | 125 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  50 +++++++++++++++++
 3 files changed, 173 insertions(+), 20 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..066e3ae446a8 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -50,6 +50,16 @@ struct xdp_umem {
 	struct list_head xsk_list;
 };
 
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket reside in.
+ */
+struct xsk_map;
+struct xsk_map_node {
+	struct list_head node;
+	struct xsk_map *map;
+	struct xdp_sock **map_entry;
+};
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -75,6 +85,9 @@ struct xdp_sock {
 	/* Protects generic receive. */
 	spinlock_t rx_lock;
 	u64 rx_dropped;
+	struct list_head map_list;
+	/* Protects map_list */
+	spinlock_t map_list_lock;
 };
 
 struct xdp_buff;
@@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
 void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9bb96ace9fa1..16031d489173 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -13,8 +13,71 @@ struct xsk_map {
 	struct bpf_map map;
 	struct xdp_sock **xsk_map;
 	struct list_head __percpu *flush_list;
+	spinlock_t lock; /* Synchronize map updates */
 };
 
+int xsk_map_inc(struct xsk_map *map)
+{
+	struct bpf_map *m = &map->map;
+
+	m = bpf_map_inc(m, false);
+	return IS_ERR(m) ? PTR_ERR(m) : 0;
+}
+
+void xsk_map_put(struct xsk_map *map)
+{
+	bpf_map_put(&map->map);
+}
+
+static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
+					       struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *node;
+	int err;
+
+	node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+	if (!node)
+		return NULL;
+
+	err = xsk_map_inc(map);
+	if (err) {
+		kfree(node);
+		return ERR_PTR(err);
+	}
+
+	node->map = map;
+	node->map_entry = map_entry;
+	return node;
+}
+
+static void xsk_map_node_free(struct xsk_map_node *node)
+{
+	xsk_map_put(node->map);
+	kfree(node);
+}
+
+static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
+{
+	spin_lock_bh(&xs->map_list_lock);
+	list_add_tail(&node->node, &xs->map_list);
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
+static void xsk_map_sock_delete(struct xdp_sock *xs,
+				struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *n, *tmp;
+
+	spin_lock_bh(&xs->map_list_lock);
+	list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
+		if (map_entry == n->map_entry) {
+			list_del(&n->node);
+			xsk_map_node_free(n);
+		}
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
 static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 {
 	struct xsk_map *m;
@@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	bpf_map_init_from_attr(&m->map, attr);
+	spin_lock_init(&m->lock);
 
 	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
 	cost += sizeof(struct list_head) * num_possible_cpus();
@@ -71,21 +135,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 static void xsk_map_free(struct bpf_map *map)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	int i;
 
 	bpf_clear_redirect_map(map);
 	synchronize_net();
-
-	for (i = 0; i < map->max_entries; i++) {
-		struct xdp_sock *xs;
-
-		xs = m->xsk_map[i];
-		if (!xs)
-			continue;
-
-		sock_put((struct sock *)xs);
-	}
-
 	free_percpu(m->flush_list);
 	bpf_map_area_free(m->xsk_map);
 	kfree(m);
@@ -164,8 +216,9 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 			       u64 map_flags)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *xs, *old_xs, **map_entry;
 	u32 i = *(u32 *)key, fd = *(u32 *)value;
-	struct xdp_sock *xs, *old_xs;
+	struct xsk_map_node *node;
 	struct socket *sock;
 	int err;
 
@@ -192,32 +245,64 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EOPNOTSUPP;
 	}
 
-	sock_hold(sock->sk);
+	map_entry = &m->xsk_map[i];
+	node = xsk_map_node_alloc(m, map_entry);
+	if (IS_ERR(node)) {
+		sockfd_put(sock);
+		return PTR_ERR(node);
+	}
 
-	old_xs = xchg(&m->xsk_map[i], xs);
+	spin_lock_bh(&m->lock);
+	old_xs = READ_ONCE(*map_entry);
+	if (old_xs == xs) {
+		err = 0;
+		goto out;
+	}
+	xsk_map_sock_add(xs, node);
+	WRITE_ONCE(*map_entry, xs);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
-
+		xsk_map_sock_delete(old_xs, map_entry);
+	spin_unlock_bh(&m->lock);
 	sockfd_put(sock);
 	return 0;
+
+out:
+	spin_unlock_bh(&m->lock);
+	sockfd_put(sock);
+	xsk_map_node_free(node);
+	return err;
 }
 
 static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct xdp_sock *old_xs;
+	struct xdp_sock *old_xs, **map_entry;
 	int k = *(u32 *)key;
 
 	if (k >= map->max_entries)
 		return -EINVAL;
 
-	old_xs = xchg(&m->xsk_map[k], NULL);
+	spin_lock_bh(&m->lock);
+	map_entry = &m->xsk_map[k];
+	old_xs = xchg(map_entry, NULL);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
+		xsk_map_sock_delete(old_xs, map_entry);
+	spin_unlock_bh(&m->lock);
 
 	return 0;
 }
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry)
+{
+	spin_lock_bh(&map->lock);
+	if (READ_ONCE(*map_entry) == xs) {
+		WRITE_ONCE(*map_entry, NULL);
+		xsk_map_sock_delete(xs, map_entry);
+	}
+	spin_unlock_bh(&map->lock);
+}
+
 const struct bpf_map_ops xsk_map_ops = {
 	.map_alloc = xsk_map_alloc,
 	.map_free = xsk_map_free,
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..c3d027aa693d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -362,6 +362,52 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 	dev_put(dev);
 }
 
+static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
+					      struct xdp_sock ***map_entry)
+{
+	struct xsk_map *map = NULL;
+	struct xsk_map_node *node;
+
+	*map_entry = NULL;
+
+	spin_lock_bh(&xs->map_list_lock);
+	node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
+					node);
+	if (node) {
+		WARN_ON(xsk_map_inc(node->map));
+		map = node->map;
+		*map_entry = node->map_entry;
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+	return map;
+}
+
+static void xsk_delete_from_maps(struct xdp_sock *xs)
+{
+	/* This function removes the current XDP socket from all the
+	 * maps it resides in. We need to take extra care here, due to
+	 * the two locks involved. Each map has a lock synchronizing
+	 * updates to the entries, and each socket has a lock that
+	 * synchronizes access to the list of maps (map_list). For
+	 * deadlock avoidance the locks need to be taken in the order
+	 * "map lock"->"socket map list lock". We start off by
+	 * accessing the socket map list, and take a reference to the
+	 * map to guarantee existence between the
+	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete()
+	 * calls. Then we ask the map to remove the socket, which
+	 * tries to remove the socket from the map. Note that there
+	 * might be updates to the map between
+	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
+	 */
+	struct xdp_sock **map_entry = NULL;
+	struct xsk_map *map;
+
+	while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
+		xsk_map_try_sock_delete(map, xs, map_entry);
+		xsk_map_put(map);
+	}
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -381,6 +427,7 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	xsk_delete_from_maps(xs);
 	xsk_unbind_dev(xs);
 
 	xskq_destroy(xs->rx);
@@ -855,6 +902,9 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	spin_lock_init(&xs->rx_lock);
 	spin_lock_init(&xs->tx_completion_lock);
 
+	INIT_LIST_HEAD(&xs->map_list);
+	spin_lock_init(&xs->map_list_lock);
+
 	mutex_lock(&net->xdp.lock);
 	sk_add_node_rcu(sk, &net->xdp.list);
 	mutex_unlock(&net->xdp.lock);
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next v5 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bruce.richardson, songliubraving, bpf
In-Reply-To: <20190815093014.31174-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 kernel/bpf/xskmap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 16031d489173..4cc28e226398 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -226,8 +226,6 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EINVAL;
 	if (unlikely(i >= m->map.max_entries))
 		return -E2BIG;
-	if (unlikely(map_flags == BPF_NOEXIST))
-		return -EEXIST;
 
 	sock = sockfd_lookup(fd, &err);
 	if (!sock)
@@ -257,6 +255,12 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (old_xs == xs) {
 		err = 0;
 		goto out;
+	} else if (old_xs && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto out;
+	} else if (!old_xs && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto out;
 	}
 	xsk_map_sock_add(xs, node);
 	WRITE_ONCE(*map_entry, xs);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] tun: fix use-after-free when register netdev failed
From: Eric Dumazet @ 2019-08-15  9:35 UTC (permalink / raw)
  To: Yang Yingliang, netdev; +Cc: jasowang, xiyou.wangcong, davem
In-Reply-To: <1565857122-24660-1-git-send-email-yangyingliang@huawei.com>



On 8/15/19 10:18 AM, Yang Yingliang wrote:
> I got a UAF repport in tun driver when doing fuzzy test:
> 
>
> [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
> [  466.371582] flags: 0x2fffff80010200(slab|head)
> [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
> [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [  466.377778] page dumped because: kasan: bad access detected
> [  466.379730]
> [  466.380288] Memory state around the buggy address:
> [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.388257]                                                  ^
> [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.394667] ==================================================================
> 
> tun_chr_read_iter() accessed the memory which freed by free_netdev()
> called by tun_set_iff():
> 
> 	CPUA				CPUB
>     tun_set_iff()
>       alloc_netdev_mqs()
>       tun_attach()
> 				    tun_chr_read_iter()
> 				      tun_get()
>       register_netdevice()
>       tun_detach_all()
>         synchronize_net()
> 				      tun_do_read()
> 				        tun_ring_recv()
> 				          schedule()
>       free_netdev()
> 				      tun_put() <-- UAF

UAF on what exactly ? The dev_hold() should prevent the free_netdev().

> 
> Set a new bit in tun->flag if register_netdevice() successed,
> without this bit, tun_get() returns NULL to avoid using a
> freed tun pointer.
> 
> Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>  drivers/net/tun.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index db16d7a13e00..cbd60c276c40 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -115,6 +115,7 @@ do {								\
>  /* High bits in flags field are unused. */
>  #define TUN_VNET_LE     0x80000000
>  #define TUN_VNET_BE     0x40000000
> +#define TUN_DEV_REGISTERED	0x20000000
>  
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
>  		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> @@ -719,8 +720,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>  			netif_carrier_off(tun->dev);
>  
>  			if (!(tun->flags & IFF_PERSIST) &&
> -			    tun->dev->reg_state == NETREG_REGISTERED)
> +			    tun->dev->reg_state == NETREG_REGISTERED) {
>  				unregister_netdevice(tun->dev);
> +				tun->flags &= ~TUN_DEV_REGISTERED;

Isn't this done too late ?

> +			}
>  		}
>  		if (tun)
>  			xdp_rxq_info_unreg(&tfile->xdp_rxq);
> @@ -884,8 +887,10 @@ static struct tun_struct *tun_get(struct tun_file *tfile)
>  
>  	rcu_read_lock();
>  	tun = rcu_dereference(tfile->tun);
> -	if (tun)
> +	if (tun && (tun->flags & TUN_DEV_REGISTERED))
>  		dev_hold(tun->dev);
> +	else
> +		tun = NULL;
>  	rcu_read_unlock();
>  
>  	return tun;
> @@ -2836,6 +2841,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>  		err = register_netdevice(tun->dev);
>  		if (err < 0)
>  			goto err_detach;
> +		tun->flags |= TUN_DEV_REGISTERED;
>  	}
>  
>  	netif_carrier_on(tun->dev);
> 


So tun_get() will return NULL as long as  tun_set_iff() (TUNSETIFF ioctl()) has not yet been called ?

This could break some applications, since tun_get() is used from poll() and other syscalls.


^ permalink raw reply

* [PATCH net-next] net/rds: Add RDS6_INFO_SOCKETS and RDS6_INFO_RECV_MESSAGES options
From: Ka-Cheong Poon @ 2019-08-15  9:36 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel

Add support of the socket options RDS6_INFO_SOCKETS and
RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and
RDS_INFO_RECV_MESSAGES options respectively.  The old options work
for IPv4 sockets only.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
---
 net/rds/af_rds.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2b969f9..e7b082a 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2019 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -741,6 +741,10 @@ static void rds_sock_inc_info(struct socket *sock, unsigned int len,
 	spin_lock_bh(&rds_sock_lock);
 
 	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		/* This option only supports IPv4 sockets. */
+		if (!ipv6_addr_v4mapped(&rs->rs_bound_addr))
+			continue;
+
 		read_lock(&rs->rs_recv_lock);
 
 		/* XXX too lazy to maintain counts.. */
@@ -762,21 +766,60 @@ static void rds_sock_inc_info(struct socket *sock, unsigned int len,
 	lens->each = sizeof(struct rds_info_message);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static void rds6_sock_inc_info(struct socket *sock, unsigned int len,
+			       struct rds_info_iterator *iter,
+			       struct rds_info_lengths *lens)
+{
+	struct rds_incoming *inc;
+	unsigned int total = 0;
+	struct rds_sock *rs;
+
+	len /= sizeof(struct rds6_info_message);
+
+	spin_lock_bh(&rds_sock_lock);
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		read_lock(&rs->rs_recv_lock);
+
+		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
+			total++;
+			if (total <= len)
+				rds6_inc_info_copy(inc, iter, &inc->i_saddr,
+						   &rs->rs_bound_addr, 1);
+		}
+
+		read_unlock(&rs->rs_recv_lock);
+	}
+
+	spin_unlock_bh(&rds_sock_lock);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds6_info_message);
+}
+#endif
+
 static void rds_sock_info(struct socket *sock, unsigned int len,
 			  struct rds_info_iterator *iter,
 			  struct rds_info_lengths *lens)
 {
 	struct rds_info_socket sinfo;
+	unsigned int cnt = 0;
 	struct rds_sock *rs;
 
 	len /= sizeof(struct rds_info_socket);
 
 	spin_lock_bh(&rds_sock_lock);
 
-	if (len < rds_sock_count)
+	if (len < rds_sock_count) {
+		cnt = rds_sock_count;
 		goto out;
+	}
 
 	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		/* This option only supports IPv4 sockets. */
+		if (!ipv6_addr_v4mapped(&rs->rs_bound_addr))
+			continue;
 		sinfo.sndbuf = rds_sk_sndbuf(rs);
 		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
 		sinfo.bound_addr = rs->rs_bound_addr_v4;
@@ -786,15 +829,51 @@ static void rds_sock_info(struct socket *sock, unsigned int len,
 		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
 
 		rds_info_copy(iter, &sinfo, sizeof(sinfo));
+		cnt++;
 	}
 
 out:
-	lens->nr = rds_sock_count;
+	lens->nr = cnt;
 	lens->each = sizeof(struct rds_info_socket);
 
 	spin_unlock_bh(&rds_sock_lock);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static void rds6_sock_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	struct rds6_info_socket sinfo6;
+	struct rds_sock *rs;
+
+	len /= sizeof(struct rds6_info_socket);
+
+	spin_lock_bh(&rds_sock_lock);
+
+	if (len < rds_sock_count)
+		goto out;
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sinfo6.sndbuf = rds_sk_sndbuf(rs);
+		sinfo6.rcvbuf = rds_sk_rcvbuf(rs);
+		sinfo6.bound_addr = rs->rs_bound_addr;
+		sinfo6.connected_addr = rs->rs_conn_addr;
+		sinfo6.bound_port = rs->rs_bound_port;
+		sinfo6.connected_port = rs->rs_conn_port;
+		sinfo6.inum = sock_i_ino(rds_rs_to_sk(rs));
+
+		rds_info_copy(iter, &sinfo6, sizeof(sinfo6));
+	}
+
+ out:
+	lens->nr = rds_sock_count;
+	lens->each = sizeof(struct rds6_info_socket);
+
+	spin_unlock_bh(&rds_sock_lock);
+}
+#endif
+
 static void rds_exit(void)
 {
 	sock_unregister(rds_family_ops.family);
@@ -808,6 +887,10 @@ static void rds_exit(void)
 	rds_bind_lock_destroy();
 	rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
 	rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+#if IS_ENABLED(CONFIG_IPV6)
+	rds_info_deregister_func(RDS6_INFO_SOCKETS, rds6_sock_info);
+	rds_info_deregister_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info);
+#endif
 }
 module_exit(rds_exit);
 
@@ -845,6 +928,10 @@ static int rds_init(void)
 
 	rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
 	rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+#if IS_ENABLED(CONFIG_IPV6)
+	rds_info_register_func(RDS6_INFO_SOCKETS, rds6_sock_info);
+	rds_info_register_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info);
+#endif
 
 	goto out;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH 08/16] mips: prefer __section from compiler_attributes.h
From: Paul Burton @ 2019-08-15  9:38 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm@linux-foundation.org, sedat.dilek@gmail.com,
	jpoimboe@redhat.com, yhs@fb.com, miguel.ojeda.sandonis@gmail.com,
	clang-built-linux@googlegroups.com, Ralf Baechle, James Hogan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	linux-mips@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org
In-Reply-To: <20190812215052.71840-8-ndesaulniers@google.com>

Hi Nick,

On Mon, Aug 12, 2019 at 02:50:41PM -0700, Nick Desaulniers wrote:
> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

It would be good to add a commit message, even if it's just a line
repeating the subject & preferably describing the motivation.

> ---
>  arch/mips/include/asm/cache.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/mips/include/asm/cache.h b/arch/mips/include/asm/cache.h
> index 8b14c2706aa5..af2d943580ee 100644
> --- a/arch/mips/include/asm/cache.h
> +++ b/arch/mips/include/asm/cache.h
> @@ -14,6 +14,6 @@
>  #define L1_CACHE_SHIFT		CONFIG_MIPS_L1_CACHE_SHIFT
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>  
> -#define __read_mostly __attribute__((__section__(".data..read_mostly")))
> +#define __read_mostly __section(.data..read_mostly)
>  
>  #endif /* _ASM_CACHE_H */
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog

I'm not copied on the rest of the series so I'm not sure what your
expectations are about where this should be applied. Let me know if
you'd prefer this to go through mips-next, otherwise:

    Acked-by: Paul Burton <paul.burton@mips.com>

Thanks,
    Paul

^ permalink raw reply

* [PATCH net-next 0/2] net: phy: realtek: map vendor-specific EEE registers to standard MMD registers
From: Heiner Kallweit @ 2019-08-15  9:45 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

EEE-related registers on newer integrated PHY's have the standard
layout, but are accessible not via MMD but via vendor-specific
registers. Emulating the standard MMD registers allows to use the
generic functions for EEE control and to significantly simplify
the r8169 driver.

Heiner Kallweit (2):
  net: phy: realtek: add support for EEE registers on integrated PHY's
  r8169: use the generic EEE management functions

 drivers/net/ethernet/realtek/r8169_main.c | 172 +++-------------------
 drivers/net/phy/realtek.c                 |  43 ++++++
 2 files changed, 67 insertions(+), 148 deletions(-)

-- 
2.22.1


^ permalink raw reply

* [PATCH net-next 1/2] net: phy: realtek: add support for EEE registers on integrated PHY's
From: Heiner Kallweit @ 2019-08-15  9:46 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4a6878bf-344e-2df5-df00-b80c7c0982d1@gmail.com>

EEE-related registers on newer integrated PHY's have the standard
layout, but are accessible not via MMD but via vendor-specific
registers. Emulating the standard MMD registers allows to use the
generic functions for EEE control.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/realtek.c | 43 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index c49a1fb13..2635ad1ff 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -266,6 +266,45 @@ static int rtl8366rb_config_init(struct phy_device *phydev)
 	return ret;
 }
 
+static int rtlgen_read_mmd(struct phy_device *phydev, int devnum, u16 regnum)
+{
+	int ret;
+
+	if (devnum == MDIO_MMD_PCS && regnum == MDIO_PCS_EEE_ABLE) {
+		rtl821x_write_page(phydev, 0xa5c);
+		ret = __phy_read(phydev, 0x12);
+		rtl821x_write_page(phydev, 0);
+	} else if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_ADV) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_read(phydev, 0x10);
+		rtl821x_write_page(phydev, 0);
+	} else if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_LPABLE) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_read(phydev, 0x11);
+		rtl821x_write_page(phydev, 0);
+	} else {
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static int rtlgen_write_mmd(struct phy_device *phydev, int devnum, u16 regnum,
+			    u16 val)
+{
+	int ret;
+
+	if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_ADV) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_write(phydev, 0x10, val);
+		rtl821x_write_page(phydev, 0);
+	} else {
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
 static int rtl8125_get_features(struct phy_device *phydev)
 {
 	int val;
@@ -422,6 +461,8 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= genphy_resume,
 		.read_page	= rtl821x_read_page,
 		.write_page	= rtl821x_write_page,
+		.read_mmd	= rtlgen_read_mmd,
+		.write_mmd	= rtlgen_write_mmd,
 	}, {
 		.name		= "RTL8125 2.5Gbps internal",
 		.match_phy_device = rtl8125_match_phy_device,
@@ -432,6 +473,8 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= genphy_resume,
 		.read_page	= rtl821x_read_page,
 		.write_page	= rtl821x_write_page,
+		.read_mmd	= rtlgen_read_mmd,
+		.write_mmd	= rtlgen_write_mmd,
 	}, {
 		PHY_ID_MATCH_EXACT(0x001cc961),
 		.name		= "RTL8366RB Gigabit Ethernet",
-- 
2.22.1



^ permalink raw reply related

* [PATCH net-next 2/2] r8169: use the generic EEE management functions
From: Heiner Kallweit @ 2019-08-15  9:47 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4a6878bf-344e-2df5-df00-b80c7c0982d1@gmail.com>

Now that the Realtek PHY driver maps the vendor-specific EEE registers
to the standard MMD registers, we can remove all special handling and
use the generic functions phy_ethtool_get/set_eee.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 172 +++-------------------
 1 file changed, 24 insertions(+), 148 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7dd75c047..bd9077f85 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -758,6 +758,13 @@ static bool rtl_is_8168evl_up(struct rtl8169_private *tp)
 	       tp->mac_version <= RTL_GIGA_MAC_VER_51;
 }
 
+static bool rtl_supports_eee(struct rtl8169_private *tp)
+{
+	return tp->mac_version >= RTL_GIGA_MAC_VER_34 &&
+	       tp->mac_version != RTL_GIGA_MAC_VER_37 &&
+	       tp->mac_version != RTL_GIGA_MAC_VER_39;
+}
+
 static void rtl_read_mac_from_reg(struct rtl8169_private *tp, u8 *mac, int reg)
 {
 	int i;
@@ -2014,144 +2021,40 @@ static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
 	return 0;
 }
 
-static int rtl_get_eee_supp(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5c, 0x12);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_get_eee_lpadv(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5d, 0x11);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_get_eee_adv(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5d, 0x10);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_set_eee_adv(struct rtl8169_private *tp, int val)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret = 0;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV, val);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		phy_write_paged(phydev, 0x0a5d, 0x10, val);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
 static int rtl8169_get_eee(struct net_device *dev, struct ethtool_eee *data)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
 	int ret;
 
+	if (!rtl_supports_eee(tp))
+		return -EOPNOTSUPP;
+
 	pm_runtime_get_noresume(d);
 
 	if (!pm_runtime_active(d)) {
 		ret = -EOPNOTSUPP;
-		goto out;
+	} else {
+		ret = phy_ethtool_get_eee(tp->phydev, data);
 	}
 
-	/* Get Supported EEE */
-	ret = rtl_get_eee_supp(tp);
-	if (ret < 0)
-		goto out;
-	data->supported = mmd_eee_cap_to_ethtool_sup_t(ret);
-
-	/* Get advertisement EEE */
-	ret = rtl_get_eee_adv(tp);
-	if (ret < 0)
-		goto out;
-	data->advertised = mmd_eee_adv_to_ethtool_adv_t(ret);
-	data->eee_enabled = !!data->advertised;
-
-	/* Get LP advertisement EEE */
-	ret = rtl_get_eee_lpadv(tp);
-	if (ret < 0)
-		goto out;
-	data->lp_advertised = mmd_eee_adv_to_ethtool_adv_t(ret);
-	data->eee_active = !!(data->advertised & data->lp_advertised);
-out:
 	pm_runtime_put_noidle(d);
-	return ret < 0 ? ret : 0;
+
+	return ret;
 }
 
 static int rtl8169_set_eee(struct net_device *dev, struct ethtool_eee *data)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
-	int old_adv, adv = 0, cap, ret;
+	int ret;
+
+	if (!rtl_supports_eee(tp))
+		return -EOPNOTSUPP;
 
 	pm_runtime_get_noresume(d);
 
-	if (!dev->phydev || !pm_runtime_active(d)) {
+	if (!pm_runtime_active(d)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -2162,38 +2065,10 @@ static int rtl8169_set_eee(struct net_device *dev, struct ethtool_eee *data)
 		goto out;
 	}
 
-	/* Get Supported EEE */
-	ret = rtl_get_eee_supp(tp);
-	if (ret < 0)
-		goto out;
-	cap = ret;
-
-	ret = rtl_get_eee_adv(tp);
-	if (ret < 0)
-		goto out;
-	old_adv = ret;
-
-	if (data->eee_enabled) {
-		adv = !data->advertised ? cap :
-		      ethtool_adv_to_mmd_eee_adv_t(data->advertised) & cap;
-		/* Mask prohibited EEE modes */
-		adv &= ~dev->phydev->eee_broken_modes;
-	}
-
-	if (old_adv != adv) {
-		ret = rtl_set_eee_adv(tp, adv);
-		if (ret < 0)
-			goto out;
-
-		/* Restart autonegotiation so the new modes get sent to the
-		 * link partner.
-		 */
-		ret = phy_restart_aneg(dev->phydev);
-	}
-
+	ret = phy_ethtool_set_eee(tp->phydev, data);
 out:
 	pm_runtime_put_noidle(d);
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static const struct ethtool_ops rtl8169_ethtool_ops = {
@@ -2220,10 +2095,11 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
 
 static void rtl_enable_eee(struct rtl8169_private *tp)
 {
-	int supported = rtl_get_eee_supp(tp);
+	struct phy_device *phydev = tp->phydev;
+	int supported = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
 
 	if (supported > 0)
-		rtl_set_eee_adv(tp, supported);
+		phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV, supported);
 }
 
 static void rtl8169_get_mac_version(struct rtl8169_private *tp)
-- 
2.22.1



^ permalink raw reply related

* Re: [PATCH v2] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:55 UTC (permalink / raw)
  To: ? jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB32058F4B2AD162F5421BB9B4A6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午5:42, ? jiang wrote:
> This change lowers ring buffer reclaim threshold from 1/2*queue to budget
> for better performance. According to our test with qemu + dpdk, packet
> dropping happens when the guest is not able to provide free buffer in
> avail ring timely with default 1/2*queue. The value in the patch has been
> tested and does show better performance.
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k, UDP)
> avg packets drop before: 2842
> avg packets drop after: 360(-87.3%)
>
> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
> ---
>   drivers/net/virtio_net.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0d4115c9e20b..bc08be7925eb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   		}
>   	}
>   
> -	if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
> +	if (rq->vq->num_free > min((unsigned int)budget, virtqueue_get_vring_size(rq->vq)) / 2) {
>   		if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>   			schedule_delayed_work(&vi->refill, 0);
>   	}


Acked-by: Jason Wang <jasowang@redhat.com>



^ permalink raw reply

* Re: [PATCH] net: hns: hns_enet: Add of_node_put in hns_nic_dev_probe()
From: Yonglong Liu @ 2019-08-15  9:59 UTC (permalink / raw)
  To: Nishka Dasgupta, yisen.zhuang, salil.mehta, davem, netdev
In-Reply-To: <20190815062837.6015-1-nishkadg.linux@gmail.com>



On 2019/8/15 14:28, Nishka Dasgupta wrote:
> The local variable ae_node in function hns_nic_dev_probe takes the
> return value of of_parse_phandle, which gets a node but does not put it.
> This may cause a memory leak. Hence put ae_node after the last time it
> is invoked.
> Issue found with Coccinelle.
> 
> Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> index 2235dd55fab2..b26e84929e1e 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> @@ -2309,6 +2309,7 @@ static int hns_nic_dev_probe(struct platform_device *pdev)
>  			goto out_read_prop_fail;
>  		}
>  		priv->fwnode = &ae_node->fwnode;
> +		of_node_put(ae_node);
>  	} else if (is_acpi_node(dev->fwnode)) {
>  		struct fwnode_reference_args args;
>  
> 

Hi, Nishka:
This patch is wrong, we put the node in hns_nic_dev_remove().

The following patch had solved this problem:
263c6d75f9a5 (net: hns: Fix for missing of_node_put() after of_parse_phandle())


^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-15 10:26 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190814170715.GJ2820@mini-arch>

On 2019/08/15 2:07, Stanislav Fomichev wrote:
> On 08/13, Toshiaki Makita wrote:
>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically but a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we generate
>> and load an eBPF program on each insertion of a flow, the latency of the
>> first packet of ping in above test will incease, which I want to avoid.
> Can this be instead implemented with a new hook that will be called
> for TC events? This hook can write to perf event buffer and control
> plane will insert/remove/modify flow tables in the BPF maps (contol
> plane will also install xdp program).
> 
> Why do we need UMH? What am I missing?

So you suggest doing everything in xdp_flow kmod?
I also thought about that. There are two phases so let's think about them separately.

1) TC block (qdisc) creation / eBPF load

I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
done from userland, not from kernel, to run the verifier for safety.
However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
allow such programs to be loaded from kernel? I currently don't have the will
to make such an API as loading can be done with current UMH mechanism.

2) flow insertion / eBPF map update

Not sure if this needs to be done from userland. One concern is that eBPF maps can
be modified by unrelated processes and we need to handle all unexpected state of maps.
Such handling tends to be difficult and may cause unexpected kernel behavior.
OTOH updating maps from kmod may reduces the latency of flow insertion drastically.

Alexei, Daniel, what do you think?

Toshiaki Makita

^ permalink raw reply

* Re: [PATCH] net: usbnet: fix a memory leak bug
From: Oliver Neukum @ 2019-08-15 10:42 UTC (permalink / raw)
  To: Wenwen Wang
  Cc: David S. Miller, open list, open list:USB NETWORKING DRIVERS,
	open list:USB USBNET DRIVER FRAMEWORK
In-Reply-To: <1565804493-7758-1-git-send-email-wenwen@cs.uga.edu>

Am Mittwoch, den 14.08.2019, 12:41 -0500 schrieb Wenwen Wang:
> In usbnet_start_xmit(), 'urb->sg' is allocated through kmalloc_array() by
> invoking build_dma_sg(). Later on, if 'CONFIG_PM' is defined and the if
> branch is taken, the execution will go to the label 'deferred'. However,
> 'urb->sg' is not deallocated on this execution path, leading to a memory
> leak bug.

Just to make this clear:

> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
NACK

For the reason Jack explained. Deferral is not a failure.

	Regards
		Oliver


^ permalink raw reply

* INFO: task hung in tls_sw_release_resources_tx
From: syzbot @ 2019-08-15 10:54 UTC (permalink / raw)
  To: ast, aviadye, borisp, bpf, daniel, davejwatson, davem,
	jakub.kicinski, john.fastabend, kafai, linux-kernel, netdev,
	songliubraving, syzkaller-bugs, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    6d5afe20 sctp: fix memleak in sctp_send_reset_streams
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=16e5536a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=a4c9e9f08e9e8960
dashboard link: https://syzkaller.appspot.com/bug?extid=6a9ff159672dfbb41c95
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17cb0502600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14d5dc22600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+6a9ff159672dfbb41c95@syzkaller.appspotmail.com

INFO: task syz-executor153:10198 blocked for more than 143 seconds.
       Not tainted 5.3.0-rc3+ #162
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor153 D27672 10198  10179 0x80000002
Call Trace:
  context_switch kernel/sched/core.c:3254 [inline]
  __schedule+0x755/0x1580 kernel/sched/core.c:3880
  schedule+0xa8/0x270 kernel/sched/core.c:3944
  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1783
  do_wait_for_common kernel/sched/completion.c:83 [inline]
  __wait_for_common kernel/sched/completion.c:104 [inline]
  wait_for_common kernel/sched/completion.c:115 [inline]
  wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
  crypto_wait_req include/linux/crypto.h:685 [inline]
  crypto_wait_req include/linux/crypto.h:680 [inline]
  tls_sw_release_resources_tx+0x4ee/0x6b0 net/tls/tls_sw.c:2075
  tls_sk_proto_cleanup net/tls/tls_main.c:275 [inline]
  tls_sk_proto_close+0x686/0x970 net/tls/tls_main.c:305
  inet_release+0xed/0x200 net/ipv4/af_inet.c:427
  inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x92f/0x2e50 kernel/exit.c:879
  do_group_exit+0x135/0x360 kernel/exit.c:983
  __do_sys_exit_group kernel/exit.c:994 [inline]
  __se_sys_exit_group kernel/exit.c:992 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:992
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43ff88
Code: 00 00 be 3c 00 00 00 eb 19 66 0f 1f 84 00 00 00 00 00 48 89 d7 89 f0  
0f 05 48 3d 00 f0 ff ff 77 21 f4 48 89 d7 44 89 c0 0f 05 <48> 3d 00 f0 ff  
ff 76 e0 f7 d8 64 41 89 01 eb d8 0f 1f 84 00 00 00
RSP: 002b:00007ffd1c2d0f78 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043ff88
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf890 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
INFO: lockdep is turned off.
NMI backtrace for cpu 0
CPU: 0 PID: 1057 Comm: khungtaskd Not tainted 5.3.0-rc3+ #162
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline]
  watchdog+0x9d0/0xef0 kernel/hung_task.c:289
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0xe/0x10  
arch/x86/include/asm/irqflags.h:60


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-15 10:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <f6160572-8fa8-0199-8d81-6159dd4cd5ff@gmail.com>

On 2019/08/14 16:33, Toshiaki Makita wrote:
>>>    bpf, hashtab: Compare keys in long
>>
>> 3Mpps vs 4Mpps just from this patch ?
>> or combined with i40 prefech patch ?
> 
> Combined.
> 
>>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>>
>> Could you share "perf report" for just hash tab optimization
>> and for i40 ?
> 
> Sure, I'll get some more data and post them.

Here are perf report and performance numbers.
This time for some reason the performance is better than before.
Something in my env may have changed but could not identify that.

I cut and paste top 10 functions from perf report with drop rate for each case.
perf report is run with --no-child option, so does not include child functions load.
It looks like the hottest function is always xdp_flow BPF program for XDP,
but the shown function name is some meaningless one, like __this_module+0x800000007446.

- No prefetch, no long-compare

   3.3 Mpps

     25.22%  ksoftirqd/4      [kernel.kallsyms]             [k] __this_module+0x800000007446
     21.64%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     14.93%  ksoftirqd/4      [kernel.kallsyms]             [k] memcmp
      7.07%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.57%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.60%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.44%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      2.69%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      2.29%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      1.51%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

- With prefetch, no long-compare

   3.7 Mpps

     25.02%  ksoftirqd/4      [kernel.kallsyms]             [k] mirred_list_lock+0x800000008052
     21.52%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     13.20%  ksoftirqd/4      [kernel.kallsyms]             [k] memcmp
      7.38%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.09%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.57%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.50%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      2.86%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      2.84%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      1.79%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

- No prefetch, with long-compare

   4.2 Mpps

     24.64%  ksoftirqd/4      [kernel.kallsyms]             [k] __this_module+0x800000008f47
     24.42%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
      6.91%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.04%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      3.53%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.14%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      3.13%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      2.34%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      1.76%  ksoftirqd/4      [kernel.kallsyms]             [k] key_equal
      1.37%  ksoftirqd/4      [kernel.kallsyms]             [k] zero_key+0x800000010e93

   NOTE: key_equal is called in place of memcmp.

- With prefetch, with long-compare

   4.6 Mpps

     26.68%  ksoftirqd/4      [kernel.kallsyms]             [k] mirred_list_lock+0x80000000a109
     22.37%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     10.79%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.74%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      4.09%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      3.97%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.79%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.09%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      2.45%  ksoftirqd/4      [kernel.kallsyms]             [k] key_equal
      1.91%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

Toshiaki Makita

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Toke Høiland-Jørgensen @ 2019-08-15 11:12 UTC (permalink / raw)
  To: Sridhar Samudrala, magnus.karlsson, bjorn.topel, netdev, bpf,
	sridhar.samudrala, intel-wired-lan, maciej.fijalkowski,
	tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

Sridhar Samudrala <sridhar.samudrala@intel.com> writes:

> This patch series introduces XDP_SKIP_BPF flag that can be specified
> during the bind() call of an AF_XDP socket to skip calling the BPF 
> program in the receive path and pass the buffer directly to the socket.
>
> When a single AF_XDP socket is associated with a queue and a HW
> filter is used to redirect the packets and the app is interested in
> receiving all the packets on that queue, we don't need an additional 
> BPF program to do further filtering or lookup/redirect to a socket.
>
> Here are some performance numbers collected on 
>   - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>   - Intel 40Gb Ethernet NIC (i40e)
>
> All tests use 2 cores and the results are in Mpps.
>
> turbo on (default)
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           21.9         38.5 
> l2fwd  zerocopy           17.0         20.5
> rxdrop copy               11.1         13.3
> l2fwd  copy                1.9          2.0
>
> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           15.4         29.0
> l2fwd  zerocopy           11.8         18.2
> rxdrop copy                8.2         10.5
> l2fwd  copy                1.7          1.7
> ---------------------------------------------

You're getting this performance boost by adding more code in the fast
path for every XDP program; so what's the performance impact of that for
cases where we do run an eBPF program?

Also, this is basically a special-casing of a particular deployment
scenario. Without a way to control RX queue assignment and traffic
steering, you're basically hard-coding a particular app's takeover of
the network interface; I'm not sure that is such a good idea...

-Toke

^ permalink raw reply

* [PATCH net-next] net: phy: read MII_CTRL1000 in genphy_read_status only if needed
From: Heiner Kallweit @ 2019-08-15 11:15 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

Value of MII_CTRL1000 is needed only if LPA_1000MSFAIL is set.
Therefore move reading this register.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy_device.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 54f80af31..4f4ddc05c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1794,7 +1794,7 @@ EXPORT_SYMBOL(genphy_update_link);
  */
 int genphy_read_status(struct phy_device *phydev)
 {
-	int adv, lpa, lpagb, err, old_link = phydev->link;
+	int lpa, lpagb, err, old_link = phydev->link;
 
 	/* Update the link, but return if there was an error */
 	err = genphy_update_link(phydev);
@@ -1816,11 +1816,12 @@ int genphy_read_status(struct phy_device *phydev)
 			if (lpagb < 0)
 				return lpagb;
 
-			adv = phy_read(phydev, MII_CTRL1000);
-			if (adv < 0)
-				return adv;
-
 			if (lpagb & LPA_1000MSFAIL) {
+				int adv = phy_read(phydev, MII_CTRL1000);
+
+				if (adv < 0)
+					return adv;
+
 				if (adv & CTL1000_ENABLE_MASTER)
 					phydev_err(phydev, "Master/Slave resolution failed, maybe conflicting manual settings?\n");
 				else
-- 
2.22.0


^ permalink raw reply related

* [PATCH net-next] net: phy: swphy: emulate register MII_ESTATUS
From: Heiner Kallweit @ 2019-08-15 11:19 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

When the genphy driver binds to a swphy it will call
genphy_read_abilites that will try to read MII_ESTATUS if BMSR_ESTATEN
is set in MII_BMSR. So far this would read the default value 0xffff
and 1000FD and 1000HD are reported as supported just by chance.
Better add explicit support for emulating MII_ESTATUS.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/swphy.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/phy/swphy.c b/drivers/net/phy/swphy.c
index dad22481d..53c214a22 100644
--- a/drivers/net/phy/swphy.c
+++ b/drivers/net/phy/swphy.c
@@ -22,6 +22,7 @@ struct swmii_regs {
 	u16 bmsr;
 	u16 lpa;
 	u16 lpagb;
+	u16 estat;
 };
 
 enum {
@@ -48,6 +49,7 @@ static const struct swmii_regs speed[] = {
 	[SWMII_SPEED_1000] = {
 		.bmsr  = BMSR_ESTATEN,
 		.lpagb = LPA_1000FULL | LPA_1000HALF,
+		.estat = ESTATUS_1000_TFULL | ESTATUS_1000_THALF,
 	},
 };
 
@@ -56,11 +58,13 @@ static const struct swmii_regs duplex[] = {
 		.bmsr  = BMSR_ESTATEN | BMSR_100HALF,
 		.lpa   = LPA_10HALF | LPA_100HALF,
 		.lpagb = LPA_1000HALF,
+		.estat = ESTATUS_1000_THALF,
 	},
 	[SWMII_DUPLEX_FULL] = {
 		.bmsr  = BMSR_ESTATEN | BMSR_100FULL,
 		.lpa   = LPA_10FULL | LPA_100FULL,
 		.lpagb = LPA_1000FULL,
+		.estat = ESTATUS_1000_TFULL,
 	},
 };
 
@@ -112,6 +116,7 @@ int swphy_read_reg(int reg, const struct fixed_phy_status *state)
 {
 	int speed_index, duplex_index;
 	u16 bmsr = BMSR_ANEGCAPABLE;
+	u16 estat = 0;
 	u16 lpagb = 0;
 	u16 lpa = 0;
 
@@ -125,6 +130,7 @@ int swphy_read_reg(int reg, const struct fixed_phy_status *state)
 	duplex_index = state->duplex ? SWMII_DUPLEX_FULL : SWMII_DUPLEX_HALF;
 
 	bmsr |= speed[speed_index].bmsr & duplex[duplex_index].bmsr;
+	estat |= speed[speed_index].estat & duplex[duplex_index].estat;
 
 	if (state->link) {
 		bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
@@ -151,6 +157,8 @@ int swphy_read_reg(int reg, const struct fixed_phy_status *state)
 		return lpa;
 	case MII_STAT1000:
 		return lpagb;
+	case MII_ESTATUS:
+		return estat;
 
 	/*
 	 * We do not support emulating Clause 45 over Clause 22 register
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-15 11:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814220545.co5pucyo5jk3weiv@ast-mbp.dhcp.thefacebook.com>

On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>
> > If eBPF is genuinely not usable by programs that are not fully trusted
> > by the admin, then no kernel changes at all are needed. Programs that
> > want to reduce their own privileges can easily fork() a privileged
> > subprocess or run a little helper to which they delegate BPF
> > operations. This is far more flexible than anything that will ever be
> > in the kernel because it allows the helper to verify that the rest of
> > the program is doing exactly what it's supposed to and restrict eBPF
> > operations to exactly the subset that is needed. So a container
> > manager or network manager that drops some provilege could have a
> > little bpf-helper that manages its BPF XDP, firewalling, etc
> > configuration. The two processes would talk over a socketpair.
>
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
>
> > The interesting cases you're talking about really do involved
> > unprivileged or less privileged eBPF, though. Let's see:
> > systemd --user: systemd --user is not privileged at all. There's no
> > issue of reducing privilege, since systemd --user doesn't have any
> > privilege to begin with. But systemd supports some eBPF features, and
> > presumably it would like to support them in the systemd --user case.
> > This is unprivileged eBPF.
>
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.
>
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.
>

systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
Granting them cap_bpf is the same as granting it to every other unprivileged user
process. Also unprivileged user process can start systemd --user process with any
command they like.

Jordan

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox