Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH v2] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Yidong Ren @ 2018-06-13 22:03 UTC (permalink / raw)
  To: Stephen Hemminger, Yidong Ren
  Cc: Stephen Hemminger, netdev@vger.kernel.org, Haiyang Zhang,
	linux-kernel@vger.kernel.org, devel@linuxdriverproject.org,
	David S. Miller
In-Reply-To: <20180613144813.708f26bf@xeon-e3>

> From: devel <driverdev-devel-bounces@linuxdriverproject.org> On Behalf
> Of Stephen Hemminger
> > +/* statistics per queue (rx/tx packets/bytes) */ #define
> > +NETVSC_PCPU_STATS_LEN (num_present_cpus() *
> ARRAY_SIZE(pcpu_stats))
> 
> Even though Hyper-V/Azure does not support hot plug cpu's it might be
> better to num_cpu_possible to avoid any possible future surprises.

That will create a very long output (num_cpu_possible = 128 on my machine) for ethtool,
While doesn't provide additional info.
num_present_cpus() would cause problem only if someone removed cpu 
between netvsc_get_sset_count() and netvsc_get_strings() and netvsc_get_ethtool_stats(). 

An alternative way could be: Check all stats, and only output if not zero. 
This need to be done in two pass. First pass to get the correct count, second pass to print the number.
Is there an elegant way to do this? 

^ permalink raw reply

* KASAN: slab-out-of-bounds Read in bpf_skb_change_head
From: syzbot @ 2018-06-13 22:17 UTC (permalink / raw)
  To: ast, daniel, davem, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16bd21af800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=a601a80fec461d44
dashboard link: https://syzkaller.appspot.com/bug?extid=567faa843005dda30737
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1039185f800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e85cff800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+567faa843005dda30737@syzkaller.appspotmail.com

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
==================================================================
BUG: KASAN: slab-out-of-bounds in ____bpf_skb_change_head  
net/core/filter.c:2921 [inline]
BUG: KASAN: slab-out-of-bounds in bpf_skb_change_head+0x80c/0x9d0  
net/core/filter.c:2917
Read of size 4 at addr ffff8801d94ea680 by task syz-executor991/4551

CPU: 0 PID: 4551 Comm: syz-executor991 Not tainted 4.17.0-rc7+ #38
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
  ____bpf_skb_change_head net/core/filter.c:2921 [inline]
  bpf_skb_change_head+0x80c/0x9d0 net/core/filter.c:2917

Allocated by task 0:
(stack is not available)

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at ffff8801d94ea580
  which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 24 bytes to the right of
  232-byte region [ffff8801d94ea580, ffff8801d94ea668)
The buggy address belongs to the page:
page:ffffea0007653a80 count:1 mapcount:0 mapping:ffff8801d94ea080 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801d94ea080 0000000000000000 000000010000000c
raw: ffffea0006b2e720 ffffea00076595e0 ffff8801d9450e40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801d94ea580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801d94ea600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff8801d94ea680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                    ^
  ffff8801d94ea700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801d94ea780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH v2] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Stephen Hemminger @ 2018-06-13 22:18 UTC (permalink / raw)
  To: Yidong Ren
  Cc: Yidong Ren, Stephen Hemminger, netdev@vger.kernel.org,
	Haiyang Zhang, linux-kernel@vger.kernel.org,
	devel@linuxdriverproject.org, David S. Miller
In-Reply-To: <SN6PR2101MB11353820325BBF4712076573847E0@SN6PR2101MB1135.namprd21.prod.outlook.com>

On Wed, 13 Jun 2018 22:03:34 +0000
Yidong Ren <Yidong.Ren@microsoft.com> wrote:

> > From: devel <driverdev-devel-bounces@linuxdriverproject.org> On Behalf
> > Of Stephen Hemminger  
> > > +/* statistics per queue (rx/tx packets/bytes) */ #define
> > > +NETVSC_PCPU_STATS_LEN (num_present_cpus() *  
> > ARRAY_SIZE(pcpu_stats))
> > 
> > Even though Hyper-V/Azure does not support hot plug cpu's it might be
> > better to num_cpu_possible to avoid any possible future surprises.  
> 
> That will create a very long output (num_cpu_possible = 128 on my machine) for ethtool,
> While doesn't provide additional info.
> num_present_cpus() would cause problem only if someone removed cpu 
> between netvsc_get_sset_count() and netvsc_get_strings() and netvsc_get_ethtool_stats(). 
> 
> An alternative way could be: Check all stats, and only output if not zero. 
> This need to be done in two pass. First pass to get the correct count, second pass to print the number.
> Is there an elegant way to do this? 

Ok, but there is a race between getting names and getting statistics.
If a cpu was added/removed then statistics would not match.

^ permalink raw reply

* Re: [RFC PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
From: Anchal Agarwal @ 2018-06-13 22:20 UTC (permalink / raw)
  To: Roger Pau Monn??
  Cc: jgross, len.brown, kamatam, eduval, vallish, netdev, fllinden,
	x86, rjw, linux-kernel, anchalag, cyberax, mingo, pavel, hpa,
	linux-pm, xen-devel, boris.ostrovsky, guruanb, tglx
In-Reply-To: <20180613082428.bjdko4k6cnq6eid3@mac>

Hi Roger,
To answer your question, due to the lack of mentioned commit
(commit 12ea729645ac ("xen/blkback: unmap all persistent grants when
frontend gets disconnected") in the older dom0 kernels(<3.2),resume from
hibernation can fail on guest side. In the absence of the commit,
Persistant Grants are not unmapped immediately when frontend is 
disconnected from backend and hence leave the block device in an 
inconsistent state. To avoid this unstability and work with larger set 
of kernel versions, this approach had been used. Once you don't have 
any pending req/resp it is safer for guest to resume from hibernation.

Thanks,
Anchal

On Wed, Jun 13, 2018 at 10:24:28AM +0200, Roger Pau Monn?? wrote:
> On Tue, Jun 12, 2018 at 08:56:13PM +0000, Anchal Agarwal wrote:
> > From: Munehisa Kamata <kamatam@amazon.com>
> > 
> > Add freeze and restore callbacks for PM suspend and hibernation support.
> > The freeze handler stops a block-layer queue and disconnect the frontend
> > from the backend while freeing ring_info and associated resources. The
> > restore handler re-allocates ring_info and re-connect to the backedend,
> > so the rest of the kernel can continue to use the block device
> > transparently.Also, the handlers are used for both PM
> > suspend and hibernation so that we can keep the existing suspend/resume
> > callbacks for Xen suspend without modification.
> > If a backend doesn't have commit 12ea729645ac ("xen/blkback: unmap all
> > persistent grants when frontend gets disconnected"), the frontend may see
> > massive amount of grant table warning when freeing resources.
> > 
> >  [   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> >  [   36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
> > 
> > In this case, persistent grants would need to be disabled.
> > 
> > Ensure no reqs/rsps in rings before disconnecting. When disconnecting
> > the frontend from the backend in blkfront_freeze(), there still may be
> > unconsumed requests or responses in the rings, especially when the
> > backend is backed by network-based device. If the frontend gets
> > disconnected with such reqs/rsps remaining there, it can cause
> > grant warnings and/or losing reqs/rsps by freeing pages afterward.
> 
> I'm not sure why having pending requests can cause grant warnings or
> lose of requests. If handled properly this shouldn't be an issue.
> Linux blkfront already does live migration (which also involves a
> reconnection of the frontend) with pending requests and that doesn't
> seem to be an issue.
> 
> > This can lead resumed kernel into unrecoverable state like unexpected
> > freeing of grant page and/or hung task due to the lost reqs or rsps.
> > Therefore we have to ensure that there is no unconsumed requests or
> > responses before disconnecting.
> 
> Given that we have multiqueue, plus multipage rings, I'm not sure
> waiting for the requests on the rings to complete is a good idea.
> 
> Why can't you just disconnect the frontend and requeue all the
> requests in flight? When the frontend connects on resume those
> requests will be queued again.
> 
> Thanks, Roger.
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_change_head
From: Daniel Borkmann @ 2018-06-13 23:01 UTC (permalink / raw)
  To: syzbot, ast, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <0000000000009920d3056e8d57a9@google.com>

On 06/14/2018 12:17 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
> git tree:       bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16bd21af800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a601a80fec461d44
> dashboard link: https://syzkaller.appspot.com/bug?extid=567faa843005dda30737
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1039185f800000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11e85cff800000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+567faa843005dda30737@syzkaller.appspotmail.com

#syz fix: bpf: reject passing modified ctx to helper functions

^ permalink raw reply

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Martin KaFai Lau @ 2018-06-13 23:26 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Wang Nan, Jiri Olsa, Namhyung Kim, Ingo Molnar
In-Reply-To: <20180612204126.GC22039@kernel.org>

On Tue, Jun 12, 2018 at 05:41:26PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jun 12, 2018 at 05:31:24PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, Jun 07, 2018 at 01:07:01PM -0700, Martin KaFai Lau escreveu:
> > > On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > So this must be available in a newer llvm version? Which one?
> > 
> > > I should have put in the details in my last email or
> > > in the commit message, my bad.
> >  
> > > 1. The tools/testing/selftests/bpf/Makefile has the CLANG_FLAGS and
> > >    LLC_FLAGS needed to compile the bpf prog.  It requires a new
> > >    "-mattr=dwarf" llc option which was added to the future
> > >    llvm 7.0.
> 
> > [root@jouet bpf]# pahole hello.o
> > struct clang version 5.0.1 (tags/RELEASE_501/final) {
> > 	clang version 5.0.1 (tags/RELEASE_501/final) clang version 5.0.1 (tags/RELEASE_501/final); /*     0     4 */
> > 	clang version 5.0.1 (tags/RELEASE_501/final) clang version 5.0.1 (tags/RELEASE_501/final); /*     4     4 */
> > 	clang version 5.0.1 (tags/RELEASE_501/final) clang version 5.0.1 (tags/RELEASE_501/final); /*     8     4 */
> > 	clang version 5.0.1 (tags/RELEASE_501/final) clang version 5.0.1 (tags/RELEASE_501/final); /*    12     4 */
> > 
> > 	/* size: 16, cachelines: 1, members: 4 */
> > 	/* last cacheline: 16 bytes */
> > };
> > [root@jouet bpf]# 
> > 
> > Ok, I guess I saw this case in the llvm/clang git logs, so this one was
> > generated with the older clang, will regenerate and add that "-mattr=dwarf"
> > part.
> 
> [root@jouet bpf]# pahole hello.o
> struct clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78e82f54be8fb0bb5f02e3ca674fbde10ef34) {
> 	clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78 clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78e82f54be8fb0bb5f02e3ca674fbde10ef34); /*     0     4 */
> 	clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78 clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78e82f54be8fb0bb5f02e3ca674fbde10ef34); /*     4     4 */
> 	clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78 clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78e82f54be8fb0bb5f02e3ca674fbde10ef34); /*     8     4 */
> 	clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78 clang version 7.0.0 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_clang.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=_Qzsu689xEjjl9JvYCvJsIZLZZKDLB6rM-Uc0gqQvyg&e= 8c873daccce7ee5339b9fd82c81fe02b73543b65) (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_git_llvm.git&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=4d495SlcvobgBOFahId75gM-V2su4Qq2wiLOGkU-adI&s=cFz6VP_YIYy_hubsx05WDqpTDyXl0Wnx_RAmAl1dbpg&e= 98c78e82f54be8fb0bb5f02e3ca674fbde10ef34); /*    12     4 */
> 
> 	/* size: 16, cachelines: 1, members: 4 */
> 	/* last cacheline: 16 bytes */
> };
That means the "-mattr=dwarf" is not effective.
Can you share your clang and llc command to create hello.o?

> [root@jouet bpf]#
> 
> Ideas?
> 
> [root@jouet bpf]# trace -e open*,hello.c
> clang-6.0: error: unknown argument: '-mattr=dwarf'
> ERROR:	unable to compile hello.c
> Hint:	Check error message shown above.
> Hint:	You can also pre-compile it into .o using:
>      		clang -target bpf -O2 -c hello.c
>      	with proper -I and -D options.
> event syntax error: 'hello.c'
>                      \___ Failed to load hello.c from source: Error when compiling BPF scriptlet
> 
> (add -v to see detail)
> Run 'perf list' for a list of valid events
> 
>  Usage: perf trace [<options>] [<command>]
>     or: perf trace [<options>] -- <command> [<options>]
>     or: perf trace record [<options>] [<command>]
>     or: perf trace record [<options>] -- <command> [<options>]
> 
>     -e, --event <event>   event/syscall selector. use 'perf list' to list available events
> [root@jouet bpf]#
> 
> [root@jouet bpf]# trace -v -e open*,hello.c
> bpf: builtin compilation failed: -95, try external compiler
> Kernel build dir is set to /lib/modules/4.17.0-rc5/build
> set env: KBUILD_DIR=/lib/modules/4.17.0-rc5/build
> unset env: KBUILD_OPTS
> include option is set to  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
> set env: NR_CPUS=4
> set env: LINUX_VERSION_CODE=0x41100
> set env: CLANG_EXEC=/usr/local/bin/clang
> set env: CLANG_OPTIONS=-g -mattr=dwarf
> set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
> set env: PERF_BPF_INC_OPTIONS=-I/home/acme/lib/include/perf/bpf
> set env: WORKING_DIR=/lib/modules/4.17.0-rc5/build
> set env: CLANG_SOURCE=/home/acme/bpf/hello.c
> llvm compiling command template: $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $KERNEL_INC_OPTIONS $PERF_BPF_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o -
> llvm compiling command : /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41100 -g -mattr=dwarf  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -I/home/acme/lib/include/perf/bpf -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.17.0-rc5/build -c /home/acme/bpf/hello.c -target bpf -O2 -o -
> clang-6.0: error: unknown argument: '-mattr=dwarf'
> ERROR:	unable to compile hello.c
> Hint:	Check error message shown above.
> Hint:	You can also pre-compile it into .o using:
>      		clang -target bpf -O2 -c hello.c
>      	with proper -I and -D options.
> event syntax error: 'hello.c'
>                      \___ Failed to load hello.c from source: Error when compiling BPF scriptlet
> 

^ permalink raw reply

* [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Xin Long @ 2018-06-13 23:37 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Neil Horman, davem, eric.dumazet

Now sctp GSO uses skb_gro_receive() to append the data into head
skb frag_list. However it actually only needs very few code from
skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
of its members are not needed here.

This patch is to add sctp_packet_gso_append() to build GSO frames
instead of skb_gro_receive(), and it would avoid many unnecessary
checks and make the code clearer.

Note that sctp will use page frags instead of frag_list to build
GSO frames in another patch. But it may take time, as sctp's GSO
frames may have different size. skb_segment() can only split it
into the frags with the same size, which would break the border
of sctp chunks.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/structs.h |  5 +++++
 net/sctp/output.c          | 28 ++++++++++++++++++----------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ebf809e..dbe1b91 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1133,6 +1133,11 @@ struct sctp_input_cb {
 };
 #define SCTP_INPUT_CB(__skb)	((struct sctp_input_cb *)&((__skb)->cb[0]))
 
+struct sctp_output_cb {
+	struct sk_buff *last;
+};
+#define SCTP_OUTPUT_CB(__skb)	((struct sctp_output_cb *)&((__skb)->cb[0]))
+
 static inline const struct sk_buff *sctp_gso_headskb(const struct sk_buff *skb)
 {
 	const struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
diff --git a/net/sctp/output.c b/net/sctp/output.c
index e672dee..7f849b0 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -409,6 +409,21 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	refcount_inc(&sk->sk_wmem_alloc);
 }
 
+static void sctp_packet_gso_append(struct sk_buff *head, struct sk_buff *skb)
+{
+	if (SCTP_OUTPUT_CB(head)->last == head)
+		skb_shinfo(head)->frag_list = skb;
+	else
+		SCTP_OUTPUT_CB(head)->last->next = skb;
+	SCTP_OUTPUT_CB(head)->last = skb;
+
+	head->truesize += skb->truesize;
+	head->data_len += skb->len;
+	head->len += skb->len;
+
+	__skb_header_release(skb);
+}
+
 static int sctp_packet_pack(struct sctp_packet *packet,
 			    struct sk_buff *head, int gso, gfp_t gfp)
 {
@@ -422,7 +437,7 @@ static int sctp_packet_pack(struct sctp_packet *packet,
 
 	if (gso) {
 		skb_shinfo(head)->gso_type = sk->sk_gso_type;
-		NAPI_GRO_CB(head)->last = head;
+		SCTP_OUTPUT_CB(head)->last = head;
 	} else {
 		nskb = head;
 		pkt_size = packet->size;
@@ -503,15 +518,8 @@ static int sctp_packet_pack(struct sctp_packet *packet,
 					 &packet->chunk_list);
 		}
 
-		if (gso) {
-			if (skb_gro_receive(&head, nskb)) {
-				kfree_skb(nskb);
-				return 0;
-			}
-			if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
-					 sk->sk_gso_max_segs))
-				return 0;
-		}
+		if (gso)
+			sctp_packet_gso_append(head, nskb);
 
 		pkt_count++;
 	} while (!list_empty(&packet->chunk_list));
-- 
2.1.0

^ permalink raw reply related

* [PULL v2] vhost, virtio
From: Michael S. Tsirkin @ 2018-06-13 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: ohad, kevin, kvm, mst, netdev, linux-remoteproc, linux-kernel,
	stable, bjorn.andersson, syzbot+87cfa083e727a224754b,
	virtualization

Page hints are reworked - I dropped them for now.

The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:

  Linux 4.17 (2018-06-03 14:15:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 2eb98105f8c7f4b867f7f714a998f5b8c1bb009b:

  virtio: update the comments for transport features (2018-06-12 04:59:29 +0300)

----------------------------------------------------------------
virtio, vhost: features, fixes

VF support for virtio.
DMA barriers for virtio strong barriers.
Bugfixes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (2):
      virtio_ring: switch to dma_XX barriers for rpmsg
      vhost: fix info leak due to uninitialized memory

Tiwei Bie (2):
      virtio_pci: support enabling VFs
      virtio: update the comments for transport features

 drivers/vhost/vhost.c              |  3 +++
 drivers/virtio/virtio_pci_common.c | 30 ++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 14 ++++++++++++++
 include/linux/virtio_ring.h        |  4 ++--
 include/uapi/linux/virtio_config.h | 16 ++++++++++++----
 5 files changed, 61 insertions(+), 6 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Marcelo Ricardo Leitner @ 2018-06-13 23:42 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem, eric.dumazet
In-Reply-To: <d41bb7dda9b5c176d5c0a23a8705744f49fcb570.1528933022.git.lucien.xin@gmail.com>

On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
> Now sctp GSO uses skb_gro_receive() to append the data into head
> skb frag_list. However it actually only needs very few code from
> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
> of its members are not needed here.
> 
> This patch is to add sctp_packet_gso_append() to build GSO frames
> instead of skb_gro_receive(), and it would avoid many unnecessary
> checks and make the code clearer.
> 
> Note that sctp will use page frags instead of frag_list to build
> GSO frames in another patch. But it may take time, as sctp's GSO
> frames may have different size. skb_segment() can only split it
> into the frags with the same size, which would break the border
> of sctp chunks.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

I cannot test it, but it looks good to me. Thanks Xin

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  include/net/sctp/structs.h |  5 +++++
>  net/sctp/output.c          | 28 ++++++++++++++++++----------
>  2 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index ebf809e..dbe1b91 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1133,6 +1133,11 @@ struct sctp_input_cb {
>  };
>  #define SCTP_INPUT_CB(__skb)	((struct sctp_input_cb *)&((__skb)->cb[0]))
>  
> +struct sctp_output_cb {
> +	struct sk_buff *last;
> +};
> +#define SCTP_OUTPUT_CB(__skb)	((struct sctp_output_cb *)&((__skb)->cb[0]))
> +
>  static inline const struct sk_buff *sctp_gso_headskb(const struct sk_buff *skb)
>  {
>  	const struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index e672dee..7f849b0 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -409,6 +409,21 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  	refcount_inc(&sk->sk_wmem_alloc);
>  }
>  
> +static void sctp_packet_gso_append(struct sk_buff *head, struct sk_buff *skb)
> +{
> +	if (SCTP_OUTPUT_CB(head)->last == head)
> +		skb_shinfo(head)->frag_list = skb;
> +	else
> +		SCTP_OUTPUT_CB(head)->last->next = skb;
> +	SCTP_OUTPUT_CB(head)->last = skb;
> +
> +	head->truesize += skb->truesize;
> +	head->data_len += skb->len;
> +	head->len += skb->len;
> +
> +	__skb_header_release(skb);
> +}
> +
>  static int sctp_packet_pack(struct sctp_packet *packet,
>  			    struct sk_buff *head, int gso, gfp_t gfp)
>  {
> @@ -422,7 +437,7 @@ static int sctp_packet_pack(struct sctp_packet *packet,
>  
>  	if (gso) {
>  		skb_shinfo(head)->gso_type = sk->sk_gso_type;
> -		NAPI_GRO_CB(head)->last = head;
> +		SCTP_OUTPUT_CB(head)->last = head;
>  	} else {
>  		nskb = head;
>  		pkt_size = packet->size;
> @@ -503,15 +518,8 @@ static int sctp_packet_pack(struct sctp_packet *packet,
>  					 &packet->chunk_list);
>  		}
>  
> -		if (gso) {
> -			if (skb_gro_receive(&head, nskb)) {
> -				kfree_skb(nskb);
> -				return 0;
> -			}
> -			if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
> -					 sk->sk_gso_max_segs))
> -				return 0;
> -		}
> +		if (gso)
> +			sctp_packet_gso_append(head, nskb);
>  
>  		pkt_count++;
>  	} while (!list_empty(&packet->chunk_list));
> -- 
> 2.1.0
>

^ permalink raw reply

* Re: [PATCH net-queue] i40e: Fix incorrect skb reserved size on rx
From: Toshiaki Makita @ 2018-06-14  0:25 UTC (permalink / raw)
  To: Daniel Borkmann, Jeff Kirsher; +Cc: John Fastabend, intel-wired-lan, netdev
In-Reply-To: <8ec63217-2aab-b240-ce95-530e5f1c1f15@iogearbox.net>

On 2018/06/13 18:06, Daniel Borkmann wrote:
> On 06/13/2018 10:08 AM, Toshiaki Makita wrote:
>> i40e_build_skb() reserves I40E_SKB_PAD + (xdp->data -
>> xdp->data_hard_start) but obviously I40E_SKB_PAD is unnecessary here
>> and mac_header/data feilds in skb becomes incorrect, and breaks normal
>> skb receive path as well as XDP receive path.
>>
>> Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> Thanks Toshiaki, I sent a complete fix yesterday here:
> 
> https://lkml.org/lkml/2018/6/12/843

Oh I was not aware of it. Thanks, let's go with your patch.

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH bpf] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  0:33 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <201806131752.fqIMeuHk%fengguang.wu@intel.com>

On 2018/06/13 18:27, kbuild test robot wrote:
> Hi Toshiaki,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on bpf/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Toshiaki-Makita/xdp-Fix-handling-of-devmap-in-generic-XDP/20180613-161204
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
> config: i386-randconfig-a1-201823 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>    In file included from net//bpf/test_run.c:7:0:
>>> include/linux/bpf.h:594:16: warning: 'struct sk_buff' declared inside parameter list
>             struct bpf_prog *xdp_prog)
>                    ^
>>> include/linux/bpf.h:594:16: warning: its scope is only this definition or declaration, which is probably not what you want
> 
> vim +594 include/linux/bpf.h
> 
>    591	
>    592	static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
>    593						   struct sk_buff *skb,
>  > 594						   struct bpf_prog *xdp_prog)
>    595	{
>    596		return 0;
>    597	}
>    598	

Ugh I did build test for the entire tree with CONFIG_BPF_SYSCALL, and
net/core and kernel/bpf without CONFIG_BPF_SYSCALL but not net/bpf.
I'll make sure to test the entire tree next. will send v2.

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Neil Horman @ 2018-06-14  0:46 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem,
	eric.dumazet
In-Reply-To: <d41bb7dda9b5c176d5c0a23a8705744f49fcb570.1528933022.git.lucien.xin@gmail.com>

On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
> Now sctp GSO uses skb_gro_receive() to append the data into head
> skb frag_list. However it actually only needs very few code from
> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
> of its members are not needed here.
> 
> This patch is to add sctp_packet_gso_append() to build GSO frames
> instead of skb_gro_receive(), and it would avoid many unnecessary
> checks and make the code clearer.
> 
> Note that sctp will use page frags instead of frag_list to build
> GSO frames in another patch. But it may take time, as sctp's GSO
> frames may have different size. skb_segment() can only split it
> into the frags with the same size, which would break the border
> of sctp chunks.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Do you have any performance numbers to compare with and without this patch?
Adding a function like this implies that any fixes that go into skb_gro_receive
now need to be evaluated for this function too, which means theres an implied
overhead in maintaining it.  If we're signing up for that, I'd like to know that
theres a significant performance benefit.

Neil

^ permalink raw reply

* [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14  0:53 UTC (permalink / raw)
  To: Tariq Toukan, Martin KaFai Lau; +Cc: netdev, Saeed Mahameed, Eric Dumazet

When a new rx packet arrives, the rx path will decide whether to reuse
the same page or not according to the current rx frag page offset and
frag size, i.e:
release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

Martin debugged this and he showed that this can cause mlx4 XDP to reuse
buffers when it shouldn't.

Using frag_info->frag_size is wrong since frag size is always the port
mtu and the frag stride might be larger, especially in XDP case where
frag_stride == PAGE_SIZE.

In XDP there is an assumption to have a page per packet and reuse can
break such assumption and might cause packet data corruptions, since in 
XDP frag_offset will always reset to the beginning of the page when
refilling the rx buffer.

Fix this by using the stride size rather than frag size in "release"
condition evaluation.

For non XDP setup this will yield the same behavior since frag_stride
already equals to ALIGN(frag_size, SMP_CACHE_BYTES) and on XDP setup the
"release" condition will always be true as it is supposed to be since
frag_stride == PAGE_SIZE.

Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
CC: Eric Dumazet <edumazet@google.com>

---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da..f63dde0288b7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -504,7 +504,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 			u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
 
 			frags->page_offset += sz_align;
-			release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
+			release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
 		}
 		if (release) {
 			dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
-- 
2.17.0

^ permalink raw reply related

* [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Andrei Vagin @ 2018-06-14  0:56 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Andrei Vagin, Maciej Żenczykowski, Eric Dumazet

The commit f396922d862a added a check to not allow changing
SO_REUSEADDR/SO_REUSEPORT on bound sockets. First, it doesn't
take into account that TCP_REPAIR changes SO_REUSEADDR. Second, now it
is impossible to restore a socket state and set SO_REUSEADDR,
because the kernel always sets SO_REUSEADDR into zero after disabling
the repair mode.

Currently, sk_reuse can have three values: SK_NO_REUSE, SK_CAN_REUSE,
SK_FORCE_REUSE. SK_CAN_REUSE is set by SOL_REUSEADDR.  SK_FORCE_REUSE is
used to ignore bind conflicts for sockets in the repair mode.

This patch makes sk->sk_reuse back into a boolean and adds
sk->sk_force_reuse to track SK_FORCE_REUSE separatly.

In this case, the tcp_repair mode doesn't affect sk_reuse and
it's value can be set before switching a socketn into the repair mode.

Fixes: f396922d862a ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---
 include/net/sock.h              | 13 ++++---------
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp.c                  |  4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..8ad19286ab9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -130,6 +130,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_force_reuse: ignore bind conflicts
  *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
@@ -174,7 +175,8 @@ struct sock_common {
 
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse:4;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_force_reuse:1;
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
 	unsigned char		skc_net_refcnt:1;
@@ -339,6 +341,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_force_reuse		__sk_common.skc_force_reuse
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
 #define sk_net_refcnt		__sk_common.skc_net_refcnt
@@ -502,16 +505,8 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
-/*
- * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
- * or not whether his port will be reused by someone else. SK_FORCE_REUSE
- * on a socket means that the socket will reuse everybody else's port
- * without looking at the other's sk_reuse value.
- */
-
 #define SK_NO_REUSE	0
 #define SK_CAN_REUSE	1
-#define SK_FORCE_REUSE	2
 
 int sk_set_peek_off(struct sock *sk, int val);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 33a88e045efd..2ac1c591b60c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,7 +306,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		goto fail_unlock;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (sk->sk_reuse == SK_FORCE_REUSE)
+		if (sk->sk_force_reuse)
 			goto success;
 
 		if ((tb->fastreuse > 0 && reuse) ||
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba..70bfdd5a2fc4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2810,11 +2810,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EPERM;
 		else if (val == 1) {
 			tp->repair = 1;
-			sk->sk_reuse = SK_FORCE_REUSE;
+			sk->sk_force_reuse = 1;
 			tp->repair_queue = TCP_NO_QUEUE;
 		} else if (val == 0) {
 			tp->repair = 0;
-			sk->sk_reuse = SK_NO_REUSE;
+			sk->sk_force_reuse = 0;
 			tcp_send_window_probe(sk);
 		} else
 			err = -EINVAL;
-- 
2.17.0

^ permalink raw reply related

* Re: [bpf PATCH 4/6] bpf: sockmap, tcp_disconnect to listen transition
From: Martin KaFai Lau @ 2018-06-14  0:56 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180613175014.4264.84395.stgit@john-Precision-Tower-5810>

On Wed, Jun 13, 2018 at 10:50:14AM -0700, John Fastabend wrote:
> After adding checks to ensure TCP is in ESTABLISHED state when a
> sock is added we need to also ensure that user does not transition
> through tcp_disconnect() and back into ESTABLISHED state without
> sockmap removing the sock.
> 
> To do this add unhash hook and remove sock from map there.
In bpf_tcp_init():
        sk->sk_prot = &tcp_bpf_proto;

I may have missed a lock while reading sockmap.c.
Is it possible that tcp_disconnect() is being called while
the above assignment is also being done (e.g. through BPF_MAP_UPDATE_ELEM)?
The same situation go for the ESTABLISHED check.

> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  kernel/bpf/sockmap.c |   67 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 50 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 2e848cd..91c7b47 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -130,6 +130,7 @@ struct smap_psock {
>  
>  	struct proto *sk_proto;
>  	void (*save_close)(struct sock *sk, long timeout);
> +	void (*save_unhash)(struct sock *sk);
>  	void (*save_data_ready)(struct sock *sk);
>  	void (*save_write_space)(struct sock *sk);
>  };
> @@ -141,6 +142,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>  			    int offset, size_t size, int flags);
>  static void bpf_tcp_close(struct sock *sk, long timeout);
> +static void bpf_tcp_unhash(struct sock *sk);
>  
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
> @@ -182,6 +184,7 @@ static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
>  {
>  	prot[SOCKMAP_BASE]			= *base;
>  	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
> +	prot[SOCKMAP_BASE].unhash		= bpf_tcp_unhash;
>  	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
>  	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
>  
> @@ -215,6 +218,7 @@ static int bpf_tcp_init(struct sock *sk)
>  	}
>  
>  	psock->save_close = sk->sk_prot->close;
> +	psock->save_unhash = sk->sk_prot->unhash;
>  	psock->sk_proto = sk->sk_prot;
>  
>  	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
> @@ -302,28 +306,12 @@ struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
>  	return e;
>  }
>  
> -static void bpf_tcp_close(struct sock *sk, long timeout)
> +static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
>  {
> -	void (*close_fun)(struct sock *sk, long timeout);
>  	struct smap_psock_map_entry *e;
>  	struct sk_msg_buff *md, *mtmp;
> -	struct smap_psock *psock;
>  	struct sock *osk;
>  
> -	rcu_read_lock();
> -	psock = smap_psock_sk(sk);
> -	if (unlikely(!psock)) {
> -		rcu_read_unlock();
> -		return sk->sk_prot->close(sk, timeout);
> -	}
> -
> -	/* The psock may be destroyed anytime after exiting the RCU critial
> -	 * section so by the time we use close_fun the psock may no longer
> -	 * be valid. However, bpf_tcp_close is called with the sock lock
> -	 * held so the close hook and sk are still valid.
> -	 */
> -	close_fun = psock->save_close;
> -
>  	if (psock->cork) {
>  		free_start_sg(psock->sock, psock->cork);
>  		kfree(psock->cork);
> @@ -378,6 +366,51 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>  		}
>  		e = psock_map_pop(sk, psock);
>  	}
> +}
> +
> +static void bpf_tcp_unhash(struct sock *sk)
> +{
> +	void (*unhash_fun)(struct sock *sk);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->unhash(sk);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
> +	unhash_fun = psock->save_unhash;
> +	bpf_tcp_remove(sk, psock);
> +	rcu_read_unlock();
> +	unhash_fun(sk);
> +
> +}
> +
> +static void bpf_tcp_close(struct sock *sk, long timeout)
> +{
> +	void (*close_fun)(struct sock *sk, long timeout);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->close(sk, timeout);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
> +	close_fun = psock->save_close;
> +	bpf_tcp_remove(sk, psock);
>  	rcu_read_unlock();
>  	close_fun(sk, timeout);
>  }
> 

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14  1:02 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Jason Wang, Alexander Duyck, virtio-dev,
	qemu-devel, Jiri Pirko, Jakub Kicinski, Netdev, Brandeburg, Jesse,
	virtualization, aaron.f.brown
In-Reply-To: <b9c88515-cf41-2a9a-078e-9c9e5adbbf14@intel.com>

On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>>>
>>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>>> device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>>>> feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>>>> via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>
>>>>>> So I do not think we can commit to this interface: we
>>>>>> really need to control visibility of the primary device.
>>>>>
>>>>> The problem is legacy guest won't use primary device at all if we do
>>>>> this.
>>>>
>>>> And that's by design - I think it's the only way to ensure the
>>>> legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio.  But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.

Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all hot plug/unplug logic.
The management tool can supply passthrough device and virtio with the
same group UUID, QEMU auto-manages the presence of the primary, and
hot plug the device as needed before or after the migration.

-Siwei

>
>
>>
>>
>>>>> How about control the visibility of standby device?
>>>>>
>>>>> Thanks
>>>>
>>>> standy the always there to guarantee no downtime.
>>>>
>>>>>> However just for testing purposes, we could add a non-stable
>>>>>> interface "x-standby" with the understanding that as any
>>>>>> x- prefix it's unstable and will be changed down the road,
>>>>>> likely in the next release.
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>>     2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>>                          true),
>>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> +                      false),
>>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>>     };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                                          * Steering */
>>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23       /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for
>>>>>>> another device
>>>>>>> +                                         * with the same MAC.
>>>>>>> +                                         */
>>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63        /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>>> --
>>>>>>> 2.14.3
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
>

^ permalink raw reply

* Re: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Maciej Żenczykowski @ 2018-06-14  1:10 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David S. Miller, Linux NetDev, Eric Dumazet
In-Reply-To: <20180614005606.1057-1-avagin@openvz.org>

>  #define SK_NO_REUSE    0
>  #define SK_CAN_REUSE   1

since it's a boolean now these should go away too I believe.

should there simply/also be a separate privileged socket option to
set/get force reuse?

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Xin Long @ 2018-06-14  1:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem,
	Eric Dumazet
In-Reply-To: <20180614004643.GA12409@neilslaptop.think-freely.org>

On Thu, Jun 14, 2018 at 8:46 AM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
>> Now sctp GSO uses skb_gro_receive() to append the data into head
>> skb frag_list. However it actually only needs very few code from
>> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
>> of its members are not needed here.
>>
>> This patch is to add sctp_packet_gso_append() to build GSO frames
>> instead of skb_gro_receive(), and it would avoid many unnecessary
>> checks and make the code clearer.
>>
>> Note that sctp will use page frags instead of frag_list to build
>> GSO frames in another patch. But it may take time, as sctp's GSO
>> frames may have different size. skb_segment() can only split it
>> into the frags with the same size, which would break the border
>> of sctp chunks.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> Do you have any performance numbers to compare with and without this patch?
> Adding a function like this implies that any fixes that go into skb_gro_receive
> now need to be evaluated for this function too, which means theres an implied
> overhead in maintaining it.  If we're signing up for that, I'd like to know that
> theres a significant performance benefit.
Hi Neil,

I don't think there's a noticeable performance benefit since it's
just avoided some checks and variables settings.

The new function makes SCTP GSO code clearer and readable,
as skb_gro_receive() should only be used in the GRO code paths,
it's confusing in sctp tx path.

We're doing this, actually because skb_gro_receive() is being
changed now, it would not be suitable for SCTP GSO, see:
https://www.spinics.net/lists/netdev/msg507716.html

And also, we believe page frags will be used to replace frag_list
to build SCTP GSO frames soon. After that, this function will also
be dropped.

^ permalink raw reply

* Re: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Maciej Żenczykowski @ 2018-06-14  1:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrey Vagin, David Miller, netdev
In-Reply-To: <CANn89iJ6Yj8VPBQHsJcEcJUePRJz6NpZrL3OenSfiYccKDbDFA@mail.gmail.com>

> Hi Andrey
>
> This commit was reverted, do we still need  this patch ?

I think it still makes things easier to understand...

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: David Miller @ 2018-06-14  2:05 UTC (permalink / raw)
  To: nhorman; +Cc: lucien.xin, netdev, linux-sctp, marcelo.leitner, eric.dumazet
In-Reply-To: <20180614004643.GA12409@neilslaptop.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 13 Jun 2018 20:46:43 -0400

> Do you have any performance numbers to compare with and without this
> patch?  Adding a function like this implies that any fixes that go
> into skb_gro_receive now need to be evaluated for this function too,
> which means theres an implied overhead in maintaining it.  If we're
> signing up for that, I'd like to know that theres a significant
> performance benefit.

Neil, I asked Xin and Marcelo to do this.

There is no reason for GSO code to use a GRO helper.

And this is, in particular, blocking some skb_gro_receive() surgery
I plan to perform.

^ permalink raw reply

* [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  2:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer

Commit 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue") changed
the return value type of __devmap_lookup_elem() from struct net_device *
to struct bpf_dtab_netdev * but forgot to modify generic XDP code
accordingly.
Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
net_device is expected, then skb->dev was set to invalid value.

v2:
- Fix compiler warning without CONFIG_BPF_SYSCALL.

Fixes: 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/bpf.h    | 12 ++++++++++++
 include/linux/filter.h | 16 ++++++++++++++++
 kernel/bpf/devmap.c    | 14 ++++++++++++++
 net/core/filter.c      | 21 ++++-----------------
 4 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 995c3b1..7df32a3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -488,12 +488,15 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 
 /* Map specifics */
 struct xdp_buff;
+struct sk_buff;
 
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
+			     struct bpf_prog *xdp_prog);
 
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
@@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+struct sk_buff;
+
+static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
+					   struct sk_buff *skb,
+					   struct bpf_prog *xdp_prog)
+{
+	return 0;
+}
+
 static inline
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
 {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..8ddff1f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
+#include <linux/if_vlan.h>
 
 #include <net/sch_generic.h>
 
@@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
+					   struct net_device *fwd)
+{
+	unsigned int len;
+
+	if (unlikely(!(fwd->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
+	if (skb->len > len)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a7cc7b3..642c97f 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -345,6 +345,20 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return bq_enqueue(dst, xdpf, dev_rx);
 }
 
+int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
+			     struct bpf_prog *xdp_prog)
+{
+	int err;
+
+	err = __xdp_generic_ok_fwd_dev(skb, dst->dev);
+	if (unlikely(err))
+		return err;
+	skb->dev = dst->dev;
+	generic_xdp_tx(skb, xdp_prog);
+
+	return 0;
+}
+
 static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d9ba7e..e7f12e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3214,20 +3214,6 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
-static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
-{
-	unsigned int len;
-
-	if (unlikely(!(fwd->flags & IFF_UP)))
-		return -ENETDOWN;
-
-	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
-	if (skb->len > len)
-		return -EMSGSIZE;
-
-	return 0;
-}
-
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3256,10 +3242,11 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 	}
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
-		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
+		struct bpf_dtab_netdev *dst = fwd;
+
+		err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		if (unlikely(err))
 			goto err;
-		skb->dev = fwd;
-		generic_xdp_tx(skb, xdp_prog);
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 		struct xdp_sock *xs = fwd;
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Eric Dumazet @ 2018-06-14  2:30 UTC (permalink / raw)
  To: Saeed Mahameed, Tariq Toukan, Martin KaFai Lau; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180614005309.17357-1-saeedm@mellanox.com>



On 06/13/2018 05:53 PM, Saeed Mahameed wrote:
> When a new rx packet arrives, the rx path will decide whether to reuse
> the same page or not according to the current rx frag page offset and
> frag size, i.e:
> release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> 
> Martin debugged this and he showed that this can cause mlx4 XDP to reuse
> buffers when it shouldn't.
> 
> Using frag_info->frag_size is wrong since frag size is always the port
> mtu and the frag stride might be larger, especially in XDP case where
> frag_stride == PAGE_SIZE.

Hmmm... why frag_size is not PAGE_SIZE for XDP then ?

This patch is a bit strange, since we do :

u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);

frags->page_offset += sz_align;
release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

So the @release condition is really to have enough space from frags->page_offset
to hold up to frag_info->frag_size bytes.

(NIC wont push more than frag_info->frag_size bytes, regardless of how big is out frag_stride )

Your patch would prevent a page being reused if say the amount of available bytes is 1536,
but frag_stride = 2048


I would suggest a patch like the following instead.

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5edd0cf2999cbde37b3404aafd700584696af940..83d6b17b102bc9a22bfd8e68863d079f38670501 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1082,7 +1082,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
         * This only works when num_frags == 1.
         */
        if (priv->tx_ring_num[TX_XDP]) {
-               priv->frag_info[0].frag_size = eff_mtu;
+               priv->frag_info[0].frag_size = PAGE_SIZE;
                /* This will gain efficient xdp frame recycling at the
                 * expense of more costly truesize accounting
                 */

^ permalink raw reply related

* Re: [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Y Song @ 2018-06-14  2:56 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <1528942062-2353-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

On Wed, Jun 13, 2018 at 7:07 PM, Toshiaki Makita
<makita.toshiaki@lab.ntt.co.jp> wrote:
> Commit 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue") changed
> the return value type of __devmap_lookup_elem() from struct net_device *
> to struct bpf_dtab_netdev * but forgot to modify generic XDP code
> accordingly.
> Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
> net_device is expected, then skb->dev was set to invalid value.
>
> v2:
> - Fix compiler warning without CONFIG_BPF_SYSCALL.
>
> Fixes: 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue")
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  include/linux/bpf.h    | 12 ++++++++++++
>  include/linux/filter.h | 16 ++++++++++++++++
>  kernel/bpf/devmap.c    | 14 ++++++++++++++
>  net/core/filter.c      | 21 ++++-----------------
>  4 files changed, 46 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 995c3b1..7df32a3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -488,12 +488,15 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>
>  /* Map specifics */
>  struct xdp_buff;
> +struct sk_buff;
>
>  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __dev_map_flush(struct bpf_map *map);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>                     struct net_device *dev_rx);
> +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
> +                            struct bpf_prog *xdp_prog);
>
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
> @@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>         return 0;
>  }
>
> +struct sk_buff;
> +
> +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> +                                          struct sk_buff *skb,
> +                                          struct bpf_prog *xdp_prog)
> +{
> +       return 0;

should you return an error code here?

> +}
> +
>  static inline
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 45fc0f5..8ddff1f 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -19,6 +19,7 @@
>  #include <linux/cryptohash.h>
>  #include <linux/set_memory.h>
>  #include <linux/kallsyms.h>
> +#include <linux/if_vlan.h>
>
>  #include <net/sch_generic.h>
>
> @@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
>
> +static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
> +                                          struct net_device *fwd)

Previously this function is only used in filter.c and now it is only
used in devmap.c. Maybe this function should be in devmap.c
until it will be used cross different files?

> +{
> +       unsigned int len;
> +
> +       if (unlikely(!(fwd->flags & IFF_UP)))
> +               return -ENETDOWN;
> +
> +       len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
> +       if (skb->len > len)
> +               return -EMSGSIZE;
> +
> +       return 0;
> +}
> +
>  /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>   * same cpu context. Further for best results no more than a single map
>   * for the do_redirect/do_flush pair should be used. This limitation is
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index a7cc7b3..642c97f 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -345,6 +345,20 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>         return bq_enqueue(dst, xdpf, dev_rx);
>  }
>
> +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
> +                            struct bpf_prog *xdp_prog)
> +{
> +       int err;
> +
> +       err = __xdp_generic_ok_fwd_dev(skb, dst->dev);
> +       if (unlikely(err))
> +               return err;
> +       skb->dev = dst->dev;
> +       generic_xdp_tx(skb, xdp_prog);
> +
> +       return 0;
> +}
> +
>  static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>         struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3d9ba7e..e7f12e9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3214,20 +3214,6 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>
> -static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
> -{
> -       unsigned int len;
> -
> -       if (unlikely(!(fwd->flags & IFF_UP)))
> -               return -ENETDOWN;
> -
> -       len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
> -       if (skb->len > len)
> -               return -EMSGSIZE;
> -
> -       return 0;
> -}
> -
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>                                        struct sk_buff *skb,
>                                        struct xdp_buff *xdp,
> @@ -3256,10 +3242,11 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>         }
>
>         if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> -               if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
> +               struct bpf_dtab_netdev *dst = fwd;
> +
> +               err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +               if (unlikely(err))
>                         goto err;
> -               skb->dev = fwd;
> -               generic_xdp_tx(skb, xdp_prog);
>         } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>                 struct xdp_sock *xs = fwd;
>
> --
> 1.8.3.1
>
>

^ permalink raw reply

* Re: [PATCH iproute2-next v2] ip-xfrm: Add support for OUTPUT_MARK
From: David Ahern @ 2018-06-14  3:39 UTC (permalink / raw)
  To: Lorenzo Colitti, Subash Abhinov Kasiviswanathan
  Cc: netdev, Stephen Hemminger, Steffen Klassert
In-Reply-To: <CAKD1Yr119qtuabrPL=MbVGeUgRykTeqCmjvCb1buAQU2yZCKjw@mail.gmail.com>

On 6/12/18 9:14 PM, Lorenzo Colitti wrote:
> On Wed, Jun 13, 2018 at 3:48 AM Subash Abhinov Kasiviswanathan
> <subashab@codeaurora.org> wrote:
>>
>> src 192.168.1.1 dst 192.168.1.2
>>         proto esp spi 0x00004321 reqid 0 mode tunnel
>>         replay-window 0 flag af-unspec
>>         mark 0x10000/0x3ffff
>>         output-mark 0x20000
> 
> Nit: I don't know what guarantees we provide (if any) that the output
> format of "ip xfrm state" does not change except to add new lines at
> the end. Personally, I feel that an app or script that depends on
> "auth-trunc" (or anything else, really) being on the line immediately
> after "mark" is brittle and should be fixed. This is particularly true
> since in general between the mark and the encryption there might be an
> auth-trunc line, or an auth line, or neither. As such, adding this
> line here seems OK to me.

any reason to put output-mark on its own line? Why not
	mark 0x10000/0x3ffff output-mark 0x20000

is the documentation clear on the difference between mark and output-mark?

> 
>> @@ -61,6 +61,7 @@ static void usage(void)
>>         fprintf(stderr, "        [ flag FLAG-LIST ] [ sel SELECTOR ] [ LIMIT-LIST ] [ encap ENCAP ]\n");
>>         fprintf(stderr, "        [ coa ADDR[/PLEN] ] [ ctx CTX ] [ extra-flag EXTRA-FLAG-LIST ]\n");
>>         fprintf(stderr, "        [ offload [dev DEV] dir DIR ]\n");
>> +       fprintf(stderr, "        [ output-mark OUTPUT-MARK]\n");
> 
> Nit: I think you want a space between OUTPUT-MARK and ].

yes.

> 
> Other than that,
> 
> Acked-by: Lorenzo Colitti <lorenzo@google.com>
> 

^ permalink raw reply

* Re: [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  3:40 UTC (permalink / raw)
  To: Y Song; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <CAH3MdRX7NdAdmEp-zmdR1tY6QZTm3yDx7j+wHoFxT4soOM=zRQ@mail.gmail.com>

On 2018/06/14 11:56, Y Song wrote:
...
>> @@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>>         return 0;
>>  }
>>
>> +struct sk_buff;
>> +
>> +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
>> +                                          struct sk_buff *skb,
>> +                                          struct bpf_prog *xdp_prog)
>> +{
>> +       return 0;
> 
> should you return an error code here?

My understanding is that this function never be called if
CONFIG_BPF_SYSCALL is not set so any value is OK.
I aligned it with other functions of devmap, specifically
dev_map_enqueue() which returns 0.

> 
>> +}
>> +
>>  static inline
>>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>>  {
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 45fc0f5..8ddff1f 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/cryptohash.h>
>>  #include <linux/set_memory.h>
>>  #include <linux/kallsyms.h>
>> +#include <linux/if_vlan.h>
>>
>>  #include <net/sch_generic.h>
>>
>> @@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
>>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>>                                        const struct bpf_insn *patch, u32 len);
>>
>> +static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
>> +                                          struct net_device *fwd)
> 
> Previously this function is only used in filter.c and now it is only
> used in devmap.c. Maybe this function should be in devmap.c
> until it will be used cross different files?

This function is also called from xdp_do_generic_redirect() in
net/core/filter.c, so I can't move it to devmap.c.

Thanks,
Toshiaki Makita

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox